ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Takuya UESHIN	2a56cc36ca	[SPARK-35761][PYTHON] Use type-annotation based pandas_udf or avoid specifying udf types to suppress warnings ### What changes were proposed in this pull request? Modify the `pandas_udf` usage to use type-annotation based pandas_udf or avoid specifying udf types to suppress warnings. ### Why are the changes needed? The usage of `pandas_udf` in pandas-on-Spark is outdated and shows warnings. We should use type-annotation based `pandas_udf` or avoid specifying udf types. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #32913 from ueshin/issues/SPARK-35761/suppress_warnings. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-15 11:17:56 +09:00
Hyukjin Kwon	95f36e76c6	[SPARK-35750][PYTHON][DOCS] Rename "pandas APIs on Spark" to "pandas API on Spark" ### What changes were proposed in this pull request? This PR proposes to rename "pandas APIs on Spark" to "pandas API on Spark" which is more natural (since API stands for Application Program Interface). ### Why are the changes needed? To make it sound more natural. ### Does this PR introduce _any_ user-facing change? It fixes a typo in the unreleased changes. ### How was this patch tested? N/A Closes #32903 from HyukjinKwon/SPARK-34885. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-15 10:01:04 +09:00
Takuya UESHIN	ef7545b788	[SPARK-35759][PYTHON] Remove the upperbound for numpy for pandas-on-Spark ### What changes were proposed in this pull request? Removes the upperbound for numpy for pandas-on-Spark. ### Why are the changes needed? We can remove the upper-bound for numpy for pandas-on-Spark because currently it works well on the CI with numpy 1.20.3. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #32908 from ueshin/issues/SPARK-35759/numpy. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-15 09:59:05 +09:00
Xinrong Meng	03756618fc	[SPARK-35616][PYTHON] Make `astype` method data-type-based ### What changes were proposed in this pull request? Make `astype` method data-type-based. Non-goal: Match pandas' `astype` TypeErrors. Currently, `astype` throws TypeError error messages only when the destination type is not recognized. However, for some destination types that don't make sense to the specific type of Series/Index, for example, `numeric Series/Index → bytes`, we don't have proper TypeError error messages. Since the goal of the PR is refactoring mainly, the above issue might be resolved later if needed. ### Why are the changes needed? There are many type checks in the `astype` method. Since `DataTypeOps` and its subclasses are introduced, we should refactor `astype` to make it data-type-based. In this way, code is cleaner, more maintainable, and more flexible. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit tests. Closes #32847 from xinrong-databricks/datatypeops_astype. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-06-14 16:33:15 -07:00
Hyukjin Kwon	76e08a8e3d	[SPARK-35738][PYTHON] Support 'y' properly in DataFrame with non-numeric columns with plots ### What changes were proposed in this pull request? This PR proposes to port the fix https://github.com/databricks/koalas/pull/2172. ```python ks.DataFrame({'a': [1, 2, 3], 'b':["a", "b", "c"], 'c': [4, 5, 6]}).plot(kind='hist', x='a', y='c', bins=200) ``` Before: ``` pyspark.sql.utils.AnalysisException: cannot resolve 'least(min(a), min(b), min(c))' due to data type mismatch: The expressions should all have the same type, got LEAST(bigint, string, bigint).; 'Aggregate [unresolvedalias(least(min(a#1L), min(b#2), min(c#3L)), Some(org.apache.spark.sql.Column$$Lambda$1556/0x0000000800d9484042fb0cc1)), unresolvedalias(greatest(max(a#1L), max(b#2), max(c#3L)), Some(org.apache.spark.sql.Column$$Lambda$1556/0x0000000800d9484042fb0cc1))] +- Project [a#1L, b#2, c#3L] +- Project [__index_level_0__#0L, a#1L, b#2, c#3L, monotonically_increasing_id() AS __natural_order__#8L] +- LogicalRDD [__index_level_0__#0L, a#1L, b#2, c#3L], false ``` After: ```python Figure({ 'data': [{'hovertemplate': 'variable=a<br>value=%{text}<br>count=%{y}', 'name': 'a', ... ``` ### Why are the changes needed? To match the behaviour with pandas' and allow users to set `x` and `y` in the DataFrame with non-numeric columns. ### Does this PR introduce _any_ user-facing change? No to end users since the change is not released yet. Yes to dev as described before. ### How was this patch tested? Manually tested, added a test and tested in notebooks: ![Screen Shot 2021-06-11 at 9 11 25 PM](https://user-images.githubusercontent.com/6477701/121686038-a47a1b80-cafb-11eb-8f8e-8d968db7ebef.png) ![Screen Shot 2021-06-11 at 9 48 58 PM](https://user-images.githubusercontent.com/6477701/121688858-e22c7380-cafe-11eb-9d0a-adcbe560030f.png) Closes #32884 from HyukjinKwon/fix-hist-plot. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-12 14:36:46 +09:00
Takuya UESHIN	4d21b94d13	[SPARK-35475][PYTHON] Fix disallow_untyped_defs mypy checks ### What changes were proposed in this pull request? Adds more type annotations in the file `python/pyspark/pandas/namespace.py` and fixes the mypy check failures. ### Why are the changes needed? We should enable more disallow_untyped_defs mypy checks. ### Does this PR introduce _any_ user-facing change? Yes. This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users. ### How was this patch tested? The mypy check with a new configuration and existing tests should pass. Closes #32871 from ueshin/issues/SPARK-35475/disallow_untyped_defs. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-06-11 11:07:11 -07:00
itholic	ebe529e8e1	[SPARK-35591][PYTHON][DOCS] Rename "Koalas" to "pandas API on Spark" in the documents ### What changes were proposed in this pull request? This PR proposes to change the name "Koalas" to the "Pandas APIs on Spark" in the documents. ### Why are the changes needed? Since we don't use the name "Koalas" anymore, we should use "Pandas APIs on Spark" instead. ### Does this PR introduce _any_ user-facing change? Yes, the name "Koalas" is renamed to "Pandas APIs on Spark" in the documents. ### How was this patch tested? Manually built the docs and checked one by one. Closes #32835 from itholic/SPARK-35591. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-11 20:42:38 +09:00
Kevin Su	cadd3a0588	[SPARK-35474] Enable disallow_untyped_defs mypy check for pyspark.pandas.indexing ### What changes were proposed in this pull request? Adds more type annotations in the file: `python/pyspark/pandas/spark/indexing.py` and fixes the mypy check failures. ### Why are the changes needed? We should enable more disallow_untyped_defs mypy checks. ### Does this PR introduce _any_ user-facing change? Yes. This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users. ### How was this patch tested? The mypy check with a new configuration and existing tests should pass. `./dev/lint-python` Closes #32738 from pingsutw/SPARK-35474. Authored-by: Kevin Su <pingsutw@apache.org> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-06-09 22:35:12 -07:00
Xinrong Meng	e9d60156c4	[SPARK-35705][PYTHON] Adjust pandas-on-spark `test_groupby_multiindex_columns` test for different pandas versions ### What changes were proposed in this pull request? Adjust pandas-on-spark test_groupby_multiindex_columns test in order to pass with different pandas versions. ### Why are the changes needed? pandas had introduced bugs as below: - For pandas 1.1.3 and 1.1.4 Type error: only integer scalar arrays can be converted to a scalar index - For pandas < 1.0.4 Type error: Can only tuple-index with a MultiIndex We ought to adjust `test_groupby_multiindex_columns` tests by comparing with a predefined return value, rather than comparing with the pandas return value in the pandas versions above. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit tests. Closes #32851 from xinrong-databricks/SPARK-35705. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-10 10:36:19 +09:00
Xinrong Meng	3c66c11aa6	[SPARK-35601][PYTHON] Complete arithmetic operators involving bool literals, Series, and Index ### What changes were proposed in this pull request? Completing arithmetic operators involving bool literals, Series, and Index consists of two main tasks: - Support arithmetic operations against bool literals - Support operators (+, *) between bool Series/Indexes. ### Why are the changes needed? Arithmetic operators involving bool literals, Series, and Index are incomplete now. We ought to match pandas' behaviors. ### Does this PR introduce _any_ user-facing change? Yes. Newly supported operations example: ```py >>> ps.Series([1, 2, 3]) + True 0 2 1 3 2 4 dtype: int64 >>> ps.Series([1, 2, 3]) + False 0 1 1 2 2 3 dtype: int64 >>> ps.Series([True, False, True]) + True 0 True 1 True 2 True dtype: bool >>> ps.Series([True, False, True]) + False 0 True 1 False 2 True dtype: bool >>> ps.Series([True, False, True]) * True 0 True 1 False 2 True dtype: bool >>> ps.Series([True, False, True]) * False 0 False 1 False 2 False dtype: bool >>> ps.set_option('compute.ops_on_diff_frames', True) >>> ps.Series([True, True, False]) + ps.Series([True, False, True]) 0 True 1 True 2 True dtype: bool >>> ps.Series([True, True, False]) * ps.Series([True, False, True]) 0 True 1 False 2 False dtype: bool ``` Before the change, operations above are not supported, raising a TypeError such as ```py >>> ps.Series([True, False, True]) + True Traceback (most recent call last): ... TypeError: Addition can not be applied to booleans and the given type. >>> ps.Series([True, False, True]) + False Traceback (most recent call last): ... TypeError: Addition can not be applied to booleans and the given type. ``` ### How was this patch tested? Unit tests. Closes #32785 from xinrong-databricks/datatypeops_arith_bool. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-06-09 15:13:03 -07:00
Hyukjin Kwon	afff42178c	[SPARK-35647][PYTHON][DOCS] Restructure User Guide in PySpark documentation ### What changes were proposed in this pull request? This PR proposes to restructure User Guide in PySpark documentation for pandas APIs on Spark. Before ![Screen Shot 2021-06-08 at 8 47 41 PM](https://user-images.githubusercontent.com/6477701/121179493-cb85e280-c89a-11eb-8b93-552ebe7cd0a8.png) After ![Screen Shot 2021-06-08 at 8 46 58 PM](https://user-images.githubusercontent.com/6477701/121179419-b3ae5e80-c89a-11eb-82a0-6dabbf1de12d.png) Note that I mostly just moved the contents around except minor changes: - Removing some questions in FAQ that don't make sense in Apache Spark - Rename a subtitle "Working with pandas and PySpark" to "From/to pandas and PySpark DataFrames" For renaming Koalas to either pandas-on-Spark or pandas APIs on Spark, it will be done at SPARK-35591 ### Why are the changes needed? For better readability. ### Does this PR introduce _any_ user-facing change? Yes, it restructures the documentation as shown above. ### How was this patch tested? I manually built the docs and tested. Closes #32820 from HyukjinKwon/SPARK-35647. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-09 12:13:25 +09:00
liuqi	e79dd89cf6	[SPARK-35512][PYTHON] Fix OverflowError(cannot convert float infinity to integer) in partitionBy function ### What changes were proposed in this pull request? Limit the batch size for `add_shuffle_key` in `partitionBy` function to fix `OverflowError: cannot convert float infinity to integer` ### Why are the changes needed? It's not easy to write a UT, but I can use some simple code to explain the bug. * Original code ``` def add_shuffle_key(split, iterator): buckets = defaultdict(list) c, batch = 0, min(10 * numPartitions, 1000) for k, v in iterator: buckets[partitionFunc(k) % numPartitions].append((k, v)) c += 1 # check used memory and avg size of chunk of objects if (c % 1000 == 0 and get_used_memory() > limit or c > batch): n, size = len(buckets), 0 for split in list(buckets.keys()): yield pack_long(split) d = outputSerializer.dumps(buckets[split]) del buckets[split] yield d size += len(d) avg = int(size / n) >> 20 # let 1M < avg < 10M if avg < 1: batch *= 1.5 elif avg > 10: batch = max(int(batch / 1.5), 1) c = 0 ``` If `get_used_memory() > limit` is always `True` and `avg < 1` is always `True`, the variable `batch` will grow to infinity. Then `batch = max(int(batch / 1.5), 1)` may raise `OverflowError` if `avg > 10` at some time. Sample code to reproduce the bug ``` import sys limit = 100 used_memory = 200 numPartitions = 64 c, batch = 0, min(10 * numPartitions, 1000) while True: c += 1 if (c % 1000 == 0 and used_memory > limit or c > batch): batch = batch * 1.5 d = max(int(batch / 1.5), 1) print(c, batch) ``` ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? It's not easy to write a UT, there is sample code to test ``` import sys limit = 100 used_memory = 200 numPartitions = 64 c, batch = 0, min(10 * numPartitions, 1000) while True: c += 1 if (c % 1000 == 0 and used_memory > limit or c > batch): batch = min(sys.maxsize, batch * 1.5) d = max(int(batch / 1.5), 1) print(c, batch) ``` Closes #32667 from nolanliou/fix_partitionby. Authored-by: liuqi <nolan.liou@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-09 10:57:27 +09:00
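The overflow described in SPARK-35512 can be isolated from Spark entirely. The following is a minimal sketch, not the code from `rdd.py`: the helper `grow_batch` and its `steps` and `cap` parameters are hypothetical names introduced here for illustration. It assumes only that the batch size is repeatedly multiplied by 1.5, as in the commit message, and shows that the uncapped value becomes float infinity while capping at `sys.maxsize` keeps it convertible to an integer.

```python
import sys


def grow_batch(batch, steps, cap=None):
    """Multiply `batch` by 1.5 `steps` times, optionally capping each result.

    Hypothetical helper: it mimics only the growth step from the commit
    message, not the surrounding shuffle logic.
    """
    for _ in range(steps):
        batch = batch * 1.5
        if cap is not None:
            batch = min(cap, batch)
    return batch


# Same starting value as the reproduction script: min(10 * 64, 1000) == 640.
uncapped = grow_batch(640, 2000)
capped = grow_batch(640, 2000, cap=sys.maxsize)

# Float multiplication overflows silently to infinity; the error only
# appears later, when the shrink step converts the value with int().
try:
    max(int(uncapped / 1.5), 1)
    overflowed = False
except OverflowError:
    overflowed = True

# With the cap, the shrink step still works.
shrunk = max(int(capped / 1.5), 1)
```

The design point is that the failure is deferred: growing the value never raises, so the cap has to be applied at the growth step rather than at the point where `int()` is called.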
Hyukjin Kwon	921abc51cf	[SPARK-35636][PYTHON][DOCS][FOLLOW-UP] Restructure reference API files according to the layout ### What changes were proposed in this pull request? This PR proposes to restructure API files according to the layout, see https://github.com/apache/spark/pull/32799. Now the pandas APIs on Spark are under a separate directory which is same level as other modules such as Spark SQL. ```bash tree reference ``` Before: ``` reference ├── index.rst ├── ps_extensions.rst ├── ps_frame.rst ├── ps_general_functions.rst ├── ps_groupby.rst ├── ps_indexing.rst ├── ps_io.rst ├── ps_ml.rst ├── ps_series.rst ├── ps_window.rst ├── pyspark.ml.rst ├── pyspark.mllib.rst ├── pyspark.pandas.rst ├── pyspark.resource.rst ├── pyspark.rst ├── pyspark.sql.rst ├── pyspark.ss.rst └── pyspark.streaming.rst ``` After: ``` reference ├── index.rst ├── pyspark.ml.rst ├── pyspark.mllib.rst ├── pyspark.pandas │ ├── extensions.rst │ ├── frame.rst │ ├── general_functions.rst │ ├── groupby.rst │ ├── index.rst │ ├── indexing.rst │ ├── io.rst │ ├── ml.rst │ ├── series.rst │ └── window.rst ├── pyspark.resource.rst ├── pyspark.rst ├── pyspark.sql.rst ├── pyspark.ss.rst └── pyspark.streaming.rst ``` ### Why are the changes needed? To make the directory structure easier to follow. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually built and tested the docs. Closes #32812 from HyukjinKwon/SPARK-35646-followup. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-08 19:01:56 +09:00
Takuya UESHIN	04418e18d7	[SPARK-35638][PYTHON] Introduce InternalField to manage dtypes and StructFields ### What changes were proposed in this pull request? Introduces `InternalField` to manage dtypes and `StructField`s. `InternalFrame` is already managing dtypes, but when it checks the Spark's data types, column names, and nullabilities, it tries to run the analysis phase each time it needs, which will cause a performance issue. It will use `InternalField` class which stores the retrieved Spark's data types, column names, and nullabilities, and reuse them. Also, in case those can be known, just update and reuse them without asking Spark. ### Why are the changes needed? Currently there are some performance issues in the pandas-on-Spark layer. One of them is accessing Java DataFrame and run analysis phase too many times, especially just for retrieving the current column names or data types. We should reduce the amount of unnecessary access. ### Does this PR introduce _any_ user-facing change? Improves the performance in pandas-on-Spark layer: ```py df = ps.read_parquet("/path/to/test.parquet") # contains ~75 columns df = df[(df["col"] > 0) & (df["col"] < 10000)] ``` Before the PR, it took about 2.15 sec and after 1.15 sec. ### How was this patch tested? Existing tests. Closes #32775 from ueshin/issues/SPARK-35638/field. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-08 11:57:28 +09:00
Xinrong Meng	dfd8a8dc67	[SPARK-35341][PYTHON] Introduce BooleanExtensionOps ### What changes were proposed in this pull request? - Introduce BooleanExtensionOps in order to make boolean operators `and` and `or` data-type-based. - Improve error messages for operators `and` and `or`. ### Why are the changes needed? Boolean operators __and__, __or__, __rand__, and __ror__ should be data-type-based. BooleanExtensionDtypes processes these boolean operators differently from bool, so BooleanExtensionOps is introduced. These boolean operators themselves are also bitwise operators, which should be able to apply to other data type classes later. However, this is not the goal of this PR. ### Does this PR introduce _any_ user-facing change? Yes. Error messages for operators `and` and `or` are improved. Before: ``` >>> psser = ps.Series([1, "x", "y"], dtype="category") >>> psser | True Traceback (most recent call last): ... pyspark.sql.utils.AnalysisException: cannot resolve '(`0` OR true)' due to data type mismatch: differing types in '(`0` OR true)' (tinyint and boolean).; 'Project [unresolvedalias(CASE WHEN (isnull(0#9) OR isnull((0#9 OR true))) THEN false ELSE (0#9 OR true) END, Some(org.apache.spark.sql.Column$$Lambda$1442/17254916406fb8afba))] +- Project [__index_level_0__#8L, 0#9, monotonically_increasing_id() AS __natural_order__#12L] +- LogicalRDD [__index_level_0__#8L, 0#9], false ``` After: ``` >>> psser = ps.Series([1, "x", "y"], dtype="category") >>> psser | True Traceback (most recent call last): ... TypeError: Bitwise or can not be applied to categoricals. ``` ### How was this patch tested? Unit tests. Closes #32698 from xinrong-databricks/datatypeops_extension. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-06-07 15:43:52 -07:00
Xinrong Meng	04a8d2cbcf	[SPARK-35343][PYTHON] Make the conversion from/to pandas data-type-based for non-ExtensionDtypes ### What changes were proposed in this pull request? Make the conversion from/to pandas (for non-ExtensionDtype) data-type-based. NOTE: Ops class per ExtensionDtype and its data-type-based from/to pandas will be implemented in a separate PR as https://issues.apache.org/jira/browse/SPARK-35614. ### Why are the changes needed? The conversion from/to pandas includes logic for checking data types and behaving accordingly. That makes code hard to change or maintain. Since we have introduced the Ops class per non-ExtensionDtype data type, we ought to make the conversion from/to pandas data-type-based for non-ExtensionDtypes. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit tests. Closes #32592 from xinrong-databricks/datatypeop_pd_conversion. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-06-07 13:12:12 -07:00
Hyukjin Kwon	7ce7aa4758	[SPARK-35646][PYTHON][DOCS] Relocate pandas-on-Spark API references in documentation ### What changes were proposed in this pull request? This PR proposes to change from: ![Screen Shot 2021-06-07 at 1 40 47 PM](https://user-images.githubusercontent.com/6477701/120960027-fc302400-c795-11eb-96fb-73ac1d8277fe.png) to: ![Screen Shot 2021-06-07 at 1 41 19 PM](https://user-images.githubusercontent.com/6477701/120960074-0fdb8a80-c796-11eb-87ec-69a30692fdfe.png) ### Why are the changes needed? pandas APIs on Spark (pandas on Spark) is a package in PySpark in the end. So it has to be documented in the same level with other packages (e.g., Spark SQL). ### Does this PR introduce _any_ user-facing change? Yes, it changes the structure of the docs. To end users, no as it's only in development branch. ### How was this patch tested? Manually tested as above. Closes #32799 from HyukjinKwon/SPARK-35646. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-07 16:37:58 +09:00
Xinrong Meng	50f7686de9	[SPARK-35599][PYTHON] Adjust `check_exact` parameter for older pd.testing ### What changes were proposed in this pull request? Adjust the `check_exact` parameter for non-numeric columns to ensure pandas-on-Spark tests passed with all pandas versions. ### Why are the changes needed? `pd.testing` utils are utilized in pandas-on-Spark tests. Due to https://github.com/pandas-dev/pandas/issues/35446, `check_exact=True` for non-numeric columns doesn't work for older pd.testing utils, e.g. `assert_series_equal`. We wanted to adjust that to ensure pandas-on-Spark tests pass for all pandas versions. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing unit tests. Closes #32772 from xinrong-databricks/test_util. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-07 11:12:49 +09:00
itholic	b8740a1d1e	[SPARK-35499][PYTHON] Apply black to pandas API on Spark codes ### What changes were proposed in this pull request? This PR proposes applying `black` to pandas API on Spark codes, for improving static analysis. By executing the `./dev/reformat-python` in the spark home directory, all the code of the pandas API on Spark is fixed according to the static analysis rules. ### Why are the changes needed? This can reduce the cost of static analysis during development. It has been used continuously for about a year in the Koalas project and its convenience has been proven. ### Does this PR introduce _any_ user-facing change? No, it's dev-only. ### How was this patch tested? Manually reformatted the pandas API on Spark codes by running the `./dev/reformat-python`, and checked that `./dev/lint-python` passes. Closes #32779 from itholic/SPARK-35499. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2021-06-06 17:30:07 -07:00
Keerthan Vasist	f2c0a049a6	[SPARK-35643][PYTHON] Fix ambiguous reference in functions.py column() ### What changes were proposed in this pull request? In functions.py, there is a function added `def column(col)`. There is also another method in the same file `def col(col)`. This leads to some ambiguity on whether the parameter is being referred to or the function. In pyspark 3.1.2, this leads to `TypeError: 'str' object is not callable` when the function `column(col)` is called - the highest preference is given to the string variable in scope as opposed to the function `col` in the file as intended. This PR fixes that ambiguity by changing the variable name to `col_like`. I have filed this as an issue on JIRA here - https://issues.apache.org/jira/browse/SPARK-35643. ### Why are the changes needed? In pyspark 3.1.2, we see `TypeError: 'str' object is not callable` when `column()` function is called. This PR fixes that error. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? I don't believe this patch needs additional testing. Closes #32771 from keerthanvasist/col. Lead-authored-by: Keerthan Vasist <kvasist@amazon.com> Co-authored-by: keerthanvasist <kvasist@amazon.com> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-05 12:40:39 +09:00
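The shadowing problem behind SPARK-35643 is plain Python scoping and can be shown without PySpark. In the sketch below, `col`, `column_shadowed`, and `column_fixed` are stand-ins written for illustration (the real functions build Spark `Column` objects); only the parameter names `col` and `col_like` come from the commit message.

```python
def col(name):
    """Stand-in for the module-level `col` function; returns a tagged tuple."""
    return ("Column", name)


def column_shadowed(col):
    # The parameter `col` shadows the module-level function `col`, so this
    # line tries to call the string argument itself.
    return col(col)


def column_fixed(col_like):
    # After the rename, `col` resolves to the module-level function again.
    return col(col_like)


try:
    column_shadowed("a")
    shadow_error = None
except TypeError as exc:
    shadow_error = str(exc)

fixed_result = column_fixed("a")
```

Renaming the parameter is the smallest change that removes the ambiguity, which is presumably why the fix did not touch the public function names.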
Hyukjin Kwon	3d158f9c91	[SPARK-35587][PYTHON][DOCS] Initial porting of Koalas documentation ### What changes were proposed in this pull request? This PR proposes to port Koalas documentation to PySpark documentation as its initial step. It ports almost as is except these differences: - Renamed import from `databricks.koalas` to `pyspark.pandas`. - Renamed `to_koalas` -> `to_pandas_on_spark` - Renamed `(Series|DataFrame).koalas` -> `(Series|DataFrame).pandas_on_spark` - Added a `ps_` prefix in the RST file names of Koalas documentation Other than that, - Excluded `python/docs/build/html` in linter - Fixed GA dependency installation ### Why are the changes needed? To document pandas APIs on Spark. ### Does this PR introduce _any_ user-facing change? Yes, it adds new documentation. ### How was this patch tested? Manually built the docs and checked the output. Closes #32726 from HyukjinKwon/SPARK-35587. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-04 11:11:09 +09:00
itholic	2658bc590f	[SPARK-35081][DOCS] Add Data Source Option links to missing documents ### What changes were proposed in this pull request? This PR proposes adding the missing link to Data Source Option page, for related functions such as `to_csv`, `to_json`, `from_csv`, `from_json`, `schema_of_csv`, `schema_of_json`. - Before <img width="797" alt="Screen Shot 2021-06-03 at 11 39 17 AM" src="https://user-images.githubusercontent.com/44108233/120578877-7b092200-c461-11eb-9e24-bd5349445c66.png"> - After <img width="776" alt="Screen Shot 2021-06-03 at 11 59 14 AM" src="https://user-images.githubusercontent.com/44108233/120579868-29fa2d80-c463-11eb-9329-bd6c8f068f5b.png"> ### Why are the changes needed? To provide users available options in detail with the proper documentation link. ### Does this PR introduce _any_ user-facing change? Yes, the link to Data Source Options page is added to the API documentations, as shown in the above screen capture. ### How was this patch tested? Manually built the docs and checked one by one. Closes #32762 from itholic/SPARK-35081. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-03 13:52:46 +09:00
itholic	48252bac95	[SPARK-35583][DOCS] Move JDBC data source options from Python and Scala into a single page ### What changes were proposed in this pull request? This PR proposes moving missing JDBC data source options from Python, Scala and Java into a single page. ### Why are the changes needed? So far, the documentation for JDBC data source options is separated into different pages for each language API documents. However, this makes managing many options inconvenient, so it is efficient to manage all options in a single page and provide a link to that page in the API of each language. ### Does this PR introduce _any_ user-facing change? Yes, the documents will be shown below after this change: - "JDBC To Other Databases" page <img width="803" alt="Screen Shot 2021-06-02 at 11 34 14 AM" src="https://user-images.githubusercontent.com/44108233/120415520-a115c000-c396-11eb-9663-9e666e08ed2b.png"> - Python ![Screen Shot 2021-06-01 at 2 57 40 PM](https://user-images.githubusercontent.com/44108233/120273628-ba146780-c2e9-11eb-96a8-11bd25415197.png) - Scala ![Screen Shot 2021-06-01 at 2 57 03 PM](https://user-images.githubusercontent.com/44108233/120273567-a2d57a00-c2e9-11eb-9788-ea58028ca0a6.png) - Java ![Screen Shot 2021-06-01 at 2 58 27 PM](https://user-images.githubusercontent.com/44108233/120273722-d912f980-c2e9-11eb-83b3-e09992d8c582.png) ### How was this patch tested? Manually build docs and confirm the page. Closes #32723 from itholic/SPARK-35583. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-02 14:21:16 +09:00
itholic	0ad5ae54b2	[SPARK-35539][PYTHON] Restore to_koalas to keep the backward compatibility ### What changes were proposed in this pull request? This PR proposes restoring `to_koalas` to keep the backward compatibility, while emitting a deprecation warning. ### Why are the changes needed? If we remove `to_koalas`, the existing Koalas codes that include `to_koalas` wouldn't work. ### Does this PR introduce _any_ user-facing change? No. It's restoring the existing functionality. ### How was this patch tested? Manually tested locally. ```shell >>> sdf.to_koalas() .../spark/python/pyspark/pandas/frame.py:4550: FutureWarning: DataFrame.to_koalas is deprecated as of DataFrame.to_pandas_on_spark. Please use the API instead. warnings.warn( ``` Closes #32729 from itholic/SPARK-35539. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-02 10:39:24 +09:00
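The backward-compatibility pattern in SPARK-35539 (keep the old method, forward it to the new one, emit a `FutureWarning`) can be sketched with the standard library alone. The class `FrameSketch` and its return value are hypothetical; only the method names `to_koalas` and `to_pandas_on_spark` and the warning category come from the commit message, and the warning text here is paraphrased rather than copied from PySpark.

```python
import warnings


class FrameSketch:
    """Hypothetical stand-in for a Spark DataFrame, for illustration only."""

    def to_pandas_on_spark(self):
        return "pandas-on-Spark frame"

    def to_koalas(self):
        # Deprecated alias: warn, then delegate to the new method.
        warnings.warn(
            "to_koalas is deprecated. Use to_pandas_on_spark instead.",
            FutureWarning,
            stacklevel=2,
        )
        return self.to_pandas_on_spark()


# Capture warnings so the behaviour can be inspected programmatically.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = FrameSketch().to_koalas()
```

Delegating rather than duplicating the body keeps the old and new entry points from drifting apart while the alias remains available.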
Xinrong Meng	0ac5c16177	[SPARK-35314][PYTHON] Support arithmetic operations against bool IndexOpsMixin ### What changes were proposed in this pull request? Support arithmetic operations against bool IndexOpsMixin. ### Why are the changes needed? Existing binary operations of bool IndexOpsMixin in Koalas do not match pandas’ behaviors. pandas take True as 1, False as 0 when dealing with numeric values, numeric collections, and numeric Series/Index; whereas Koalas raises an AnalysisException no matter what the binary operation is. We aim to match pandas' behaviors. ### Does this PR introduce _any_ user-facing change? Yes. Before the change: ```py >>> import pyspark.pandas as ps >>> psser = ps.Series([True, True, False]) >>> psser + 1 Traceback (most recent call last): ... TypeError: Addition can not be applied to booleans. >>> 1 + psser Traceback (most recent call last): ... TypeError: Addition can not be applied to booleans. >>> from pyspark.pandas.config import set_option >>> set_option("compute.ops_on_diff_frames", True) >>> psser + ps.Series([1, 2, 3]) Traceback (most recent call last): ... TypeError: Addition can not be applied to booleans. >>> ps.Series([1, 2, 3]) + psser Traceback (most recent call last): ... TypeError: addition can not be applied to given types. ``` After the change: ```py >>> import pyspark.pandas as ps >>> psser = ps.Series([True, True, False]) >>> psser + 1 0 2 1 2 2 1 dtype: int64 >>> 1 + psser 0 2 1 2 2 1 dtype: int64 >>> from pyspark.pandas.config import set_option >>> set_option("compute.ops_on_diff_frames", True) >>> psser + ps.Series([1, 2, 3]) 0 2 1 3 2 3 dtype: int64 >>> ps.Series([1, 2, 3]) + psser 0 2 1 3 2 3 dtype: int64 ``` ### How was this patch tested? Unit tests. Closes #32611 from xinrong-databricks/datatypeop_arith_bool. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-06-01 10:57:12 -07:00
itholic	fe09def323	[SPARK-35582][PYTHON][DOCS] Remove # noqa in Python API documents ### What changes were proposed in this pull request? This PR aims to move `# noqa` in the Python docstring to the proper place to hide it from the official documents. ### Why are the changes needed? If we don't move `# noqa` to the proper place, it is exposed in the middle of the docstring, and it looks a bit weird as below: <img width="613" alt="Screen Shot 2021-06-01 at 3 17 52 PM" src="https://user-images.githubusercontent.com/44108233/120275617-91da3800-c2ec-11eb-9778-16c5fe789418.png"> ### Does this PR introduce _any_ user-facing change? Yes, the `# noqa` is no longer shown in the documents as below: <img width="609" alt="Screen Shot 2021-06-01 at 3 21 00 PM" src="https://user-images.githubusercontent.com/44108233/120275927-fbf2dd00-c2ec-11eb-950d-346af2745711.png"> ### How was this patch tested? Manually build docs and check. Closes #32728 from itholic/SPARK-35582. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-01 15:24:04 +09:00
itholic	73d4f67145	[SPARK-35433][DOCS] Move CSV data source options from Python and Scala into a single page ### What changes were proposed in this pull request? This PR proposes move CSV data source options from Python, Scala and Java into a single page. ### Why are the changes needed? So far, the documentation for CSV data source options is separated into different pages for each language API documents. However, this makes managing many options inconvenient, so it is efficient to manage all options in a single page and provide a link to that page in the API of each language. ### Does this PR introduce _any_ user-facing change? Yes, the documents will be shown below after this change: - "CSV Files" page <img width="970" alt="Screen Shot 2021-05-27 at 12 35 36 PM" src="https://user-images.githubusercontent.com/44108233/119762269-586a8c80-bee8-11eb-8443-ae5b3c7a685c.png"> - Python <img width="785" alt="Screen Shot 2021-05-25 at 4 12 10 PM" src="https://user-images.githubusercontent.com/44108233/119455390-83cc6a80-bd74-11eb-9156-65785ae27db0.png"> - Scala <img width="718" alt="Screen Shot 2021-05-25 at 4 12 39 PM" src="https://user-images.githubusercontent.com/44108233/119455414-89c24b80-bd74-11eb-9775-aeda549d081e.png"> - Java <img width="667" alt="Screen Shot 2021-05-25 at 4 13 09 PM" src="https://user-images.githubusercontent.com/44108233/119455422-8d55d280-bd74-11eb-97e8-86c1eabeadc2.png"> ### How was this patch tested? Manually build docs and confirm the page. Closes #32658 from itholic/SPARK-35433. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-01 10:58:49 +09:00
itholic	7e2717333b	[SPARK-35453][PYTHON] Move Koalas accessor to pandas_on_spark accessor ### What changes were proposed in this pull request? This PR proposes renaming the existing "Koalas Accessor" to "Pandas API on Spark Accessor". ### Why are the changes needed? Because the name "Koalas" is no longer used in favor of "Pandas API on Spark", all related code needs to be changed. ### Does this PR introduce _any_ user-facing change? Yes, the pandas API on Spark accessor changes from `df.koalas.[...]` to `df.pandas_on_spark.[...]`. Note: `df.koalas.[...]` is still available but emits deprecation warnings. ### How was this patch tested? Manually tested locally and checked one by one. Closes #32674 from itholic/SPARK-35453. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-01 10:33:10 +09:00
Hyukjin Kwon	7eb74482a7	[SPARK-35510][PYTHON] Fix and reenable test_stats_on_non_numeric_columns_should_be_discarded_if_numeric_only_is_true ### What changes were proposed in this pull request? This PR proposes to fix and reenable `test_stats_on_non_numeric_columns_should_be_discarded_if_numeric_only_is_true` that was disabled when we upgrade Python 3.9 in CI at https://github.com/apache/spark/pull/32657. Seems like this is because of the latest NumPy's behaviour change, see also `https://github.com/numpy/numpy/pull/16273#discussion_r641264085`. pandas inherits this behaviour but it doesn't make sense when `numeric_only` is set to `True` in pandas. I will track and follow the status of the issue between pandas and NumPy. For the time being, I propose to exclude boolean case alone in percentile/quartile test case ### Why are the changes needed? To keep the test coverage. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? I roughly locally tested. But it should pass in CI. Closes #32690 from HyukjinKwon/SPARK-35510. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-28 17:35:01 +09:00
Xinrong Meng	79a2a46cdb	[SPARK-35098][PYTHON] Re-enable pandas-on-Spark test cases ### What changes were proposed in this pull request? Re-enable some pandas-on-Spark test cases. ### Why are the changes needed? pandas version in GitHub Actions is upgraded now so we can re-enable some pandas-on-Spark test cases. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit tests. Closes #32682 from xinrong-databricks/enable_tests. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-27 12:33:30 +09:00
Takuya UESHIN	d6d3209c2f	[SPARK-35537][PYTHON] Introduce a util function spark_column_equals ### What changes were proposed in this pull request? Introduce a util function `spark_column_equals` to check the underlying expressions of columns are the same or not. ### Why are the changes needed? In pandas on Spark, there are some places checking the underlying expressions of columns are the same or not, but it's done one-by-one. We should introduce a util function for it. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? The existing tests. Closes #32680 from ueshin/issues/SPARK-35537/spark_column_equals. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-27 12:14:43 +09:00
Xinrong Meng	8cc7232ffa	[SPARK-35522][PYTHON] Introduce BinaryOps for BinaryType ### What changes were proposed in this pull request? BinaryType, which represents byte sequence values in Spark, doesn't support data-type-based operations yet. We are going to introduce BinaryOps for it. ### Why are the changes needed? The data-type-based-operations class should be set for each individual data type, including BinaryType. In addition, BinaryType has its special way of addition, which means concatenation. ### Does this PR introduce _any_ user-facing change? Yes. Before the change: ```py >>> import pyspark.pandas as ps >>> psser = ps.Series([b'1', b'2', b'3']) >>> psser + psser Traceback (most recent call last): ... TypeError: Type object was not understood. >>> psser + b'1' Traceback (most recent call last): ... TypeError: Type object was not understood. ``` After the change: ```py >>> import pyspark.pandas as ps >>> psser = ps.Series([b'1', b'2', b'3']) >>> psser + psser 0 [49, 49] 1 [50, 50] 2 [51, 51] dtype: object >>> psser + b'1' 0 [49, 49] 1 [50, 49] 2 [51, 49] dtype: object ``` ### How was this patch tested? Unit tests. Closes #32665 from xinrong-databricks/datatypeops_binary. Lead-authored-by: Xinrong Meng <xinrong.meng@databricks.com> Co-authored-by: xinrong-databricks <47337188+xinrong-databricks@users.noreply.github.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-05-26 14:30:24 -07:00
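The list-of-integers output in the commit above can be read as byte concatenation, with each result shown as its byte values. A stdlib-only sketch, not using `pyspark.pandas`; the plain list is only a stand-in for the Series:

```python
# Plain Python stand-in for ps.Series([b'1', b'2', b'3']).
values = [b"1", b"2", b"3"]

# Adding bytes objects concatenates them; list() exposes the byte values.
self_concat = [list(v + v) for v in values]
with_literal = [list(v + b"1") for v in values]

print(self_concat)    # [[49, 49], [50, 50], [51, 51]]
print(with_literal)   # [[49, 49], [50, 49], [51, 49]]
```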
Xinrong Meng	266608d50e	[SPARK-35452][PYTHON] Introduce ArrayOps, MapOps and StructOps ### What changes were proposed in this pull request? The PR is proposed to introduce ArrayOps, MapOps and StructOps to handle data-type-based operations for StructType, ArrayType, and MapType separately. ### Why are the changes needed? StructType, ArrayType, and MapType are not accepted by DataTypeOps now. We should handle these complex types. Among them: - ArrayType supports concatenation: for example, ps.Series([[1,2,3]]) + ps.Series([[4,5,6]]) should work the same as pd.Series([[1,2,3]]) + pd.Series([[4,5,6]]), as concatenation. - StructOps will be helpful to make to/from pandas conversion data-type-based. ### Does this PR introduce _any_ user-facing change? Yes. Before the change: ```py >>> import pyspark.pandas as ps >>> from pyspark.pandas.config import set_option >>> set_option("compute.ops_on_diff_frames", True) >>> ps.Series([[1, 2, 3]]) + ps.Series([[0.4, 0.5]]) Traceback (most recent call last): ... TypeError: Type object was not understood. >>> ps.Series([[1, 2, 3]]) + ps.Series([[4, 5]]) Traceback (most recent call last): ... TypeError: Type object was not understood. >>> ps.Series([[1, 2, 3]]) + ps.Series([['x']]) Traceback (most recent call last): ... TypeError: Type object was not understood. ``` After the change: ```py >>> import pyspark.pandas as ps >>> from pyspark.pandas.config import set_option >>> set_option("compute.ops_on_diff_frames", True) >>> ps.Series([[1, 2, 3]]) + ps.Series([[0.4, 0.5]]) 0 [1.0, 2.0, 3.0, 0.4, 0.5] dtype: object >>> ps.Series([[1, 2, 3]]) + ps.Series([[4, 5]]) 0 [1, 2, 3, 4, 5] dtype: object >>> ps.Series([[1, 2, 3]]) + ps.Series([['x']]) Traceback (most recent call last): ... TypeError: Concatenation can only be applied to arrays of the same type ``` ### How was this patch tested? Unit tests. Closes #32626 from xinrong-databricks/datatypeop_complex. 
Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-05-26 10:40:01 -07:00
itholic	79a6b0cc8a	[SPARK-35509][DOCS] Move text data source options from Python and Scala into a single page ### What changes were proposed in this pull request? This PR proposes move text data source options from Python, Scala and Java into a single page. ### Why are the changes needed? So far, the documentation for text data source options is separated into different pages for each language API documents. However, this makes managing many options inconvenient, so it is efficient to manage all options in a single page and provide a link to that page in the API of each language. ### Does this PR introduce _any_ user-facing change? Yes, the documents will be shown below after this change: - "Text Files" page <img width="823" alt="Screen Shot 2021-05-26 at 3 20 11 PM" src="https://user-images.githubusercontent.com/44108233/119611669-f5202200-be35-11eb-9307-45846949d300.png"> - Python <img width="791" alt="Screen Shot 2021-05-25 at 5 04 26 PM" src="https://user-images.githubusercontent.com/44108233/119462469-b9c11d00-bd7b-11eb-8f19-2ba7b9ceb318.png"> - Scala <img width="683" alt="Screen Shot 2021-05-25 at 5 05 10 PM" src="https://user-images.githubusercontent.com/44108233/119462483-bd54a400-bd7b-11eb-8177-74e4d7035e63.png"> - Java <img width="665" alt="Screen Shot 2021-05-25 at 5 05 36 PM" src="https://user-images.githubusercontent.com/44108233/119462501-bfb6fe00-bd7b-11eb-8161-12c58fabe7e2.png"> ### How was this patch tested? Manually build docs and confirm the page. Closes #32660 from itholic/SPARK-35509. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-26 17:12:49 +09:00
Hyukjin Kwon	20750a3f9e	[SPARK-32194][PYTHON] Use proper exception classes instead of plain Exception ### What changes were proposed in this pull request? This PR proposes to use proper built-in exceptions instead of the plain `Exception` in Python. While I am here, I fixed another minor issue at `DataFrame.schema` together: ```diff - except AttributeError as e: - raise Exception( - "Unable to parse datatype from schema. %s" % e) + except Exception as e: + raise ValueError( + "Unable to parse datatype from schema. %s" % e) from e ``` Now it catches all exceptions during schema parsing and chains them to a `ValueError`. Previously it only caught `AttributeError`, which does not cover all cases. ### Why are the changes needed? For users to expect the proper exceptions. ### Does this PR introduce _any_ user-facing change? Yes, the exception classes are different but should be compatible, because the previous exception was plain `Exception`, from which the other exceptions inherit. ### How was this patch tested? Existing unit tests should cover. Closes #31238 Closes #32650 from HyukjinKwon/SPARK-32194. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-26 11:54:40 +09:00
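The `raise ... from e` pattern in the diff above keeps the original error reachable through `__cause__`. A small self-contained sketch; `parse_datatype` is a hypothetical function written for illustration and is not PySpark code:

```python
def parse_datatype(schema_json):
    """Hypothetical parser, used only to illustrate exception chaining."""
    try:
        return schema_json["type"]
    except Exception as e:
        # Re-raise as a specific built-in exception, chained to the original.
        raise ValueError("Unable to parse datatype from schema. %s" % e) from e


caught = None
try:
    parse_datatype(None)  # None is not subscriptable, so a TypeError is raised inside
except ValueError as err:
    caught = err

# The original exception is preserved on the chained ValueError.
print(type(caught).__name__, "caused by", type(caught.__cause__).__name__)
```

Because `ValueError` is a subclass of `Exception`, callers that previously caught `Exception` still catch the new error.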
Hyukjin Kwon	e47e615c0e	[SPARK-35506][PYTHON][INFRA] Run tests with Python 3.9 in GitHub Actions ### What changes were proposed in this pull request? This PR enables GitHub Actions to test PySpark with Python 3.9. ### Why are the changes needed? To verify the support of Python 3.9. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Existing tests should cover. Closes #32657 from HyukjinKwon/SPARK-35506. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-26 09:25:51 +09:00
Takuya UESHIN	d67d73b708	[SPARK-35505][PYTHON] Remove APIs which have been deprecated in Koalas ### What changes were proposed in this pull request? Removes APIs which have been deprecated in Koalas. ### Why are the changes needed? There are some APIs that have been deprecated in Koalas. We shouldn't have those in pandas APIs on Spark. ### Does this PR introduce _any_ user-facing change? Yes, the APIs deprecated in Koalas will be no longer available. ### How was this patch tested? Modified some tests which use the deprecated APIs, and the other existing tests should pass. Closes #32656 from ueshin/issues/SPARK-35505/remove_deprecated. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-05-25 11:16:27 -07:00
Hyukjin Kwon	4a6d844184	[SPARK-35497][PYTHON] Enable plotly tests in pandas-on-Spark ### What changes were proposed in this pull request? This PR enables plot tests with plotly ```bash ./python/run-tests --python-executables=python3 --modules=pyspark-pandas ``` Before: ``` Traceback (most recent call last): File "/.../miniconda3/envs/python3.8/lib/python3.8/runpy.py", line 194, in _run_module_as_main return _run_code(code, main_globals, None, File "/.../miniconda3/envs/python3.8/lib/python3.8/runpy.py", line 87, in _run_code exec(code, run_globals) File "/.../pyspark/pandas/tests/plot/test_frame_plot_plotly.py", line 42, in <module> plotly_requirement_message + " Or pandas<1.0; pandas<1.0 does not support latest plotly " TypeError: unsupported operand type(s) for +: 'NoneType' and 'str' ``` After: ``` ... Starting test(python3): pyspark.pandas.tests.plot.test_series_plot_plotly ... Finished test(python3): pyspark.pandas.tests.plot.test_series_plot_plotly (23s) ... Tests passed in 1296 seconds ``` ### Why are the changes needed? For test coverage. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? By running the tests. Closes #32649 from HyukjinKwon/SPARK-35497. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-25 12:31:32 +09:00
Weichen Xu	fdd7ca5f4e	[SPARK-35498][PYTHON] Add thread target wrapper API for pyspark pin thread mode ### What changes were proposed in this pull request? Add thread target wrapper API for pyspark pin thread mode. ### Why are the changes needed? A helper method that makes it easier for users to write threading code under pin thread mode. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manual. Closes #32644 from WeichenXu123/add_thread_target_wrapper_api. Authored-by: Weichen Xu <weichen.xu@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-25 09:50:22 +09:00
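The general idea of a thread target wrapper can be sketched with the standard library alone. This is an illustration of the pattern, not the PySpark implementation: the names `set_local_props`, `get_local_props`, and `inheriting_target` are hypothetical, and a `threading.local` dictionary stands in for whatever per-thread state pin thread mode tracks.

```python
import functools
import threading

_local = threading.local()  # hypothetical per-thread property store


def set_local_props(props):
    _local.props = dict(props)


def get_local_props():
    return dict(getattr(_local, "props", {}))


def inheriting_target(func):
    """Wrap a thread target so it starts with the caller's properties."""
    captured = get_local_props()  # read in the thread that calls the wrapper

    @functools.wraps(func)
    def wrapped(*args, **kwargs):
        set_local_props(captured)  # applied inside the new thread
        return func(*args, **kwargs)

    return wrapped


set_local_props({"job.group": "etl"})
seen = {}


def work(key):
    seen[key] = get_local_props()


plain = threading.Thread(target=work, args=("plain",))
wrapped = threading.Thread(target=inheriting_target(work), args=("wrapped",))
for t in (plain, wrapped):
    t.start()
for t in (plain, wrapped):
    t.join()

print(seen["plain"])    # {}
print(seen["wrapped"])  # {'job.group': 'etl'}
```

Only the wrapped target sees the parent's properties, because thread-local state is not inherited by new threads on its own.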
Takuya UESHIN	1b75c2494c	[SPARK-35467][SPARK-35468][SPARK-35477][PYTHON] Fix disallow_untyped_defs mypy checks ### What changes were proposed in this pull request? Adds more type annotations in the files: - `python/pyspark/pandas/spark/accessors.py` - `python/pyspark/pandas/typedef/typehints.py` - `python/pyspark/pandas/utils.py` and fixes the mypy check failures. ### Why are the changes needed? We should enable more `disallow_untyped_defs` mypy checks. ### Does this PR introduce _any_ user-facing change? Yes. This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users. ### How was this patch tested? The mypy check with a new configuration and existing tests should pass. Closes #32627 from ueshin/issues/SPARK-35467_35468_35477/disallow_untyped_defs. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-24 09:31:00 +09:00
Takuya UESHIN	2616d5cc1d	[SPARK-35465][PYTHON] Set up the mypy configuration to enable disallow_untyped_defs check for pandas APIs on Spark module ### What changes were proposed in this pull request? Sets up the `mypy` configuration to enable `disallow_untyped_defs` check for pandas APIs on Spark module. ### Why are the changes needed? Currently many functions in the main codes in pandas APIs on Spark module are still missing type annotations and disabled `mypy` check `disallow_untyped_defs`. We should add more type annotations and enable the mypy check. ### Does this PR introduce _any_ user-facing change? Yes. This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users. ### How was this patch tested? The mypy check with a new configuration and existing tests should pass. Closes #32614 from ueshin/issues/SPARK-35465/disallow_untyped_defs. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-05-21 11:03:35 -07:00
itholic	d2bdd6595e	[SPARK-35025][SQL][PYTHON][DOCS] Move Parquet data source options from Python and Scala into a single page ### What changes were proposed in this pull request? This PR proposes move Parquet data source options from Python, Scala and Java into a single page. ### Why are the changes needed? So far, the documentation for Parquet data source options is separated into different pages for each language API documents. However, this makes managing many options inconvenient, so it is efficient to manage all options in a single page and provide a link to that page in the API of each language. ### Does this PR introduce _any_ user-facing change? Yes, the documents will be shown below after this change: - "Parquet Files" page ![Screen Shot 2021-05-21 at 1 35 08 PM](https://user-images.githubusercontent.com/44108233/119082866-e7375f00-ba39-11eb-9ade-a931a5957b34.png) - Python ![Screen Shot 2021-05-21 at 1 38 27 PM](https://user-images.githubusercontent.com/44108233/119082879-eef70380-ba39-11eb-9e8e-ee50eed98dbe.png) - Scala ![Screen Shot 2021-05-21 at 1 36 52 PM](https://user-images.githubusercontent.com/44108233/119082884-f1595d80-ba39-11eb-98d5-966657df65f7.png) - Java ![Screen Shot 2021-05-21 at 1 37 19 PM](https://user-images.githubusercontent.com/44108233/119082888-f4544e00-ba39-11eb-8bf8-47ce78ec0b01.png) ### How was this patch tested? Manually build docs and confirm the page. Closes #32161 from itholic/SPARK-34491. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-21 18:05:49 +09:00
itholic	419ddcb2a4	[SPARK-34494][SQL][DOCS] Move JSON data source options from Python and Scala into a single page ### What changes were proposed in this pull request? This PR proposes move JSON data source options from Python, Scala and Java into a single page. ### Why are the changes needed? So far, the documentation for JSON data source options is separated into different pages for each language API documents. However, this makes managing many options inconvenient, so it is efficient to manage all options in a single page and provide a link to that page in the API of each language. ### Does this PR introduce _any_ user-facing change? Yes, the documents will be shown below after this change: - "JSON Files" page <img width="876" alt="Screen Shot 2021-05-20 at 8 48 27 PM" src="https://user-images.githubusercontent.com/44108233/118973662-ddb3e580-b9ac-11eb-987c-8139aa9c3fe2.png"> - Python <img width="714" alt="Screen Shot 2021-04-16 at 5 04 11 PM" src="https://user-images.githubusercontent.com/44108233/114992491-ca0cef00-9ed5-11eb-9d0f-4de60d8b2516.png"> - Scala <img width="726" alt="Screen Shot 2021-04-16 at 5 04 54 PM" src="https://user-images.githubusercontent.com/44108233/114992594-e315a000-9ed5-11eb-8bd3-af7e568fcfe1.png"> - Java <img width="911" alt="Screen Shot 2021-04-16 at 5 06 11 PM" src="https://user-images.githubusercontent.com/44108233/114992751-10624e00-9ed6-11eb-888c-8668d3c74289.png"> ### How was this patch tested? Manually build docs and confirm the page. Closes #32204 from itholic/SPARK-35081. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-21 18:05:13 +09:00
itholic	0fe65b5365	[SPARK-35395][DOCS] Move ORC data source options from Python and Scala into a single page ### What changes were proposed in this pull request? This PR proposes move ORC data source options from Python, Scala and Java into a single page. ### Why are the changes needed? So far, the documentation for ORC data source options is separated into different pages for each language API documents. However, this makes managing many options inconvenient, so it is efficient to manage all options in a single page and provide a link to that page in the API of each language. ### Does this PR introduce _any_ user-facing change? Yes, the documents will be shown below after this change: - "ORC Files" page ![Screen Shot 2021-05-21 at 2 07 14 PM](https://user-images.githubusercontent.com/44108233/119085078-f4564d00-ba3d-11eb-8990-3ba031d809da.png) - Python ![Screen Shot 2021-05-21 at 2 06 46 PM](https://user-images.githubusercontent.com/44108233/119085097-00daa580-ba3e-11eb-8017-ac5a95a7c053.png) - Scala ![Screen Shot 2021-05-21 at 2 06 09 PM](https://user-images.githubusercontent.com/44108233/119085135-164fcf80-ba3e-11eb-9cac-78dded523f38.png) - Java ![Screen Shot 2021-05-21 at 2 06 30 PM](https://user-images.githubusercontent.com/44108233/119085125-118b1b80-ba3e-11eb-9434-f26612d7da13.png) ### How was this patch tested? Manually build docs and confirm the page. Closes #32546 from itholic/SPARK-35395. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-21 18:03:57 +09:00
itholic	6b912e4179	[SPARK-35364][PYTHON] Renaming the existing Koalas related codes ### What changes were proposed in this pull request? Some test and function names still refer to Koalas. This PR renames them to fit pandas-on-Spark. - kdf -> psdf - kser -> psser - kidx -> psidx - kmidx -> psmidx - to_koalas() -> to_pandas_on_spark() ### Why are the changes needed? This is because the name Koalas is no longer used in PySpark. ### Does this PR introduce _any_ user-facing change? `to_koalas()` function is renamed to `to_pandas_on_spark()` ### How was this patch tested? Tested manually in a local environment. After changing the related names, I checked them one by one. Closes #32516 from itholic/SPARK-35364. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-05-20 15:08:30 -07:00
Xinrong Meng	a970f8505d	[SPARK-35338][PYTHON] Separate arithmetic operations into data type based structures ### What changes were proposed in this pull request? The PR is proposed for pandas APIs on Spark, in order to separate arithmetic operations shown as below into data-type-based structures. `__add__, __sub__, __mul__, __truediv__, __floordiv__, __pow__, __mod__, __radd__, __rsub__, __rmul__, __rtruediv__, __rfloordiv__, __rpow__,__rmod__` DataTypeOps and subclasses are introduced. The existing behaviors of each arithmetic operation should be preserved. ### Why are the changes needed? Currently, the same arithmetic operation of all data types is defined in one function, so it’s difficult to extend the behavior change based on the data types. Introducing DataTypeOps would be the foundation for [pandas APIs on Spark: Separate basic operations into data type based structures.](https://docs.google.com/document/d/12MS6xK0hETYmrcl5b9pX5lgV4FmGVfpmcSKq--_oQlc/edit?usp=sharing). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tests are introduced under pyspark.pandas.tests.data_type_ops. One test file per DataTypeOps class. Closes #32596 from xinrong-databricks/datatypeop_arith_fix. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-05-19 19:47:00 -07:00
Hyukjin Kwon	7eaabf4df5	[SPARK-35408][PYTHON][FOLLOW-UP] Avoid unnecessary f-string format ### What changes were proposed in this pull request? This PR avoids using f-string format that's a new feature in Python 3.6. Although it's legitimate to use this syntax because Apache Spark supports Python 3.6+, this breaks unofficial support of Python 3.5. This specific f-string format looks something unnecessary, and doesn't look worth enough to remove such unofficial support because of one string format in an error message. NOTE that this PR doesn't mean that we're maintaining Python 3.5 since we dropped. It just looks like too much to remove that unofficial support only because of one string format and error message. ### Why are the changes needed? To keep unofficial Python 3.5 support ### Does this PR introduce _any_ user-facing change? Officially nope. ### How was this patch tested? Ran the linters. Closes #32598 from HyukjinKwon/SPARK-35408=followup. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-20 10:47:31 +09:00
Takuya UESHIN	d44e6c7f10	Revert "[SPARK-35338][PYTHON] Separate arithmetic operations into data type based structures" This reverts commit `d1b24d8aba`.	2021-05-19 16:49:47 -07:00
Xinrong Meng	d1b24d8aba	[SPARK-35338][PYTHON] Separate arithmetic operations into data type based structures ### What changes were proposed in this pull request? The PR is proposed for pandas APIs on Spark, in order to separate arithmetic operations shown as below into data-type-based structures. `__add__, __sub__, __mul__, __truediv__, __floordiv__, __pow__, __mod__, __radd__, __rsub__, __rmul__, __rtruediv__, __rfloordiv__, __rpow__,__rmod__` DataTypeOps and subclasses are introduced. The existing behaviors of each arithmetic operation should be preserved. ### Why are the changes needed? Currently, the same arithmetic operation of all data types is defined in one function, so it’s difficult to extend the behavior change based on the data types. Introducing DataTypeOps would be the foundation for [pandas APIs on Spark: Separate basic operations into data type based structures.](https://docs.google.com/document/d/12MS6xK0hETYmrcl5b9pX5lgV4FmGVfpmcSKq--_oQlc/edit?usp=sharing). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tests are introduced under pyspark.pandas.tests.data_type_ops. One test file per DataTypeOps class. Closes #32469 from xinrong-databricks/datatypeop_arith. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-05-19 15:05:32 -07:00
Kousuke Saruta	9283bebbbd	[SPARK-35418][SQL] Add sentences function to functions.{scala,py} ### What changes were proposed in this pull request? This PR adds `sentences`, a string function, which is present as of `2.0.0` but missing in `functions.{scala,py}`. ### Why are the changes needed? This function can be only used from SQL for now. It's good if we can use this function from Scala/Python code as well as SQL. ### Does this PR introduce _any_ user-facing change? Yes. Users can use this function from Scala and Python. ### How was this patch tested? New test. Closes #32566 from sarutak/sentences-function. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>	2021-05-19 20:07:28 +09:00
Hyukjin Kwon	747fe7282c	[SPARK-35419][PYTHON] Enable spark.sql.execution.pyspark.udf.simplifiedTraceback.enabled by default ### What changes were proposed in this pull request? https://github.com/apache/spark/pull/30309 added a configuration (disabled by default) that simplifies the error messages from Python UDFs by removing the internal stack trace from Python workers: ```python from pyspark.sql.functions import udf; spark.range(10).select(udf(lambda x: x/0)("id")).collect() ``` Before ``` Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../python/pyspark/sql/dataframe.py", line 427, in show print(self._jdf.showString(n, 20, vertical)) File "/.../python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__ File "/.../python/pyspark/sql/utils.py", line 127, in deco raise_from(converted) File "<string>", line 3, in raise_from pyspark.sql.utils.PythonException: An exception was thrown from Python worker in the executor: Traceback (most recent call last): File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 605, in main process() File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 597, in process serializer.dump_stream(out_iter, outfile) File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream for obj in iterator: File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched for item in iterator: File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper result = tuple(f([a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr> result = tuple(f([a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda> return lambda a: f(a) File "/.../python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper return f(*args, **kwargs) File "<stdin>", line 1, in <lambda> ZeroDivisionError: division by zero ``` After ``` Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../python/pyspark/sql/dataframe.py", line 427, in show print(self._jdf.showString(n, 20, vertical)) File "/.../python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__ File "/.../python/pyspark/sql/utils.py", line 127, in deco raise_from(converted) File "<string>", line 3, in raise_from pyspark.sql.utils.PythonException: An exception was thrown from Python worker in the executor: Traceback (most recent call last): File "<stdin>", line 1, in <lambda> ZeroDivisionError: division by zero ``` Note that the traceback (`return f(*args, **kwargs)`) is almost always the same; I would say in more than 99% of cases. For the 1% case, we can guide developers to enable this configuration for further debugging. In Databricks, it has been enabled for around 6 months, and I have had zero negative feedback on it. ### Why are the changes needed? To show simplified exception messages to end users. ### Does this PR introduce _any_ user-facing change? Yes, it will hide the internal Python worker traceback. ### How was this patch tested? Existing test cases should cover. Closes #32569 from HyukjinKwon/SPARK-35419. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-18 12:27:09 +09:00
Takuya UESHIN	2a335f2d7d	[SPARK-34941][PYTHON] Fix mypy errors and enable mypy check for pandas-on-Spark ### What changes were proposed in this pull request? Fixes `mypy` errors and enables `mypy` check for pandas-on-Spark. ### Why are the changes needed? The `mypy` check for pandas-on-Spark was disabled when the initial porting. It should be enabled again; otherwise we will miss type checking errors. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? The enabled `mypy` check and existing unit tests should pass. Closes #32540 from ueshin/issues/SPARK-34941/pandas_mypy. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-05-17 10:46:59 -07:00
Gera Shegalov	9eb45ecb4f	[SPARK-35408][PYTHON] Improve parameter validation in DataFrame.show ### What changes were proposed in this pull request? Provide a clearer error message tied to the user's Python code if incorrect parameters are passed to `DataFrame.show`, rather than the message about a missing JVM method that the user is not calling directly. ``` py4j.Py4JException: Method showString([class java.lang.Boolean, class java.lang.Integer, class java.lang.Boolean]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326) at py4j.Gateway.invoke(Gateway.java:274) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) ``` ### Why are the changes needed? For faster debugging through actionable error messages. ### Does this PR introduce _any_ user-facing change? No change for correct parameters, but different error messages for parameters that trigger an exception. ### How was this patch tested? - unit test - manually in PySpark REPL Closes #32555 from gerashegalov/df_show_validation. Authored-by: Gera Shegalov <gera@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-17 16:22:46 +09:00
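Validating arguments on the Python side, before any JVM call is attempted, is what lets the error point at the caller's own code. The sketch below assumes a standalone helper; the name `validate_show_args`, its defaults, and the error messages are hypothetical and not necessarily those used in PySpark:

```python
def validate_show_args(n=20, truncate=True, vertical=False):
    """Hypothetical pre-check mirroring the parameters of DataFrame.show."""
    # bool is a subclass of int, so it has to be rejected explicitly for n.
    if not isinstance(n, int) or isinstance(n, bool):
        raise TypeError("Parameter 'n' (number of rows) must be an int")
    if not isinstance(vertical, bool):
        raise TypeError("Parameter 'vertical' must be a bool")
    if not isinstance(truncate, (bool, int)):
        raise TypeError("Parameter 'truncate' must be a bool or an int")
    return n, truncate, vertical


ok = validate_show_args(5)

# Passing a bool where the row count is expected fails early with a clear message.
try:
    validate_show_args(True)
    message = None
except TypeError as err:
    message = str(err)

print(ok)
print(message)
```

The failing call corresponds to the stack trace in the commit message, where a `java.lang.Boolean` reached `showString` in the position of the row count.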
Sean Owen	a37cce95c2	[MINOR][DOCS] Add required imports to CV, train validation split Pyspark ML examples ### What changes were proposed in this pull request? Add required imports to the PySpark ML examples in CrossValidator and TrainValidationSplit. ### Why are the changes needed? The examples pass doctests because of previous imports, but as they appear in the PySpark documentation, they are incomplete. The additional imports are required to make the examples work. ### Does this PR introduce _any_ user-facing change? No, docs only change. ### How was this patch tested? Existing tests. Closes #32554 from srowen/TuningImports. Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-05-15 08:13:54 -05:00
Ruifeng Zheng	f7704ece40	[SPARK-35392][ML][PYTHON] Fix flaky tests in ml/clustering.py and ml/feature.py ### What changes were proposed in this pull request? This PR removes the check of `summary.logLikelihood` in ml/clustering.py, because this GMM test is quite flaky. It fails easily, e.g., if we: - change the number of partitions; - just change the way to compute the sum of weights; - change the underlying BLAS impl. It also uses a more permissive precision in the `Word2Vec` test case. ### Why are the changes needed? To recover the build and tests. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing test cases. Closes #32533 from zhengruifeng/SPARK_35392_disable_flaky_gmm_test. Lead-authored-by: Ruifeng Zheng <ruifengz@foxmail.com> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-13 22:23:51 +09:00
Takuya UESHIN	17b59a9970	[SPARK-35382][PYTHON] Fix lambda variable name issues in nested DataFrame functions in Python APIs ### What changes were proposed in this pull request? This PR fixes the same issue as #32424. ```py from pyspark.sql.functions import flatten, struct, transform df = spark.sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') as letters") df.select(flatten( transform( "numbers", lambda number: transform( "letters", lambda letter: struct(number.alias("n"), letter.alias("l")) ) ) ).alias("zipped")).show(truncate=False) ``` Before: ``` +------------------------------------------------------------------------+ \|zipped \| +------------------------------------------------------------------------+ \|[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]\| +------------------------------------------------------------------------+ ``` After: ``` +------------------------------------------------------------------------+ \|zipped \| +------------------------------------------------------------------------+ \|[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]\| +------------------------------------------------------------------------+ ``` ### Why are the changes needed? To produce the correct results. ### Does this PR introduce _any_ user-facing change? Yes, it fixes the results to be correct as mentioned above. ### How was this patch tested? Added a unit test and also tested manually. Closes #32523 from ueshin/issues/SPARK-35382/nested_higher_order_functions. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-13 14:58:01 +09:00
Sean Owen	a189be8754	[MINOR][DOCS] Avoid some python docs where first sentence has "e.g." or similar ### What changes were proposed in this pull request? Avoid some python docs where first sentence has "e.g." or similar as the period causes the docs to show only half of the first sentence as the summary. ### Why are the changes needed? See for example https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.LinearRegressionModel.html?highlight=linearregressionmodel#pyspark.ml.regression.LinearRegressionModel.summary where the method description is clearly truncated. ### Does this PR introduce _any_ user-facing change? Only changes docs. ### How was this patch tested? Manual testing of docs. Closes #32508 from srowen/TruncatedPythonDesc. Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-12 10:38:59 +09:00
Xinrong Meng	5ecb112410	[SPARK-35300][PYTHON][DOCS] Standardize module names in install.rst ### What changes were proposed in this pull request? Use full names of modules in `install.rst` when specifying dependencies. ### Why are the changes needed? Using full names makes it more clear. In addition, `pandas APIs on Spark` as a new module can start to be recognized by more people. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual verification. Closes #32427 from xinrong-databricks/nameDoc. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-05-04 11:02:57 +09:00
Xinrong Meng	120c389b00	[SPARK-34887][PYTHON] Port Koalas dependencies into PySpark ### What changes were proposed in this pull request? Port Koalas dependencies appropriately to PySpark dependencies. ### Why are the changes needed? pandas-on-Spark has its own required dependency and optional dependencies. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. Closes #32386 from xinrong-databricks/portDeps. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-05-04 09:04:23 +09:00
garawalid	176218b6b8	[SPARK-35292][PYTHON] Delete redundant parameter in mypy configuration ### What changes were proposed in this pull request? The parameter no_implicit_optional is defined twice in the mypy configuration, on [line 20](https://github.com/apache/spark/blob/master/python/mypy.ini#L20) and on line 105. ### Why are the changes needed? We would like to keep the mypy configuration clean. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This patch can be tested with `dev/lint-python`. Closes #32418 from garawalid/feature/clean-mypy-config. Authored-by: garawalid <gwalid94@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-05-04 09:01:34 +09:00
HyukjinKwon	8aaa9e890a	[SPARK-35250][SQL][DOCS] Fix duplicated STOP_AT_DELIMITER to SKIP_VALUE at CSV's unescapedQuoteHandling option documentation ### What changes were proposed in this pull request? This is rather a followup of https://github.com/apache/spark/pull/30518 that should be ported back to `branch-3.1` too. `STOP_AT_DELIMITER` was mistakenly used twice. The duplicated `STOP_AT_DELIMITER` should be `SKIP_VALUE` in the documentation. ### Why are the changes needed? To correctly document. ### Does this PR introduce _any_ user-facing change? Yes, it fixes the user-facing documentation. ### How was this patch tested? I checked them via running linters. Closes #32423 from HyukjinKwon/SPARK-35250. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-05-04 08:44:18 +09:00
Yikun Jiang	44b7931936	[SPARK-35176][PYTHON] Standardize input validation error type ### What changes were proposed in this pull request? This PR corrects some exception types raised when function input parameters fail validation because they have the wrong type. To make the review easier, this PR contains 3 commits: - Standardize input validation error type on sql - Standardize input validation error type on ml - Standardize input validation error type on pandas ### Why are the changes needed? The Python exception documentation [1] describes TypeError as "Raised when an operation or function is applied to an object of inappropriate type.", but many places in the PySpark code raise ValueError in this situation; this patch fixes them. [1] https://docs.python.org/3/library/exceptions.html#TypeError Note that this patch only addresses some existing wrong exception types for input validation; the input validation decorator/framework mentioned in [SPARK-35176](https://issues.apache.org/jira/browse/SPARK-35176) will be submitted in a separate patch. ### Does this PR introduce _any_ user-facing change? Yes, code now raises the correct TypeError instead of ValueError. ### How was this patch tested? Existing test cases and UT. Closes #32368 from Yikun/SPARK-35176. Authored-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-05-03 15:34:24 +09:00
Yikun Jiang	0769049ee1	[SPARK-34979][PYTHON][DOC] Add PyArrow installation note for PySpark aarch64 user ### What changes were proposed in this pull request? This patch adds a note for aarch64 users to install pyarrow>=4.0.0. ### Why are the changes needed? The pyarrow aarch64 support was [introduced](https://github.com/apache/arrow/pull/9285) in [PyArrow 4.0.0](https://github.com/apache/arrow/releases/tag/apache-arrow-4.0.0), which was published on 27 Apr 2021. See more in [SPARK-34979](https://issues.apache.org/jira/browse/SPARK-34979). ### Does this PR introduce _any_ user-facing change? Yes, this doc can help users install pyarrow on aarch64. ### How was this patch tested? The doc test passed. Closes #32363 from Yikun/SPARK-34979. Authored-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2021-04-28 09:56:17 +09:00
Ludovic Henry	5b77ebb57b	[SPARK-35150][ML] Accelerate fallback BLAS with dev.ludovic.netlib ### What changes were proposed in this pull request? Following https://github.com/apache/spark/pull/30810, I've continued looking for ways to accelerate the usage of BLAS in Spark. With this PR, I integrate work done in the [`dev.ludovic.netlib`](https://github.com/luhenry/netlib/) Maven package. The `dev.ludovic.netlib` library wraps the original `com.github.fommil.netlib` library and focuses on accelerating the linear algebra routines in use in Spark. When running the `org.apache.spark.ml.linalg.BLASBenchmark` benchmarking suite, I get the results at [1] on an Intel machine. Moreover, this library is thoroughly tested to return the exact same results as the reference implementation. Under the hood, it reimplements the necessary algorithms in pure autovectorization-friendly Java 8, and takes advantage of the Vector API and Foreign Linker API introduced in JDK 16 when available. A table summarising which version gets loaded in which case: ``` \| \| BLAS.nativeBLAS \| BLAS.javaBLAS \| \| --------------------- \| -------------------------------------------------- \| -------------------------------------------------- \| \| with -Pnetlib-lgpl \| 1. dev.ludovic.netlib.blas.NetlibNativeBLAS, a \| 1. dev.ludovic.netlib.blas.VectorizedBLAS \| \| \| wrapper for com.github.fommil:all \| (JDK16+, relies on the Vector API, requires \| \| \| 2. dev.ludovic.netlib.blas.ForeignBLAS (JDK16+, \| `--add-modules=jdk.incubator.vector` on JDK16) \| \| \| relies on the Foreign Linker API, requires \| 2. dev.ludovic.netlib.blas.Java11BLAS (JDK11+) \| \| \| `--add-modules=jdk.incubator.foreign \| 3. dev.ludovic.netlib.blas.JavaBLAS \| \| \| -Dforeign.restricted=warn`) \| 4. dev.ludovic.netlib.blas.NetlibF2jBLAS, a \| \| \| 3. 
fails to load, falls back to BLAS.javaBLAS in \| wrapper for com.github.fommil:core \| \| \| org.apache.spark.ml.linalg.BLAS \| \| \| --------------------- \| -------------------------------------------------- \| -------------------------------------------------- \| \| without -Pnetlib-lgpl \| 1. dev.ludovic.netlib.blas.ForeignBLAS (JDK16+, \| 1. dev.ludovic.netlib.blas.VectorizedBLAS \| \| \| relies on the Foreign Linker API, requires \| (JDK16+, relies on the Vector API, requires \| \| \| `--add-modules=jdk.incubator.foreign \| `--add-modules=jdk.incubator.vector` on JDK16) \| \| \| -Dforeign.restricted=warn`) \| 2. dev.ludovic.netlib.blas.Java11BLAS (JDK11+) \| \| \| 2. fails to load, falls back to BLAS.javaBLAS in \| 3. dev.ludovic.netlib.blas.JavaBLAS \| \| \| org.apache.spark.ml.linalg.BLAS \| 4. dev.ludovic.netlib.blas.NetlibF2jBLAS, a \| \| \| \| wrapper for com.github.fommil:core \| \| --------------------- \| -------------------------------------------------- \| -------------------------------------------------- \| ``` ### Why are the changes needed? Accelerates linear algebra operations when the pure-java fallback method is in use. Transparently falls back to native implementation (OpenBLAS, MKL) when available. ### Does this PR introduce _any_ user-facing change? No, all changes are transparent to the user. ### How was this patch tested? The `dev.ludovic.netlib` library has its own test suite [2]. It has also been validated by running the Spark test suite and benchmarking suite. 
[1] Results for `org.apache.spark.ml.linalg.BLASBenchmark`: #### JDK8: ``` [info] OpenJDK 64-Bit Server VM 1.8.0_292-b10 on Linux 5.8.0-50-generic [info] Intel(R) Xeon(R) E-2276G CPU 3.80GHz [info] [info] f2jBLAS = dev.ludovic.netlib.blas.NetlibF2jBLAS [info] javaBLAS = dev.ludovic.netlib.blas.Java8BLAS [info] nativeBLAS = dev.ludovic.netlib.blas.Java8BLAS [info] [info] daxpy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 223 232 8 448.0 2.2 1.0X [info] java 221 228 7 453.0 2.2 1.0X [info] [info] saxpy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 122 128 4 821.2 1.2 1.0X [info] java 122 128 4 822.3 1.2 1.0X [info] [info] ddot: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 109 112 2 921.4 1.1 1.0X [info] java 70 74 3 1423.5 0.7 1.5X [info] [info] sdot: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 96 98 2 1046.1 1.0 1.0X [info] java 47 49 2 2121.7 0.5 2.0X [info] [info] dscal: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 184 195 8 544.3 1.8 1.0X [info] java 185 196 7 539.5 1.9 1.0X [info] [info] sscal: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] 
------------------------------------------------------------------------------------------------------------------------ [info] f2j 99 104 4 1011.9 1.0 1.0X [info] java 99 104 4 1010.4 1.0 1.0X [info] [info] dspmv[U]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 947.2 1.1 1.0X [info] java 0 0 0 1584.8 0.6 1.7X [info] [info] dspr[U]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 867.4 1.2 1.0X [info] java 1 1 0 865.0 1.2 1.0X [info] [info] dsyr[U]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 485.9 2.1 1.0X [info] java 1 1 0 486.8 2.1 1.0X [info] [info] dgemv[N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 1843.0 0.5 1.0X [info] java 0 0 0 2690.6 0.4 1.5X [info] [info] dgemv[T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 1214.7 0.8 1.0X [info] java 0 0 0 2536.8 0.4 2.1X [info] [info] sgemv[N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 1895.9 0.5 1.0X [info] java 0 0 0 2961.1 0.3 1.6X [info] [info] sgemv[T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) 
Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 1223.4 0.8 1.0X [info] java 0 0 0 3091.4 0.3 2.5X [info] [info] dgemm[N,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 560 575 20 1787.1 0.6 1.0X [info] java 226 232 5 4432.4 0.2 2.5X [info] [info] dgemm[N,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 570 586 23 1755.2 0.6 1.0X [info] java 227 232 4 4410.1 0.2 2.5X [info] [info] dgemm[T,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 863 879 17 1158.4 0.9 1.0X [info] java 227 231 3 4407.9 0.2 3.8X [info] [info] dgemm[T,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1282 1305 23 780.0 1.3 1.0X [info] java 227 232 4 4413.4 0.2 5.7X [info] [info] sgemm[N,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 538 548 8 1858.6 0.5 1.0X [info] java 221 226 3 4521.1 0.2 2.4X [info] [info] sgemm[N,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 549 558 10 1819.9 0.5 1.0X [info] java 222 229 7 4503.5 0.2 2.5X [info] 
[info] sgemm[T,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 838 852 12 1193.0 0.8 1.0X [info] java 222 229 5 4500.5 0.2 3.8X [info] [info] sgemm[T,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 905 919 18 1104.8 0.9 1.0X [info] java 221 228 5 4521.3 0.2 4.1X ``` #### JDK11: ``` [info] OpenJDK 64-Bit Server VM 11.0.11+9-LTS on Linux 5.8.0-50-generic [info] Intel(R) Xeon(R) E-2276G CPU 3.80GHz [info] [info] f2jBLAS = dev.ludovic.netlib.blas.NetlibF2jBLAS [info] javaBLAS = dev.ludovic.netlib.blas.Java11BLAS [info] nativeBLAS = dev.ludovic.netlib.blas.Java11BLAS [info] [info] daxpy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 195 204 10 512.7 2.0 1.0X [info] java 195 202 7 512.4 2.0 1.0X [info] [info] saxpy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 108 113 4 923.3 1.1 1.0X [info] java 102 107 4 984.4 1.0 1.1X [info] [info] ddot: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 107 110 3 938.1 1.1 1.0X [info] java 69 72 3 1447.1 0.7 1.5X [info] [info] sdot: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 96 98 2 
1046.5 1.0 1.0X [info] java 43 45 2 2317.1 0.4 2.2X [info] [info] dscal: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 155 168 8 644.2 1.6 1.0X [info] java 158 169 8 632.8 1.6 1.0X [info] [info] sscal: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 85 90 4 1178.1 0.8 1.0X [info] java 86 90 4 1167.7 0.9 1.0X [info] [info] dspmv[U]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 0 0 0 1182.1 0.8 1.0X [info] java 0 0 0 1432.1 0.7 1.2X [info] [info] dspr[U]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 898.7 1.1 1.0X [info] java 1 1 0 891.5 1.1 1.0X [info] [info] dsyr[U]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 495.4 2.0 1.0X [info] java 1 1 0 495.7 2.0 1.0X [info] [info] dgemv[N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 0 0 0 2271.6 0.4 1.0X [info] java 0 0 0 3648.1 0.3 1.6X [info] [info] dgemv[T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] 
f2j 1 1 0 1229.3 0.8 1.0X [info] java 0 0 0 2711.3 0.4 2.2X [info] [info] sgemv[N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 0 0 0 2677.5 0.4 1.0X [info] java 0 0 0 3288.2 0.3 1.2X [info] [info] sgemv[T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 1233.0 0.8 1.0X [info] java 0 0 0 2766.3 0.4 2.2X [info] [info] dgemm[N,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 520 536 16 1923.6 0.5 1.0X [info] java 214 221 7 4669.5 0.2 2.4X [info] [info] dgemm[N,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 593 612 17 1686.5 0.6 1.0X [info] java 215 219 3 4643.3 0.2 2.8X [info] [info] dgemm[T,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 853 870 16 1172.8 0.9 1.0X [info] java 215 218 3 4659.7 0.2 4.0X [info] [info] dgemm[T,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1350 1370 23 740.8 1.3 1.0X [info] java 215 219 4 4656.6 0.2 6.3X [info] [info] sgemm[N,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] 
------------------------------------------------------------------------------------------------------------------------ [info] f2j 460 468 6 2173.2 0.5 1.0X [info] java 210 213 2 4752.7 0.2 2.2X [info] [info] sgemm[N,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 535 544 8 1869.3 0.5 1.0X [info] java 210 215 5 4761.8 0.2 2.5X [info] [info] sgemm[T,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 843 853 11 1186.8 0.8 1.0X [info] java 209 214 4 4793.4 0.2 4.0X [info] [info] sgemm[T,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 891 904 15 1122.0 0.9 1.0X [info] java 209 214 4 4777.2 0.2 4.3X ``` #### JDK16: ``` [info] OpenJDK 64-Bit Server VM 16+36 on Linux 5.8.0-50-generic [info] Intel(R) Xeon(R) E-2276G CPU 3.80GHz [info] [info] f2jBLAS = dev.ludovic.netlib.blas.NetlibF2jBLAS [info] javaBLAS = dev.ludovic.netlib.blas.VectorizedBLAS [info] nativeBLAS = dev.ludovic.netlib.blas.VectorizedBLAS [info] [info] daxpy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 194 199 7 515.7 1.9 1.0X [info] java 181 186 3 551.1 1.8 1.1X [info] [info] saxpy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 109 115 4 915.0 1.1 1.0X [info] java 88 92 3 1138.8 0.9 1.2X [info] [info] ddot: Best 
Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 108 110 2 922.6 1.1 1.0X [info] java 54 56 2 1839.2 0.5 2.0X [info] [info] sdot: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 96 97 2 1046.1 1.0 1.0X [info] java 29 30 1 3393.4 0.3 3.2X [info] [info] dscal: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 156 165 5 643.0 1.6 1.0X [info] java 150 159 5 667.1 1.5 1.0X [info] [info] sscal: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 85 91 6 1171.0 0.9 1.0X [info] java 75 79 3 1340.6 0.7 1.1X [info] [info] dspmv[U]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 917.0 1.1 1.0X [info] java 0 0 0 8147.2 0.1 8.9X [info] [info] dspr[U]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 859.3 1.2 1.0X [info] java 1 1 0 859.3 1.2 1.0X [info] [info] dsyr[U]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 482.1 2.1 1.0X [info] java 1 1 0 482.6 2.1 1.0X [info] [info] 
dgemv[N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 0 0 0 2214.2 0.5 1.0X [info] java 0 0 0 7975.8 0.1 3.6X [info] [info] dgemv[T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 1231.4 0.8 1.0X [info] java 0 0 0 8680.9 0.1 7.0X [info] [info] sgemv[N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 0 0 0 2684.3 0.4 1.0X [info] java 0 0 0 18527.1 0.1 6.9X [info] [info] sgemv[T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 1235.4 0.8 1.0X [info] java 0 0 0 17347.9 0.1 14.0X [info] [info] dgemm[N,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 530 552 18 1887.5 0.5 1.0X [info] java 58 64 3 17143.9 0.1 9.1X [info] [info] dgemm[N,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 598 620 17 1671.1 0.6 1.0X [info] java 58 64 3 17196.6 0.1 10.3X [info] [info] dgemm[T,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 834 847 14 1199.4 0.8 1.0X [info] 
java 57 63 4 17486.9 0.1 14.6X [info] [info] dgemm[T,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1338 1366 22 747.3 1.3 1.0X [info] java 58 63 3 17356.6 0.1 23.2X [info] [info] sgemm[N,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 489 501 9 2045.5 0.5 1.0X [info] java 36 38 2 27721.9 0.0 13.6X [info] [info] sgemm[N,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 478 488 9 2094.0 0.5 1.0X [info] java 36 38 2 27813.2 0.0 13.3X [info] [info] sgemm[T,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 825 837 10 1211.6 0.8 1.0X [info] java 35 38 2 28433.1 0.0 23.5X [info] [info] sgemm[T,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 900 918 15 1111.6 0.9 1.0X [info] java 36 38 2 28073.0 0.0 25.3X ``` [2] https://github.com/luhenry/netlib/tree/master/blas/src/test/java/dev/ludovic/netlib/blas Closes #32253 from luhenry/master. Authored-by: Ludovic Henry <git@ludovic.dev> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-04-27 14:00:59 -05:00
Julien Lafaye	592230e47b	[MINOR][DOCS][ML] Explicit return type of array_to_vector utility function There are two types of dense vectors: * pyspark.ml.linalg.DenseVector * pyspark.mllib.linalg.DenseVector In spark-3.1.1, array_to_vector returns instances of pyspark.ml.linalg.DenseVector. The documentation is ambiguous & can lead to the false conclusion that instances of pyspark.mllib.linalg.DenseVector will be returned. Conversion from ml versions to mllib versions can easily be achieved with the MLUtils.convertVectorColumnsFromML helper. ### What changes were proposed in this pull request? Make documentation more explicit ### Why are the changes needed? The documentation is a bit misleading and users can lose time investigating & realizing there are two DenseVector types. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? No tests were run as only the documentation was changed Closes #32255 from jlafaye/master. Authored-by: Julien Lafaye <jlafaye@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-04-27 09:08:26 -05:00
Ruifeng Zheng	1f150b9392	[SPARK-35024][ML] Refactor LinearSVC - support virtual centering ### What changes were proposed in this pull request? 1. Remove the existing agg and use a new agg supporting virtual centering 2. Add related test suites ### Why are the changes needed? Centering vectors should accelerate convergence and generate a solution closer to R's ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Updated existing test suites and added new ones Closes #32124 from zhengruifeng/svc_agg_refactor. Authored-by: Ruifeng Zheng <ruifengz@foxmail.com> Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>	2021-04-25 13:16:46 +08:00
Xinrong Meng	4fcbf59079	[SPARK-35040][PYTHON] Remove Spark-version related codes from test codes ### What changes were proposed in this pull request? Removes PySpark version dependent codes from pyspark.pandas test codes. ### Why are the changes needed? There are several places to check the PySpark version and switch the logic, but now those are not necessary. We should remove them. We will do the same thing after we finish porting tests. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #32300 from xinrong-databricks/port.rmv_spark_version_chk_in_tests. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-04-22 18:01:07 -07:00
Xinrong Meng	4d2b559d92	[SPARK-34999][PYTHON] Consolidate PySpark testing utils ### What changes were proposed in this pull request? Consolidate PySpark testing utils by removing `python/pyspark/pandas/testing`, and then creating a file `pandasutils` under `python/pyspark/testing` for test utilities used in `pyspark/pandas`. ### Why are the changes needed? `python/pyspark/pandas/testing` holds test utilities for pandas-on-Spark, and `python/pyspark/testing` contains test utilities for PySpark. Consolidating them makes the code cleaner and easier to maintain. Updated import statements are as shown below: - from pyspark.testing.sqlutils import SQLTestUtils - from pyspark.testing.pandasutils import PandasOnSparkTestCase, TestUtils (PandasOnSparkTestCase is the original ReusedSQLTestCase in `python/pyspark/pandas/testing/utils.py`) Minor improvements include: - Use of the missing library's requirement_message - `except ImportError` rather than `except` - alias `pyspark.pandas` as `ps` rather than `pp` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit tests under python/pyspark/pandas/tests. Closes #32177 from xinrong-databricks/port.merge_utils. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-04-22 13:07:35 -07:00
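The `except ImportError` and `requirement_message` improvements listed in the commit above can be illustrated with a minimal, standard-library-only sketch. The module name `hypothetical_plotting_lib` and the names `have_plotting_lib`, `plotting_requirement_message`, `PlotTest`, and `run_plot_tests` are assumptions made up for this example; they are not taken from `pyspark.testing`.

```python
import unittest

# Guard an optional dependency. Catching ImportError (rather than using a
# bare `except`) lets unrelated errors such as KeyboardInterrupt propagate.
try:
    import hypothetical_plotting_lib  # hypothetical optional dependency
    have_plotting_lib = True
    plotting_requirement_message = None
except ImportError as e:
    have_plotting_lib = False
    # Keep the original message so that a skipped test explains why.
    plotting_requirement_message = str(e)


class PlotTest(unittest.TestCase):
    @unittest.skipIf(not have_plotting_lib, plotting_requirement_message)
    def test_plot(self):
        # Only runs when the optional dependency could be imported.
        self.assertTrue(hasattr(hypothetical_plotting_lib, "__name__"))


def run_plot_tests():
    """Run PlotTest and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(PlotTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

When the dependency is absent, the test is reported as skipped and the skip reason carries the text of the original `ImportError`.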
harupy	b6350f5bb0	[SPARK-35142][PYTHON][ML] Fix incorrect return type for `rawPredictionUDF` in `OneVsRestModel` ### What changes were proposed in this pull request? Fixes incorrect return type for `rawPredictionUDF` in `OneVsRestModel`. ### Why are the changes needed? Bugfix ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. Closes #32245 from harupy/SPARK-35142. Authored-by: harupy <17039389+harupy@users.noreply.github.com> Signed-off-by: Weichen Xu <weichen.xu@databricks.com>	2021-04-21 16:29:10 +08:00
itholic	91bd38467e	[SPARK-34995] Port/integrate Koalas remaining codes into PySpark ### What changes were proposed in this pull request? There are some more changes in Koalas such as [databricks/koalas#2141](`c8f803d6be`), [databricks/koalas#2143](`913d68868d`) after the main code porting, this PR is to synchronize those changes with the `pyspark.pandas`. ### Why are the changes needed? We should port the whole Koalas codes into PySpark and synchronize them. ### Does this PR introduce _any_ user-facing change? Fixed some incompatible behavior with pandas 1.2.0 and added more to the `to_markdown` docstring. ### How was this patch tested? Manually tested in local. Closes #32197 from itholic/SPARK-34995-fix. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-04-16 17:42:03 +09:00
Xinrong Meng	4aee19efb4	[SPARK-35032][PYTHON] Port Koalas Index unit tests into PySpark ### What changes were proposed in this pull request? Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas Index unit tests to PySpark. ### Why are the changes needed? Currently, the pandas-on-Spark modules are not tested fully. We should enable the Index unit tests. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Enable Index unit tests. Closes #32139 from xinrong-databricks/port.indexes_tests. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-04-16 08:53:30 +09:00
HyukjinKwon	637f59360b	Revert "[SPARK-34995] Port/integrate Koalas remaining codes into PySpark" This reverts commit `9689c44b60`.	2021-04-15 21:01:47 +09:00
itholic	9689c44b60	[SPARK-34995] Port/integrate Koalas remaining codes into PySpark ### What changes were proposed in this pull request? There are some more changes in Koalas such as [databricks/koalas#2141](`c8f803d6be`), [databricks/koalas#2143](`913d68868d`) after the main code porting, this PR is to synchronize those changes with the `pyspark.pandas`. ### Why are the changes needed? We should port the whole Koalas codes into PySpark and synchronize them. ### Does this PR introduce _any_ user-facing change? Fixed some incompatible behavior with pandas 1.2.0 and added more to the `to_markdown` docstring. ### How was this patch tested? Manually tested in local. Closes #32154 from itholic/SPARK-34995. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-04-15 19:13:08 +09:00
HyukjinKwon	7ff9d2e3ee	[SPARK-35071][PYTHON] Rename Koalas to pandas-on-Spark in main codes ### What changes were proposed in this pull request? This PR proposes to rename Koalas to pandas-on-Spark in main codes ### Why are the changes needed? To have the correct name in PySpark. NOTE that the official name in the main documentation will be pandas APIs on Spark to be extra clear. pandas-on-Spark is not the official term. ### Does this PR introduce _any_ user-facing change? No, it's master-only change. It changes the docstring and class names. ### How was this patch tested? Manually tested via: ```bash ./python/run-tests --python-executable=python3 --modules pyspark-pandas ``` Closes #32166 from HyukjinKwon/rename-koalas. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-04-15 12:48:59 +09:00
xinrong-databricks	58feb85145	[SPARK-35034][PYTHON] Port Koalas miscellaneous unit tests into PySpark ### What changes were proposed in this pull request? Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas miscellaneous unit tests to PySpark. ### Why are the changes needed? Currently, the pandas-on-Spark modules are not tested fully. We should enable miscellaneous unit tests. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Enable miscellaneous unit tests. Closes #32152 from xinrong-databricks/port.misc_tests. Lead-authored-by: xinrong-databricks <47337188+xinrong-databricks@users.noreply.github.com> Co-authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-04-15 11:45:15 +09:00
Yikun Jiang	31555f7779	[SPARK-34630][PYTHON][FOLLOWUP] Add __version__ into pyspark init __all__ ### What changes were proposed in this pull request? This patch adds `__version__` to pyspark.__init__.__all__ so that `__version__` is exported explicitly; see more in https://github.com/apache/spark/pull/32110#issuecomment-817331896 ### Why are the changes needed? 1. export `__version__` explicitly 2. clean up `noqa: F401` on `__version__` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Python related CI passed Closes #32125 from Yikun/SPARK-34629-Follow. Authored-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: zero323 <mszymkiewicz@gmail.com>	2021-04-14 23:36:25 +02:00
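As a rough illustration of what listing `__version__` in `__all__` changes, the sketch below builds a throwaway module and performs a real `from ... import *` against it. The module name `demo_pkg` and the helper `star_import_names` are hypothetical and exist only for this example; this is not the `pyspark` package itself.

```python
import sys
import types


def star_import_names(source, module_name="demo_pkg"):
    """Create a temporary module from `source` and return the set of names
    that `from module_name import *` binds in a fresh namespace."""
    module = types.ModuleType(module_name)
    exec(source, module.__dict__)
    sys.modules[module_name] = module
    try:
        namespace = {}
        exec(f"from {module_name} import *", namespace)
    finally:
        # Do not leave the temporary module registered.
        del sys.modules[module_name]
    namespace.pop("__builtins__", None)  # added implicitly by exec()
    return set(namespace)


# Without __all__, names starting with an underscore are not star-imported.
without_all = star_import_names("__version__ = '0.0.1'\nvalue = 1\n")

# With __all__, exactly the listed names are exported, including __version__.
with_all = star_import_names(
    "__version__ = '0.0.1'\nvalue = 1\n__all__ = ['value', '__version__']\n"
)
```

Linters such as flake8 also treat names listed in `__all__` as used, which is presumably why the `noqa: F401` marker could be cleaned up.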
Takuya UESHIN	4ae57d5b3a	[SPARK-35039][PYTHON] Remove PySpark version dependent codes ### What changes were proposed in this pull request? Removes PySpark version dependent codes from `pyspark.pandas` main codes. ### Why are the changes needed? There are several places to check the PySpark version and switch the logic, but now those are not necessary. We should remove them. We will do the same thing after we finish porting tests. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #32138 from ueshin/issues/SPARK-35039/pyspark_version. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-04-14 14:30:48 +09:00
Xinrong Meng	47d62af2a9	[SPARK-35035][PYTHON] Port Koalas internal implementation unit tests into PySpark ### What changes were proposed in this pull request? Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas internal implementation unit tests to PySpark. ### Why are the changes needed? Currently, the pandas-on-Spark modules are not tested fully. We should enable the internal implementation unit tests. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Enable internal implementation unit tests. Closes #32137 from xinrong-databricks/port.test_internal_impl. Lead-authored-by: Xinrong Meng <xinrong.meng@databricks.com> Co-authored-by: xinrong-databricks <47337188+xinrong-databricks@users.noreply.github.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-04-14 13:59:33 +09:00
Xinrong Meng	cd1e8e8158	[SPARK-35033][PYTHON] Port Koalas plot unit tests into PySpark ### What changes were proposed in this pull request? Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas plot unit tests to PySpark. ### Why are the changes needed? Currently, the pandas-on-Spark modules are not tested fully. We should enable the plot unit tests. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Enable plot unit tests. Closes #32151 from xinrong-databricks/port.plot_tests. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-04-14 13:20:16 +09:00
Alex Mooney	faa928cefc	[MINOR][PYTHON][DOCS] Fix docstring for pyspark.sql.DataFrameWriter.json lineSep param ### What changes were proposed in this pull request? Add a new line to the `lineSep` parameter so that the doc renders correctly. ### Why are the changes needed? > <img width="608" alt="image" src="https://user-images.githubusercontent.com/8269566/114631408-5c608900-9c71-11eb-8ded-ae1e21ae48b2.png"> The first line of the description is part of the signature and is bolded. ### Does this PR introduce _any_ user-facing change? Yes, it changes how the docs for `pyspark.sql.DataFrameWriter.json` are rendered. ### How was this patch tested? I didn't test it; I don't have the doc rendering tool chain on my machine, but the change is obvious. Closes #32153 from AlexMooney/patch-1. Authored-by: Alex Mooney <alexmooney@fastmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-04-14 13:14:51 +09:00
Xinrong Meng	8ebc3fca8c	[SPARK-35012][PYTHON] Port Koalas DataFrame-related unit tests into PySpark ### What changes were proposed in this pull request? Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas DataFrame-related unit tests to PySpark. ### Why are the changes needed? Currently, the pandas-on-Spark modules are not fully tested. We should enable the DataFrame-related unit tests first. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Enable DataFrame-related unit tests. Closes #32131 from xinrong-databricks/port.test_dataframe_related. Lead-authored-by: Xinrong Meng <xinrong.meng@databricks.com> Co-authored-by: xinrong-databricks <47337188+xinrong-databricks@users.noreply.github.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-04-13 14:24:08 -07:00
Xinrong Meng	a392633566	[SPARK-34996][PYTHON] Port Koalas Series-related unit tests into PySpark ### What changes were proposed in this pull request? Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas Series related unit tests to PySpark. ### Why are the changes needed? Currently, the pandas-on-Spark modules are not fully tested. We should enable the Series related unit tests first. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Enable Series-related unit tests. Closes #32117 from xinrong-databricks/port.test_series_related. Lead-authored-by: Xinrong Meng <xinrong.meng@databricks.com> Co-authored-by: xinrong-databricks <47337188+xinrong-databricks@users.noreply.github.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-04-13 13:03:35 +09:00
Xinrong Meng	9c1f807549	[SPARK-35031][PYTHON] Port Koalas operations on different frames tests into PySpark ### What changes were proposed in this pull request? Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas operations on different frames unit tests to PySpark. ### Why are the changes needed? Currently, the pandas-on-Spark modules are not tested fully. We should enable the operations on different frames unit tests. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Enable operations on different frames unit tests. Closes #32133 from xinrong-databricks/port.test_ops_on_diff_frames. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-04-13 11:22:51 +09:00
Yikun Jiang	b43f7e6a97	[SPARK-35019][PYTHON][SQL] Fix type hints mismatches in pyspark.sql.* ### What changes were proposed in this pull request? Fix type hints mismatches in pyspark.sql.* ### Why are the changes needed? There were some mismatches in pyspark.sql.* ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? dev/lint-python passed. Closes #32122 from Yikun/SPARK-35019. Authored-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-04-13 11:21:13 +09:00
Luka Sturtewagen	fd8081cd27	[SPARK-34983][PYTHON] Renaming the package alias from pp to ps ### What changes were proposed in this pull request? This PR proposes to fix: ```python import pyspark.pandas as pp ``` to ```python import pyspark.pandas as ps ``` ### Why are the changes needed? `pp` might sound offensive in some contexts. ### Does this PR introduce _any_ user-facing change? The change is in master only. We'll use `ps` as the short name instead of `pp`. ### How was this patch tested? The CI in this PR will test it out. Closes #32108 from LSturtew/renaming_pyspark.pandas. Authored-by: Luka Sturtewagen <luka.sturtewagen@linkit.nl> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-04-12 11:18:08 +09:00
Takuya UESHIN	ff1fc5ed4b	[SPARK-34972][PYTHON][TEST][FOLLOWUP] Fix pyspark.pandas doctests which could be flaky ### What changes were proposed in this pull request? This is a follow-up of #32069. Skips some doctests which could be flaky. ### Why are the changes needed? Some doctests in the `pyspark.pandas` module enabled in #32069 could be flaky because the result row order is nondeterministic. - groupby-apply with UDF which has a return type annotation will lose its index. - `Index.symmetric_difference` uses `DataFrame.intersect` and `subtract` internally. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #32116 from ueshin/issues/SPARK-34972/fix_flaky_tests. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-04-11 10:42:00 +09:00
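The general technique of skipping an order-dependent doctest can be sketched with the standard `doctest` module alone. The docstring, the name `flaky_demo`, and the helper `run_examples` are assumptions for illustration; a plain Python `set` stands in for a result whose row order is not guaranteed, and no Spark code is involved.

```python
import doctest

# A set stands in for a result with nondeterministic order. The last example
# depends on that order, so it is marked with the standard +SKIP directive.
DOCSTRING = """
>>> rows = {"b", "a", "c"}
>>> sorted(rows)
['a', 'b', 'c']
>>> list(rows)  # doctest: +SKIP
['a', 'b', 'c']
"""


def run_examples(docstring):
    """Parse and run the doctest examples in `docstring`; return the results."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(
        docstring, globs={}, name="flaky_demo", filename=None, lineno=0
    )
    runner = doctest.DocTestRunner(verbose=False)
    return runner.run(test)


results = run_examples(DOCSTRING)
```

An alternative to skipping is to make the example deterministic, as the `sorted(rows)` line does, when the output can be sorted cheaply.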
Yikun Jiang	4c1ccdabe8	[SPARK-34630][PYTHON] Add typehint for pyspark.__version__ ### What changes were proposed in this pull request? This PR adds the typehint of pyspark.__version__, which was mentioned in [SPARK-34630](https://issues.apache.org/jira/browse/SPARK-34630). ### Why are the changes needed? A short discussion took place in https://github.com/apache/spark/pull/31823#discussion_r593830911 . After further investigation of [1][2], we can see that `pyspark.__version__` is added by [setup.py](`c06758834e/python/setup.py (L201)`), which embeds `__version__` into the pyspark module; that means `__init__.pyi` is the right place to add the type hint for `__version__`. So, this patch adds the type hint for `__version__` in pyspark/__init__.pyi. [1] [PEP-396 Module Version Numbers](https://www.python.org/dev/peps/pep-0396/) [2] https://packaging.python.org/guides/single-sourcing-package-version/ ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? 1. Disable the ignore_error on `ee7bf7d962/python/mypy.ini (L132)` 2. Run mypy: - Before fix ```shell (venv) ➜ spark git:(SPARK-34629) ✗ mypy --config-file python/mypy.ini python/pyspark | grep version python/pyspark/pandas/spark/accessors.py:884: error: Module has no attribute "__version__" ``` - After fix ```shell (venv) ➜ spark git:(SPARK-34629) ✗ mypy --config-file python/mypy.ini python/pyspark | grep version ``` no output Closes #32110 from Yikun/SPARK-34629. Authored-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-04-11 10:40:08 +09:00
Xinrong Meng	3af2c1bb9c	[SPARK-34886][PYTHON] Port/integrate Koalas DataFrame unit test into PySpark ### What changes were proposed in this pull request? Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas DataFrame unit test to PySpark. ### Why are the changes needed? Currently, the pandas-on-Spark modules are not tested at all. We should enable the DataFrame unit test first. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Enable the DataFrame unit test. Closes #32083 from xinrong-databricks/port.test_dataframe. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-04-09 15:48:13 +09:00
Takuya UESHIN	2635c3894f	[SPARK-34972][PYTHON] Make pandas-on-Spark doctests work ### What changes were proposed in this pull request? Now that we merged the Koalas main code into PySpark code base (#32036), we should enable doctests on the Spark's infrastructure. ### Why are the changes needed? Currently the pandas-on-Spark modules are not tested at all. We should enable doctests first, and we will port other unit tests separately later. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Enabled the whole doctests. Closes #32069 from ueshin/issues/SPARK-34972/pyspark-pandas_doctests. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-04-07 20:50:41 +09:00
Yikun Jiang	390d5bde81	[SPARK-34968][TEST][PYTHON] Add the `-fr` argument to xargs rm ### What changes were proposed in this pull request? This patch adds the `-fr` argument to `xargs rm`. ### Why are the changes needed? The command fails in a basic case. If the find command does not return any search results, the rm command is invoked with an empty argument list; it then fails with `rm: missing operand`, the script aborts, and the coverage report is not generated. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? python/run-tests-with-coverage --testnames pyspark.sql.tests.test_arrow --python-executables=python The coverage report is generated without interruption. Closes #32064 from Yikun/patch-1. Authored-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-04-06 15:20:55 -07:00
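The failure mode described in the commit above can be reproduced without `find` or `xargs` by invoking `rm` directly with no file operands. This is a small sketch that assumes a POSIX-like system with `rm` on the `PATH` (the behavior noted in the comments is that of GNU coreutils); the helper name `rm_exit_code` is made up for this example.

```python
import subprocess


def rm_exit_code(*flags):
    """Run `rm` with the given flags and no file operands; return its exit code."""
    completed = subprocess.run(["rm", *flags], capture_output=True, text=True)
    return completed.returncode


# Plain `rm` with nothing to delete reports a missing operand and fails,
# which is what aborts a script when the preceding search finds nothing.
plain_status = rm_exit_code()

# With `-f`, a missing operand is not treated as an error, so the exit
# status is zero and the surrounding script can continue.
forced_status = rm_exit_code("-fr")
```

Since no operands are passed, neither call deletes anything.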
itholic	caf04f9b77	[SPARK-34890][PYTHON] Port/integrate Koalas main codes into PySpark ### What changes were proposed in this pull request? As a first step of [SPARK-34849](https://issues.apache.org/jira/browse/SPARK-34849), this PR proposes porting the Koalas main code into PySpark. This PR contains minimal changes to the existing Koalas code as follows: 1. `databricks.koalas` -> `pyspark.pandas` 2. `from databricks import koalas as ks` -> `from pyspark import pandas as pp` 3. `ks.xxx -> pp.xxx` Other than them: 1. Added a line to `python/mypy.ini` in order to ignore the mypy test. See related issue at [SPARK-34941](https://issues.apache.org/jira/browse/SPARK-34941). 2. Added a comment to several lines in several files to ignore the flake8 F401. See related issue at [SPARK-34943](https://issues.apache.org/jira/browse/SPARK-34943). When this PR is merged, all the features that were previously used in [Koalas](https://github.com/databricks/koalas) will be available in PySpark as well. Users can access to the pandas API in PySpark as below: ```python >>> from pyspark import pandas as pp >>> ppdf = pp.DataFrame({"A": [1, 2, 3], "B": [15, 20, 25]}) >>> ppdf A B 0 1 15 1 2 20 2 3 25 ``` The existing "options and settings" in Koalas are also available in the same way: ```python >>> from pyspark.pandas.config import set_option, reset_option, get_option >>> ppser1 = pp.Series([1, 2, 3]) >>> ppser2 = pp.Series([3, 4, 5]) >>> ppser1 + ppser2 Traceback (most recent call last): ... ValueError: Cannot combine the series or dataframe because it comes from a different dataframe. In order to allow this operation, enable 'compute.ops_on_diff_frames' option. >>> set_option("compute.ops_on_diff_frames", True) >>> ppser1 + ppser2 0 4 1 6 2 8 dtype: int64 ``` Please also refer to the [API Reference](https://koalas.readthedocs.io/en/latest/reference/index.html) and [Options and Settings](https://koalas.readthedocs.io/en/latest/user_guide/options.html) for more detail. 
NOTE that this PR intentionally ports the main code of Koalas first, almost as is and with minimal changes, because: - The Koalas project is fairly large. Making other changes together with the port would make individual changes difficult to review. The Koalas developers include multiple Spark committers who will review. By doing this, the committers will be able to review and drive the development more easily and effectively. - Koalas tests and documentation require major changes to fit well with PySpark, whereas the main code does not. - We recently froze the Koalas codebase, and plan to work together on the initial porting. Porting the main code first as is unblocks the Koalas developers to work on other items in parallel. I promise to make sure that we: - Rename Koalas to PySpark pandas APIs and/or pandas-on-Spark accordingly in the documentation, and in the docstrings and comments in the main code. - Triage and remove APIs that don’t make sense when Koalas is in PySpark. The documentation changes will be tracked in [SPARK-34885](https://issues.apache.org/jira/browse/SPARK-34885); the test code changes will be tracked in [SPARK-34886](https://issues.apache.org/jira/browse/SPARK-34886). ### Why are the changes needed? Please refer to: - [[DISCUSS] Support pandas API layer on PySpark](http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Support-pandas-API-layer-on-PySpark-td30945.html) - [[VOTE] SPIP: Support pandas API layer on PySpark](http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-SPIP-Support-pandas-API-layer-on-PySpark-td30996.html) ### Does this PR introduce _any_ user-facing change? Yes, now users can use the pandas APIs on Spark ### How was this patch tested? Manually tested for exposed major APIs and options as described above. 
### Koalas contributors Koalas would not have been possible without the following contributors: ueshin HyukjinKwon rxin xinrong-databricks RainFung charlesdong1991 harupy floscha beobest2 thunterdb garawalid LucasG0 shril deepyaman gioa fwani 90jam thoo AbdealiJK abishekganesh72 gliptak DumbMachine dvgodoy stbof nitlev hjoo gatorsmile tomspur icexelloss awdavidson guyao akhilputhiry scook12 patryk-oleniuk tracek dennyglee athena15 gstaubli WeichenXu123 hsubbaraj lfdversluis ktksq shengjh margaret-databricks LSturtew sllynn manuzhang jijosg sadikovi Closes #32036 from itholic/SPARK-34890. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-04-06 12:42:39 +09:00
HyukjinKwon	2ca76a57be	[MINOR][DOCS] Use ASCII characters when possible in PySpark documentation ### What changes were proposed in this pull request? This PR replaces non-ASCII characters with ASCII characters where possible in the PySpark documentation ### Why are the changes needed? To avoid unnecessarily using non-ASCII characters, which could lead to issues such as https://github.com/apache/spark/pull/32047 or https://github.com/apache/spark/pull/22782 ### Does this PR introduce _any_ user-facing change? Virtually no. ### How was this patch tested? Found via (Mac OS): ```bash # In Spark root directory cd python pcregrep --color='auto' -n "[\x80-\xFF]" `git ls-files .` ``` Closes #32048 from HyukjinKwon/minor-fix. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-04-04 09:49:36 +03:00
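For readers without `pcregrep`, a comparable check can be sketched in plain Python. Note the difference from the command above: the `pcregrep` pattern matches bytes in the range `0x80` to `0xFF`, while this sketch works on already-decoded text and flags any line containing a character outside ASCII. The function name `find_non_ascii` and the sample lines are assumptions for illustration.

```python
def find_non_ascii(lines):
    """Return (line_number, line) pairs for lines containing non-ASCII characters."""
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if not line.isascii()
    ]


# Hypothetical documentation lines; the second contains a non-ASCII apostrophe.
sample = ["plain ASCII text", "that don\u2019t make sense", "another plain line"]
hits = find_non_ascii(sample)
```

`str.isascii()` is available from Python 3.7 onward.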
David Li	1237124062	[SPARK-34463][PYSPARK][DOCS] Document caveats of Arrow selfDestruct ### What changes were proposed in this pull request? As a followup for #29818, document caveats of using the Arrow selfDestruct option in toPandas, which include: - toPandas() may be slower; - the resulting dataframe may not support some Pandas operations due to immutable backing arrays. ### Why are the changes needed? This will hopefully reduce user confusion as with SPARK-34463. ### Does this PR introduce _any_ user-facing change? Yes - documentation is updated and a config setting description is updated to clearly indicate the config is experimental. ### How was this patch tested? This is a documentation-only change. Closes #31738 from lidavidm/spark-34463. Authored-by: David Li <li.davidm96@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-03-30 13:30:27 +09:00
Kousuke Saruta	14c7bb877d	[SPARK-34872][SQL] quoteIfNeeded should quote a name which contains non-word characters ### What changes were proposed in this pull request? This PR fixes an issue where `quoteIfNeeded` quotes a name only if it contains `.` or ``` ` ```. This method should quote the name if it contains any non-word characters. ### Why are the changes needed? It's a potential bug. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New test. Closes #31964 from sarutak/fix-quoteIfNeeded. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-03-29 09:31:24 +00:00
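The quoting rule described in the commit above can be sketched in a few lines of Python. This is an illustration of the stated behavior only, not Spark's Scala implementation, which may apply additional rules; the function name `quote_if_needed` and the choice of `[A-Za-z0-9_]` as the definition of "word characters" are assumptions.

```python
import re

# Assumed definition of a name made up only of word characters.
_WORD_ONLY = re.compile(r"[A-Za-z0-9_]+")


def quote_if_needed(part: str) -> str:
    """Backtick-quote `part` unless it consists solely of word characters.

    Backticks inside the name are escaped by doubling them.
    """
    if _WORD_ONLY.fullmatch(part):
        return part
    return "`" + part.replace("`", "``") + "`"
```

Under this rule a name such as `a b` is quoted, whereas a check limited to `.` and backticks would have left it unquoted.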
Danny Meijer	ad211ccd9d	[SPARK-34630][PYTHON][SQL] Added typehint for pyspark.sql.Column.contains ### What changes were proposed in this pull request? This PR implements the missing typehints as per SPARK-34630. ### Why are the changes needed? To satisfy the aforementioned Jira ticket ### Does this PR introduce _any_ user-facing change? No, just adding a missing typehint for Project Zen ### How was this patch tested? No tests needed (just adding a typehint) Closes #31823 from dannymeijer/feature/SPARK-34630. Authored-by: Danny Meijer <danny.meijer@nike.com> Signed-off-by: zero323 <mszymkiewicz@gmail.com>	2021-03-24 15:21:19 +01:00
John Ayad	ddfc75ec64	[SPARK-34803][PYSPARK] Pass the raised ImportError if pandas or pyarrow fail to import ### What changes were proposed in this pull request? Pass the raised `ImportError` on failing to import pandas/pyarrow. This will help the user identify whether pandas/pyarrow are indeed not in the environment or if they threw a different `ImportError`. ### Why are the changes needed? This can already happen in Pandas for example where it could throw an `ImportError` on its initialisation path if `dateutil` doesn't satisfy a certain version requirement https://github.com/pandas-dev/pandas/blob/0.24.x/pandas/compat/__init__.py#L438 ### Does this PR introduce _any_ user-facing change? Yes, it will now show the root cause of the exception when pandas or arrow is missing during import. ### How was this patch tested? Manually tested. ```python from pyspark.sql.functions import pandas_udf spark.range(1).select(pandas_udf(lambda x: x)) ``` Before: ``` Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/...//spark/python/pyspark/sql/pandas/functions.py", line 332, in pandas_udf require_minimum_pyarrow_version() File "/.../spark/python/pyspark/sql/pandas/utils.py", line 53, in require_minimum_pyarrow_version raise ImportError("PyArrow >= %s must be installed; however, " ImportError: PyArrow >= 1.0.0 must be installed; however, it was not found. 
``` After: ``` Traceback (most recent call last): File "/.../spark/python/pyspark/sql/pandas/utils.py", line 49, in require_minimum_pyarrow_version import pyarrow ModuleNotFoundError: No module named 'pyarrow' The above exception was the direct cause of the following exception: Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/sql/pandas/functions.py", line 332, in pandas_udf require_minimum_pyarrow_version() File "/.../spark/python/pyspark/sql/pandas/utils.py", line 55, in require_minimum_pyarrow_version raise ImportError("PyArrow >= %s must be installed; however, " ImportError: PyArrow >= 1.0.0 must be installed; however, it was not found. ``` Closes #31902 from johnhany97/jayad/spark-34803. Lead-authored-by: John Ayad <johnhany97@gmail.com> Co-authored-by: John H. Ayad <johnhany97@gmail.com> Co-authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-03-22 23:29:28 +09:00
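The "direct cause" line in the traceback above is what Python prints for explicit exception chaining with `raise ... from`. A minimal sketch of the pattern follows; the function name `require_optional_module`, the message wording, and the module name used in the usage lines are hypothetical and are not PySpark's code.

```python
import importlib


def require_optional_module(name: str):
    """Import `name`, re-raising failures with the original error attached."""
    try:
        return importlib.import_module(name)
    except ImportError as error:
        # `from error` stores the original exception in __cause__, so the
        # traceback shows the root cause before the friendlier message.
        raise ImportError(
            f"{name} must be installed; however, it was not found."
        ) from error


try:
    require_optional_module("hypothetical_missing_module_xyz")
    raised = None
except ImportError as e:
    raised = e
```

Because the original exception is preserved, a user can tell a missing package apart from a package that is installed but fails during its own import.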
HyukjinKwon	c7bf8adc38	[SPARK-34818][PYTHON][DOCS] Reorder the items in User Guide at PySpark documentation ### What changes were proposed in this pull request? This PR proposes to reorder the items in User Guide in PySpark documentation in order to place general guides first and advanced ones later. ### Why are the changes needed? For users to more easily follow. ### Does this PR introduce _any_ user-facing change? Yes, it changes the order of the items in the documentation. ### How was this patch tested? Manually verified the documentation after building: <img width="768" alt="Screen Shot 2021-03-22 at 2 38 41 PM" src="https://user-images.githubusercontent.com/6477701/111945072-5537d680-8b1c-11eb-9f43-02f3ad63a509.png"> FWIW, the current page: https://spark.apache.org/docs/latest/api/python/user_guide/index.html Closes #31922 from HyukjinKwon/SPARK-34818. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-03-22 15:53:39 +09:00
Sean Owen	ed641fbad6	[MINOR][DOCS][ML] Doc 'mode' as a supported Imputer strategy in Pyspark ### What changes were proposed in this pull request? Document `mode` as a supported Imputer strategy in Pyspark docs. ### Why are the changes needed? Support was added in 3.1, and documented in Scala, but some Python docs were missed. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests. Closes #31883 from srowen/ImputerModeDocs. Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-03-20 01:16:49 -05:00
Kousuke Saruta	03dd33cc98	[SPARK-25769][SPARK-34636][SPARK-34626][SQL] sql method in UnresolvedAttribute, AttributeReference and Alias don't quote qualified names properly ### What changes were proposed in this pull request? This PR fixes an issue that `sql` method in the following classes which take qualified names don't quote the qualified names properly. * UnresolvedAttribute * AttributeReference * Alias One instance caused by this issue is reported in SPARK-34626. ``` UnresolvedAttribute("a" :: "b" :: Nil).sql `a.b` // expected: `a`.`b` ``` And other instances are like as follows. ``` UnresolvedAttribute("a`b"::"c.d"::Nil).sql a`b.`c.d` // expected: `a``b`.`c.d` AttributeReference("a.b", IntegerType)(qualifier = "c.d"::Nil).sql c.d.`a.b` // expected: `c.d`.`a.b` Alias(AttributeReference("a", IntegerType)(), "b.c")(qualifier = "d.e"::Nil).sql `a` AS d.e.`b.c` // expected: `a` AS `d.e`.`b.c` ``` ### Why are the changes needed? This is a bug. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New test. Closes #31754 from sarutak/fix-qualified-names. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-03-12 02:58:46 +00:00
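The expected outputs above follow one rule: quote each name part separately, doubling any embedded backtick, then join the parts with dots. A minimal Python sketch of that rule, with hypothetical function names (this is not Spark's Scala implementation):

```python
def quote_identifier(part):
    """Wrap one name part in backticks, escaping embedded backticks by doubling them."""
    return "`" + part.replace("`", "``") + "`"


def quote_qualified_name(parts):
    """Quote every part on its own, then join with dots."""
    return ".".join(quote_identifier(part) for part in parts)


print(quote_qualified_name(["a", "b"]))
print(quote_qualified_name(["a`b", "c.d"]))
```

Quoting part by part is what keeps a dot inside a part (`c.d`) distinguishable from the dot that separates parts.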
wankunde	60e324aa9f	[SPARK-34688][PYTHON] Upgrade to Py4J 0.10.9.2 ### What changes were proposed in this pull request? This PR upgrades Py4J from 0.10.9.1 to 0.10.9.2, which contains some bug fixes and improvements. * expose shell parameter in Popen inside launch_gateway. (bartdag/py4j@220efc3) * fixed Flake8 errors (bartdag/py4j@6c6ee9a) ### Why are the changes needed? To leverage fixes from the upstream in Py4J. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Jenkins build and GitHub Actions will test it out. Closes #31796 from wankunde/py4j. Authored-by: wankunde <wankunde@163.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-03-11 09:51:41 -06:00
HyukjinKwon	2526fdea48	[SPARK-34657][PYTHON][DOCS] Replace the tag of release to the hash to hide RC tags in Binder ### What changes were proposed in this pull request? Currently Binder link at Spark 3.1.1 (https://mybinder.org/v2/gh/apache/spark/v3.1.1-rc3?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb) shows `v3.1.1-rc3` like: ![Screen Shot 2021-03-08 at 10 10 55 AM](https://user-images.githubusercontent.com/6477701/110262729-ecb70880-7ff7-11eb-92ba-f151d74985a6.png) After the fix, it will shows the explicit hash: ![Screen Shot 2021-03-08 at 10 17 25 AM](https://user-images.githubusercontent.com/6477701/110262740-f476ad00-7ff7-11eb-8632-5b418ff87024.png) In addition, this also fixes the examples URL while I am fixing it. For example: https://github.com/apache/spark/tree/v3.1.1-rc3/examples/src/main/python -> https://github.com/apache/spark/tree/1d550c4e902/examples/src/main/python Note that it is hash in order to make both dev and release easier. ### Why are the changes needed? To hide RC tags. ### Does this PR introduce _any_ user-facing change? It will just change the URL shown when Binder is being loaded. ### How was this patch tested? Manually tested: ```bash make clean html ``` ![Screen Shot 2021-03-08 at 10 17 06 AM](https://user-images.githubusercontent.com/6477701/110262813-2ee04a00-7ff8-11eb-9983-c4484f7832c4.png) ```bash git_hash=`git rev-parse --short HEAD` export GIT_HASH=$git_hash make clean html ``` ![Screen Shot 2021-03-08 at 10 17 25 AM](https://user-images.githubusercontent.com/6477701/110262805-2982ff80-7ff8-11eb-8560-e1e2aa7b263a.png) Closes #31773 from HyukjinKwon/SPARK-34657. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-03-08 10:48:17 +09:00
Peter Toth	ab8a9a0ceb	[SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite ### What changes were proposed in this pull request? pyrolite 4.21 introduced and enabled value comparison by default (`valueCompare=true`) during object memoization and serialization: https://github.com/irmen/Pyrolite/blob/pyrolite-4.21/java/src/main/java/net/razorvine/pickle/Pickler.java#L112-L122 This change has undesired effect when we serialize a row (actually `GenericRowWithSchema`) to be passed to python: https://github.com/apache/spark/blob/branch-3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala#L60. A simple example is that ``` new GenericRowWithSchema(Array(1.0, 1.0), StructType(Seq(StructField("_1", DoubleType), StructField("_2", DoubleType)))) ``` and ``` new GenericRowWithSchema(Array(1, 1), StructType(Seq(StructField("_1", IntegerType), StructField("_2", IntegerType)))) ``` are currently equal and the second instance is replaced to the short code of the first one during serialization. ### Why are the changes needed? 
The above can cause nasty issues like the one in https://issues.apache.org/jira/browse/SPARK-34545 description: ``` >>> from pyspark.sql.functions import udf >>> from pyspark.sql.types import * >>> >>> def udf1(data_type): def u1(e): return e[0] return udf(u1, data_type) >>> >>> df = spark.createDataFrame([((1.0, 1.0), (1, 1))], ['c1', 'c2']) >>> >>> df = df.withColumn("c3", udf1(DoubleType())("c1")) >>> df = df.withColumn("c4", udf1(IntegerType())("c2")) >>> >>> df.select("c3").show() +---+ \| c3\| +---+ \|1.0\| +---+ >>> df.select("c4").show() +---+ \| c4\| +---+ \| 1\| +---+ >>> df.select("c3", "c4").show() +---+----+ \| c3\| c4\| +---+----+ \|1.0\|null\| +---+----+ ``` This is because during serialization from JVM to Python `GenericRowWithSchema(1.0, 1.0)` (`c1`) is memoized first and when `GenericRowWithSchema(1, 1)` (`c2`) comes next, it is replaced to some short code of the `c1` (instead of serializing `c2` out) as they are `equal()`. The python functions then runs but the return type of `c4` is expected to be `IntegerType` and if a different type (`DoubleType`) comes back from python then it is discarded: https://github.com/apache/spark/blob/branch-3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala#L108-L113 After this PR: ``` >>> df.select("c3", "c4").show() +---+---+ \| c3\| c4\| +---+---+ \|1.0\| 1\| +---+---+ ``` ### Does this PR introduce _any_ user-facing change? Yes, fixes a correctness issue. ### How was this patch tested? Added new UT + manual tests. Closes #31682 from peter-toth/SPARK-34545-fix-row-comparison. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-03-07 19:12:42 -06:00
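The memoization problem described above can be illustrated without Pyrolite. The sketch below is a toy serializer, not Pyrolite's pickler: the `serialize` function and its `value_compare` flag are hypothetical stand-ins that only mimic the idea of replacing an already-seen object with a short back-reference.

```python
def serialize(objects, value_compare):
    """Toy memoizing serializer: emit ("value", obj) once, then ("ref", index) for repeats.

    With value_compare=True the memo is keyed by equality, otherwise by identity.
    """
    memo = {}
    out = []
    for obj in objects:
        key = obj if value_compare else id(obj)
        if key in memo:
            out.append(("ref", memo[key]))
        else:
            memo[key] = len(out)
            out.append(("value", obj))
    return out


c1 = (1.0, 1.0)  # stands in for the double-typed row
c2 = (1, 1)      # stands in for the integer-typed row

by_value = serialize([c1, c2], value_compare=True)
by_identity = serialize([c1, c2], value_compare=False)
print(by_value)
print(by_identity)
```

Because `(1.0, 1.0) == (1, 1)` in Python, the value-keyed memo replaces the integer row with a reference to the double row, which mirrors how `c4` ends up receiving doubles in the report above.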
Sean Owen	2f30cdebb1	[SPARK-34642][DOCS][ML] Fix TypeError in Pyspark Linear Regression docs ### What changes were proposed in this pull request? Fix a call to setParams in the Linear Regression docs example in Pyspark to avoid a TypeError. ### Why are the changes needed? The example is slightly wrong and we should not show an error in the docs. ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? Existing tests Closes #31760 from srowen/SPARK-34642. Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-03-06 07:32:01 -08:00
Takuya UESHIN	331d459ee7	[SPARK-34610][PYTHON][TEST] Fix Python UDF used in GroupedAggPandasUDFTests ### What changes were proposed in this pull request? Fixes a Python UDF `plus_one` used in `GroupedAggPandasUDFTests` to always return float (double) values. ### Why are the changes needed? The Python UDF `plus_one` used in `GroupedAggPandasUDFTests` is always returning `v + 1` regardless of its type. The return type of the UDF is 'double', so if the input is int, the result will be `null`. ```py >>> df = spark.range(10).toDF('id') \ ... .withColumn("vs", array([lit(i * 1.0) + col('id') for i in range(20, 30)])) \ ... .withColumn("v", explode(col('vs'))) \ ... .drop('vs') \ ... .withColumn('w', lit(1.0)) >>> udf('double') ... def plus_one(v): ... assert isinstance(v, (int, float)) ... return v + 1 ... >>> pandas_udf('double', PandasUDFType.GROUPED_AGG) ... def sum_udf(v): ... return v.sum() ... >>> df.groupby(plus_one(df.id)).agg(sum_udf(df.v)).show() +------------+----------+ \|plus_one(id)\|sum_udf(v)\| +------------+----------+ \| null\| 2900.0\| +------------+----------+ ``` This is meaningless and should be: ```py >>> udf('double') ... def plus_one(v): ... assert isinstance(v, (int, float)) ... return float(v + 1) ... >>> df.groupby(plus_one(df.id)).agg(sum_udf(df.v)).sort('plus_one(id)').show() +------------+----------+ \|plus_one(id)\|sum_udf(v)\| +------------+----------+ \| 1.0\| 245.0\| \| 2.0\| 255.0\| \| 3.0\| 265.0\| \| 4.0\| 275.0\| \| 5.0\| 285.0\| \| 6.0\| 295.0\| \| 7.0\| 305.0\| \| 8.0\| 315.0\| \| 9.0\| 325.0\| \| 10.0\| 335.0\| +------------+----------+ ``` ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Fixed the test. Closes #31730 from ueshin/issues/SPARK-34610/test_pandas_udf_grouped_agg. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-03-04 10:03:54 +09:00
HyukjinKwon	800590035c	[SPARK-34604][PYTHON][TESTS] Use eventually in TaskContextTestsWithWorkerReuse.test_task_context_correct_with_python_worker_reuse ### What changes were proposed in this pull request? `TaskContextTestsWithWorkerReuse.test_task_context_correct_with_python_worker_reuse` can be flaky and fails sometimes: ``` ====================================================================== ERROR [1.798s]: test_task_context_correct_with_python_worker_reuse (pyspark.tests.test_taskcontext.TaskContextTestsWithWorkerReuse) ... test_task_context_correct_with_python_worker_reuse self.assertTrue(pid in worker_pids) AssertionError: False is not true ---------------------------------------------------------------------- ``` I suspect that the Python worker was killed for whatever reason and new attempt created a new Python worker. This PR fixes the flakiness simply by retrying the test case. ### Why are the changes needed? To make the tests more robust. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? Manually tested it by controlling the conditions manually in the test codes. Closes #31723 from HyukjinKwon/SPARK-34604. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-03-04 08:40:48 +09:00
Richard Penney	7d0743b493	[SPARK-33678][SQL] Product aggregation function ### Why is this change being proposed? This patch adds support for a new "product" aggregation function in `sql.functions` which multiplies-together all values in an aggregation group. This is likely to be useful in statistical applications which involve combining probabilities, or financial applications that involve combining cumulative interest rates, but is also a versatile mathematical operation of similar status to `sum` or `stddev`. Other users [have noted](https://stackoverflow.com/questions/52991640/cumulative-product-in-spark) the absence of such a function in current releases of Spark. This function is both much more concise than an expression of the form `exp(sum(log(...)))`, and avoids awkward edge-cases associated with some values being zero or negative, as well as being less computationally costly. ### Does this PR introduce _any_ user-facing change? No - only adds new function. ### How was this patch tested? Built-in tests have been added for the new `catalyst.expressions.aggregate.Product` class and its invocation via the (scala) `sql.functions.product` function. The latter, and the PySpark wrapper have also been manually tested in spark-shell and pyspark sessions. The SparkR wrapper is currently untested, and may need separate validation (I'm not an "R" user myself). 
An illustration of the new functionality, within PySpark is as follows: ``` import pyspark.sql.functions as pf, pyspark.sql.window as pw df = sqlContext.range(1, 17).toDF("x") win = pw.Window.partitionBy(pf.lit(1)).orderBy(pf.col("x")) df.withColumn("factorial", pf.product("x").over(win)).show(20, False) +---+---------------+ \|x \|factorial \| +---+---------------+ \|1 \|1.0 \| \|2 \|2.0 \| \|3 \|6.0 \| \|4 \|24.0 \| \|5 \|120.0 \| \|6 \|720.0 \| \|7 \|5040.0 \| \|8 \|40320.0 \| \|9 \|362880.0 \| \|10 \|3628800.0 \| \|11 \|3.99168E7 \| \|12 \|4.790016E8 \| \|13 \|6.2270208E9 \| \|14 \|8.71782912E10 \| \|15 \|1.307674368E12 \| \|16 \|2.0922789888E13\| +---+---------------+ ``` Closes #30745 from rwpenney/feature/agg-product. Lead-authored-by: Richard Penney <rwp@rwpenney.uk> Co-authored-by: Richard Penney <rwpenney@users.noreply.github.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-03-02 16:51:07 +09:00
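The edge cases mentioned for `exp(sum(log(...)))` can be seen in plain Python. This is only a sketch with local lists and the standard library (`math.prod` needs Python 3.8+); the helper name `product_via_logs` is made up for illustration and nothing here calls Spark:

```python
import math


def product_via_logs(values):
    """Product computed as exp(sum(log(v))); only defined for strictly positive values."""
    return math.exp(sum(math.log(v) for v in values))


positive = [1.0, 2.0, 3.0, 4.0]
direct = math.prod(positive)
via_logs = product_via_logs(positive)
print(direct, via_logs)

# A direct product handles negative (and zero) values without special casing.
mixed = [-2.0, 3.0]
print(math.prod(mixed))

try:
    product_via_logs(mixed)
    log_form_failed = False
except ValueError:
    # math.log rejects zero and negative inputs.
    log_form_failed = True
print(log_form_failed)
```

The log form also accumulates floating-point rounding, so its result should be compared with a tolerance rather than for exact equality.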
Phillip Henry	397b843890	[SPARK-34415][ML] Randomization in hyperparameter optimization ### What changes were proposed in this pull request? Code in the PR generates random parameters for hyperparameter tuning. A discussion with Sean Owen can be found on the dev mailing list here: http://apache-spark-developers-list.1001551.n3.nabble.com/Hyperparameter-Optimization-via-Randomization-td30629.html All code is entirely my own work and I license the work to the project under the project’s open source license. ### Why are the changes needed? Randomization can be a more effective technique than a grid search since min/max points can fall between the grid and never be found. Randomisation is not so restricted although the probability of finding minima/maxima is dependent on the number of attempts. Alice Zheng has an accessible description on how this technique works at https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html Although there are Python libraries with more sophisticated techniques, not every Spark developer is using Python. ### Does this PR introduce _any_ user-facing change? A new class (`ParamRandomBuilder.scala`) and its tests have been created but there is no change to existing code. This class offers an alternative to `ParamGridBuilder` and can be dropped into the code wherever `ParamGridBuilder` appears. Indeed, it extends `ParamGridBuilder` and is completely compatible with its interface. It merely adds one method that provides a range over which a hyperparameter will be randomly defined. ### How was this patch tested? Tests `ParamRandomBuilderSuite.scala` and `RandomRangesSuite.scala` were added. `ParamRandomBuilderSuite` is the analogue of the already existing `ParamGridBuilderSuite` which tests the user-facing interface. `RandomRangesSuite` uses ScalaCheck to test the random ranges over which hyperparameters are distributed. Closes #31535 from PhillHenry/ParamRandomBuilder. 
Authored-by: Phillip Henry <PhillHenry@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-02-27 08:34:39 -06:00
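The difference between grid and random candidates can be sketched in plain Python. This does not use `ParamRandomBuilder` or any Spark API; the function names, the toy `loss`, and the seed are all assumptions made for illustration:

```python
import random


def grid_candidates(low, high, steps):
    """Evenly spaced candidates, endpoints included."""
    width = (high - low) / (steps - 1)
    return [low + i * width for i in range(steps)]


def random_candidates(low, high, count, seed=0):
    """Candidates drawn uniformly at random from [low, high]."""
    rng = random.Random(seed)
    return [rng.uniform(low, high) for _ in range(count)]


def loss(x):
    """Toy objective whose minimum (x = 0.37) falls between grid points."""
    return (x - 0.37) ** 2


grid = grid_candidates(0.0, 1.0, 5)
rand = random_candidates(0.0, 1.0, 5)
best_grid = min(grid, key=loss)
best_rand = min(rand, key=loss)
print(best_grid, best_rand)
```

The grid can never do better than its nearest fixed point, whereas the random draw has no such floor, although how close it gets depends on the number of attempts, as the description above notes.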
HyukjinKwon	b5470ae294	[MINOR][DOCS] Replace http to https when possible in PySpark documentation ### What changes were proposed in this pull request? This PR proposes: - Change http to https for better security - Change http://apache-spark-developers-list.1001551.n3.nabble.com/ to official mailing list link (https://mail-archives.apache.org/mod_mbox/spark-dev/) ### Why are the changes needed? For better security, and to use official link. ### Does this PR introduce _any_ user-facing change? Yes, It exposes more secure and correct links to the PySpark end users in PySpark documentation. ### How was this patch tested? I manually checked if each link works Closes #31616 from HyukjinKwon/minor-https. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-02-23 11:18:47 +09:00
“attilapiros”	bdcad33d8b	[SPARK-34433][DOCS] Lock Jekyll version by Gemfile and Bundler ### What changes were proposed in this pull request? Improving the documentation and release process by pinning Jekyll version by Gemfile and Bundler. Some files and their responsibilities within this PR: - `docs/.bundle/config` is used to specify a directory "docs/.local_ruby_bundle" which will be used as destination to install the ruby packages into instead of the global one which requires root access - `docs/Gemfile` is specifying the required Jekyll version and other top level gem versions - `docs/Gemfile.lock` is generated by the "bundle install". This file contains the exact resolved versions of all the gems including the top level gems and all the direct and transitive dependencies of those gems. When this file is generated it contains a platform related section "PLATFORMS" (in my case after the generation it was "universal-darwin-19"). Still this file must be under version control as when the version of a gem does not fit to the one specified in `Gemfile` an error comes (i.e. if the `Gemfile.lock` was generated for Jekyll 4.1.0 and its version is updated in the `Gemfile` to 4.2.0 then it triggers the error: "The bundle currently has jekyll locked at 4.1.0."). This is solution is also suggested officially in [its documentation](https://bundler.io/rationale.html#checking-your-code-into-version-control). To get rid of the specific platform (like "universal-darwin-19") first we have to add "ruby" as platform [which means this should work on every platform where Ruby runs](https://guides.rubygems.org/what-is-a-gem/)) by running "bundle lock --add-platform ruby" then the specific platform can be removed by "bundle lock --remove-platform universal-darwin-19". After this the correct process to update Jekyll version is the following: 1. update the version in `Gemfile` 2. run "bundle update" which updates the `Gemfile.lock` 3. 
commit both files This process for version update is tested for details please check the testing section. ### Why are the changes needed? Using different Jekyll versions can generate different output documents. This PR standardize the process. ### Does this PR introduce _any_ user-facing change? No, assuming the release was done via docker by using `do-release-docker.sh`. In that case there should be no difference at all as the same Jekyll version is specified in the Gemfile. ### How was this patch tested? #### Testing document generation Doc generation step was triggered via the docker release: ``` $ ./do-release-docker.sh -d ~/working -n -s docs ... ======================== = Building documentation... Command: /opt/spark-rm/release-build.sh docs Log file: docs.log Skipping publish step. ``` The docs.log contains the followings: ``` Building Spark docs Fetching gem metadata from https://rubygems.org/......... Using bundler 2.2.9 Fetching rb-fsevent 0.10.4 Fetching forwardable-extended 2.6.0 Fetching public_suffix 4.0.6 Fetching colorator 1.1.0 Fetching eventmachine 1.2.7 Fetching http_parser.rb 0.6.0 Fetching ffi 1.14.2 Fetching concurrent-ruby 1.1.8 Installing colorator 1.1.0 Installing forwardable-extended 2.6.0 Installing rb-fsevent 0.10.4 Installing public_suffix 4.0.6 Installing http_parser.rb 0.6.0 with native extensions Installing eventmachine 1.2.7 with native extensions Installing concurrent-ruby 1.1.8 Fetching rexml 3.2.4 Fetching liquid 4.0.3 Installing ffi 1.14.2 with native extensions Installing rexml 3.2.4 Installing liquid 4.0.3 Fetching mercenary 0.4.0 Installing mercenary 0.4.0 Fetching rouge 3.26.0 Installing rouge 3.26.0 Fetching safe_yaml 1.0.5 Installing safe_yaml 1.0.5 Fetching unicode-display_width 1.7.0 Installing unicode-display_width 1.7.0 Fetching webrick 1.7.0 Installing webrick 1.7.0 Fetching pathutil 0.16.2 Fetching kramdown 2.3.0 Fetching terminal-table 2.0.0 Fetching addressable 2.7.0 Fetching i18n 1.8.9 Installing terminal-table 
2.0.0 Installing pathutil 0.16.2 Installing i18n 1.8.9 Installing addressable 2.7.0 Installing kramdown 2.3.0 Fetching kramdown-parser-gfm 1.1.0 Installing kramdown-parser-gfm 1.1.0 Fetching rb-inotify 0.10.1 Fetching sassc 2.4.0 Fetching em-websocket 0.5.2 Installing rb-inotify 0.10.1 Installing em-websocket 0.5.2 Installing sassc 2.4.0 with native extensions Fetching listen 3.4.1 Installing listen 3.4.1 Fetching jekyll-watch 2.2.1 Installing jekyll-watch 2.2.1 Fetching jekyll-sass-converter 2.1.0 Installing jekyll-sass-converter 2.1.0 Fetching jekyll 4.2.0 Installing jekyll 4.2.0 Fetching jekyll-redirect-from 0.16.0 Installing jekyll-redirect-from 0.16.0 Bundle complete! 4 Gemfile dependencies, 30 gems now installed. Bundled gems are installed into `./.local_ruby_bundle` ``` #### Testing Jekyll (or other gem) update First locally I reverted Jekyll to 4.1.0: ``` $ rm Gemfile.lock $ rm -rf .local_ruby_bundle # edited Gemfile to use version 4.1.0 $ cat Gemfile source "https://rubygems.org" gem "jekyll", "4.1.0" gem "rouge", "3.26.0" gem "jekyll-redirect-from", "0.16.0" gem "webrick", "1.7" $ bundle install ... ``` Testing Jekyll version before the update: ``` $ bundle exec jekyll --version jekyll 4.1.0 ``` Imitating Jekyll update coming from git by reverting my local changes: ``` $ git checkout Gemfile Updated 1 path from the index $ cat Gemfile source "https://rubygems.org" gem "jekyll", "4.2.0" gem "rouge", "3.26.0" gem "jekyll-redirect-from", "0.16.0" gem "webrick", "1.7" $ git checkout Gemfile.lock Updated 1 path from the index ``` Run the install: ``` $ bundle install ... ``` Checking the updated Jekyll version: ``` $ bundle exec jekyll --version jekyll 4.2.0 ``` Closes #31559 from attilapiros/pin-jekyll-version. 
Lead-authored-by: “attilapiros” <piros.attila.zsolt@gmail.com> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Co-authored-by: Attila Zsolt Piros <2017933+attilapiros@users.noreply.github.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-02-18 12:17:57 +09:00
Max Gekk	5957bc18a1	[SPARK-34451][SQL] Add alternatives for datetime rebasing SQL configs and deprecate legacy configs ### What changes were proposed in this pull request? Move the datetime rebase SQL configs out of the `legacy` namespace by: 1. Renaming the existing rebase configs, e.g. `spark.sql.legacy.parquet.datetimeRebaseModeInRead` -> `spark.sql.parquet.datetimeRebaseModeInRead`. 2. Adding the legacy configs as alternatives. 3. Deprecating the legacy rebase configs. ### Why are the changes needed? The rebasing SQL configs like `spark.sql.legacy.parquet.datetimeRebaseModeInRead` can be used not only for migration from previous Spark versions but also to read/write datetime columns saved by other systems/frameworks/libs. So, the configs shouldn't be considered as legacy configs. ### Does this PR introduce _any_ user-facing change? Should not. Users will see a warning if they still use one of the legacy configs. ### How was this patch tested? 1. Manually checking new configs: ```scala scala> spark.conf.get("spark.sql.parquet.datetimeRebaseModeInRead") res0: String = EXCEPTION scala> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY") 21/02/17 14:57:10 WARN SQLConf: The SQL config 'spark.sql.legacy.parquet.datetimeRebaseModeInRead' has been deprecated in Spark v3.2 and may be removed in the future. Use 'spark.sql.parquet.datetimeRebaseModeInRead' instead. scala> spark.conf.get("spark.sql.parquet.datetimeRebaseModeInRead") res2: String = LEGACY ``` 2. By running a datetime rebasing test suite: ``` $ build/sbt "test:testOnly *ParquetRebaseDatetimeV1Suite" ``` Closes #31576 from MaxGekk/rebase-confs-alternatives. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-02-17 14:04:47 +00:00
Eric Lemmon	e3b6e4ad43	[SPARK-33434][PYTHON][DOCS] Added RuntimeConfig to PySpark docs ### What changes were proposed in this pull request? Documentation for `SparkSession.conf.isModifiable` is missing from the Python API site, so we added a Configuration section to the Spark SQL page to expose docs for the `RuntimeConfig` class (the class containing `isModifiable`). Then a `:class:` reference to `RuntimeConfig` was added to the `SparkSession.conf` docstring to create a link there as well. ### Why are the changes needed? No docs were generated for `pyspark.sql.conf.RuntimeConfig`. ### Does this PR introduce _any_ user-facing change? Yes--a new Configuration section to the Spark SQL page and a `Returns` section of the `SparkSession.conf` docstring, so this will now show a link to the `pyspark.sql.conf.RuntimeConfig` page. This is a change compared to both the released Spark version and the unreleased master branch. ### How was this patch tested? First built the Python docs: ```bash cd $SPARK_HOME/docs SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve ``` Then verified all pages and links: 1. Configuration link displayed on the API Reference page, and it clicks through to Spark SQL page: http://localhost:4000/api/python/reference/index.html ![image](https://user-images.githubusercontent.com/1160861/107601918-a2f02380-6bed-11eb-9b8f-974a0681a2a9.png) 2. Configuration section displayed on the Spark SQL page, and the RuntimeConfig link clicks through to the RuntimeConfig page: http://localhost:4000/api/python/reference/pyspark.sql.html#configuration ![image](https://user-images.githubusercontent.com/1160861/107602058-0d08c880-6bee-11eb-8cbb-ad8c47588085.png)** 3. RuntimeConfig page displayed: http://localhost:4000/api/python/reference/api/pyspark.sql.conf.RuntimeConfig.html ![image](https://user-images.githubusercontent.com/1160861/107602278-94eed280-6bee-11eb-95fc-445ea62ac1a4.png) 4. 
SparkSession.conf page displays the RuntimeConfig link, and it navigates to the RuntimeConfig page: http://localhost:4000/api/python/reference/api/pyspark.sql.SparkSession.conf.html ![image](https://user-images.githubusercontent.com/1160861/107602435-1f373680-6bef-11eb-985a-b72432464940.png) Closes #31483 from Eric-Lemmon/SPARK-33434-document-isModifiable. Authored-by: Eric Lemmon <eric@lemmon.cc> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-02-13 09:32:55 -06:00
HyukjinKwon	92a83463c9	[SPARK-34408][PYTHON] Refactor spark.udf.register to share the same path to generate UDF instance ### What changes were proposed in this pull request? This PR proposes to use `_create_udf` wherever we need to create a `UserDefinedFunction`, to make the code easier to maintain. ### Why are the changes needed? For better readability and maintainability of the code. ### Does this PR introduce _any_ user-facing change? No, refactoring. ### How was this patch tested? Ran the existing unit tests. CI in this PR should test it out too. Closes #31537 from HyukjinKwon/SPARK-34408. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-02-11 10:57:02 +09:00
David Li	9b875ceada	[SPARK-32953][PYTHON][SQL] Add Arrow self_destruct support to toPandas ### What changes were proposed in this pull request? Creating a Pandas dataframe via Apache Arrow currently can use twice as much memory as the final result, because during the conversion, both Pandas and Arrow retain a copy of the data. Arrow has a "self-destruct" mode now (Arrow >= 0.16) to avoid this, by freeing each column after conversion. This PR integrates support for this in toPandas, handling a couple of edge cases: self_destruct has no effect unless the memory is allocated appropriately, which is handled in the Arrow serializer here. Essentially, the issue is that self_destruct frees memory column-wise, but Arrow record batches are oriented row-wise: ``` Record batch 0: allocation 0: column 0 chunk 0, column 1 chunk 0, ... Record batch 1: allocation 1: column 0 chunk 1, column 1 chunk 1, ... ``` In this scenario, Arrow will drop references to all of column 0's chunks, but no memory will actually be freed, as the chunks were just slices of an underlying allocation. The PR copies each column into its own allocation so that memory is instead arranged as so: ``` Record batch 0: allocation 0 column 0 chunk 0, allocation 1 column 1 chunk 0, ... Record batch 1: allocation 2 column 0 chunk 1, allocation 3 column 1 chunk 1, ... ``` The optimization is disabled by default, and can be enabled with the Spark SQL conf "spark.sql.execution.arrow.pyspark.selfDestruct.enabled" set to "true". We can't always apply this optimization because it's more likely to generate a dataframe with immutable buffers, which Pandas doesn't always handle well, and because it is slower overall (since it only converts one column at a time instead of in parallel). ### Why are the changes needed? This lets us load larger datasets - in particular, with N bytes of memory, before we could never load a dataset bigger than N/2 bytes; now the overhead is more like N/1.25 or so. 
### Does this PR introduce _any_ user-facing change? Yes - it adds a new SQL conf "spark.sql.execution.arrow.pyspark.selfDestruct.enabled" ### How was this patch tested? See the [mailing list](http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Reducing-memory-usage-of-toPandas-with-Arrow-quot-self-destruct-quot-option-td30149.html) - it was tested with Python memory_profiler. Unit tests added to check memory within certain bounds and correctness with the option enabled. Closes #29818 from lidavidm/spark-32953. Authored-by: David Li <li.davidm96@gmail.com> Signed-off-by: Bryan Cutler <cutlerb@gmail.com>	2021-02-10 09:58:46 -08:00
Liang-Chi Hsieh	1fbd576410	[SPARK-34080][ML][PYTHON][FOLLOW-UP] Update score function in UnivariateFeatureSelector document ### What changes were proposed in this pull request? This follows up #31160 to update the score function in the document. ### Why are the changes needed? Currently we use `f_classif`, `ch2`, `f_regression`, which sound to me like sklearn's naming. It is good to have them, but I think it would be nice to have the formal score function names alongside sklearn's ones. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? No, only doc change. Closes #31531 from viirya/SPARK-34080-minor. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-02-10 09:24:25 +09:00
Max Gekk	a85490659f	[SPARK-34377][SQL] Add new parquet datasource options to control datetime rebasing in read ### What changes were proposed in this pull request? In the PR, I propose new options for the Parquet datasource: 1. `datetimeRebaseMode` 2. `int96RebaseMode` Both options influence on loading ancient dates and timestamps column values from parquet files. The `datetimeRebaseMode` option impacts on loading values of the `DATE`, `TIMESTAMP_MICROS` and `TIMESTAMP_MILLIS` types, `int96RebaseMode` impacts on loading of `INT96` timestamps. The options support the same values as the SQL configs `spark.sql.legacy.parquet.datetimeRebaseModeInRead` and `spark.sql.legacy.parquet.int96RebaseModeInRead` namely; - `"LEGACY"`, when an option is set to this value, Spark rebases dates/timestamps from the legacy hybrid calendar (Julian + Gregorian) to the Proleptic Gregorian calendar. - `"CORRECTED"`, dates/timestamps are read AS IS from parquet files. - `"EXCEPTION"`, when it is set as an option value, Spark will fail the reading if it sees ancient dates/timestamps that are ambiguous between the two calendars. ### Why are the changes needed? 1. New options will allow to load parquet files from at least two sources in different rebasing modes in the same query. For instance: ```scala val df1 = spark.read.option("datetimeRebaseMode", "legacy").parquet(folder1) val df2 = spark.read.option("datetimeRebaseMode", "corrected").parquet(folder2) df1.join(df2, ...) ``` Before the changes, it is impossible because the SQL config `spark.sql.legacy.parquet.datetimeRebaseModeInRead` influences on both reads. 2. Mixing of Dataset/DataFrame and RDD APIs should become possible. Since SQL configs are not propagated through RDDs, the following code fails on ancient timestamps: ```scala spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "legacy") spark.read.parquet(folder).distinct.rdd.collect() ``` ### Does this PR introduce _any_ user-facing change? No. 
### How was this patch tested? By running the modified test suites: ``` $ build/sbt "sql/test:testOnly ParquetRebaseDatetimeV1Suite" $ build/sbt "sql/test:testOnly ParquetRebaseDatetimeV2Suite" ``` Closes #31489 from MaxGekk/parquet-rebase-options. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-02-08 13:28:40 +00:00
gengjiaan	2c243c93d9	[SPARK-34157][SQL] Unify output of SHOW TABLES and pass output attributes properly ### What changes were proposed in this pull request? The current implement of some DDL not unify the output and not pass the output properly to physical command. Such as: The `ShowTables` output attributes `namespace`, but `ShowTablesCommand` output attributes `database`. As the query plan, this PR pass the output attributes from `ShowTables` to `ShowTablesCommand`, `ShowTableExtended ` to `ShowTablesCommand`. Take `show tables` and `show table extended like 'tbl'` as example. The output before this PR: `show tables` \|database\|tableName\|isTemporary\| -- \| -- \| -- \| default\| tbl\| false\| If catalog is v2 session catalog, the output before this PR: \|namespace\|tableName\| -- \| -- \| default\| tbl `show table extended like 'tbl'` \|database\|tableName\|isTemporary\| information\| -- \| -- \| -- \| -- \| default\| tbl\| false\|Database: default...\| The output after this PR: `show tables` \|namespace\|tableName\|isTemporary\| -- \| -- \| -- \| default\| tbl\| false\| `show table extended like 'tbl'` \|namespace\|tableName\|isTemporary\| information\| -- \| -- \| -- \| -- \| default\| tbl\| false\|Database: default...\| ### Why are the changes needed? This PR have benefits as follows: First, Unify schema for the output of SHOW TABLES. Second, pass the output attributes could keep the expr ID unchanged, so that avoid bugs when we apply more operators above the command output dataframe. ### Does this PR introduce _any_ user-facing change? Yes. The output schema of `SHOW TABLES` replace `database` by `namespace`. ### How was this patch tested? Jenkins test. Closes #31245 from beliefer/SPARK-34157. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: beliefer <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-02-08 08:39:58 +00:00
Xinrong Meng	747ad1809b	[PYTHON][MINOR] Fix docstring of DataFrame.join ### What changes were proposed in this pull request? Fix docstring of PySpark `DataFrame.join`. ### Why are the changes needed? For a better view of PySpark documentation. ### Does this PR introduce _any_ user-facing change? No (only documentation changes). ### How was this patch tested? Manual test. From ![image](https://user-images.githubusercontent.com/47337188/106977730-c14ab080-670f-11eb-8df8-5aea90902104.png) To ![image](https://user-images.githubusercontent.com/47337188/106977834-ed663180-670f-11eb-9c5e-d09be26e0ca8.png) Closes #31463 from xinrong-databricks/fixDoc. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-02-06 09:08:49 -06:00
yi.wu	e9362c2571	[SPARK-34319][SQL] Resolve duplicate attributes for FlatMapCoGroupsInPandas/MapInPandas ### What changes were proposed in this pull request? Resolve duplicate attributes for `FlatMapCoGroupsInPandas`. ### Why are the changes needed? When performing a self-join on top of `FlatMapCoGroupsInPandas`, analysis can fail because of conflicting attributes. For example, ```python df = spark.createDataFrame([(1, 1)], ("column", "value")) row = df.groupby("ColUmn").cogroup( df.groupby("COLUMN") ).applyInPandas(lambda r, l: r + l, "column long, value long") row.join(row).show() ``` error: ``` ... Conflicting attributes: column#163321L,value#163322L ;; 'Join Inner :- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], <lambda>(column#163312L, value#163313L, column#163312L, value#163313L), [column#163321L, value#163322L] : :- Project [ColUmn#163312L, column#163312L, value#163313L] : : +- LogicalRDD [column#163312L, value#163313L], false : +- Project [COLUMN#163312L, column#163312L, value#163313L] : +- LogicalRDD [column#163312L, value#163313L], false +- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], <lambda>(column#163312L, value#163313L, column#163312L, value#163313L), [column#163321L, value#163322L] :- Project [ColUmn#163312L, column#163312L, value#163313L] : +- LogicalRDD [column#163312L, value#163313L], false +- Project [COLUMN#163312L, column#163312L, value#163313L] +- LogicalRDD [column#163312L, value#163313L], false ... ``` ### Does this PR introduce _any_ user-facing change? Yes, queries like the above example no longer fail. ### How was this patch tested? Added unit tests. Closes #31429 from Ngone51/fix-conflcting-attrs-of-FlatMapCoGroupsInPandas. Lead-authored-by: yi.wu <yi.wu@databricks.com> Co-authored-by: wuyi <yi.wu@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-02-02 16:25:32 +09:00
David Toneian	d99d0d27be	[SPARK-34300][PYSPARK][DOCS][MINOR] Fix some typos and syntax issues in docstrings and output of `dev/lint-python` This changeset is published into the public domain. ### What changes were proposed in this pull request? Some typos and syntax issues in docstrings and the output of `dev/lint-python` have been fixed. ### Why are the changes needed? In some places, the documentation did not refer to parameters or classes by the full and correct name, potentially causing uncertainty in the reader or rendering issues in Sphinx. Also, a typo in the standard output of `dev/lint-python` was fixed. ### Does this PR introduce _any_ user-facing change? Slight improvements in documentation, and in standard output of `dev/lint-python`. ### How was this patch tested? Manual testing and `dev/lint-python` run. No new Sphinx warnings arise due to this change. Closes #31401 from DavidToneian/SPARK-34300. Authored-by: David Toneian <david@toneian.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-02-02 09:30:50 +09:00
HyukjinKwon	30468a9015	[SPARK-34306][SQL][PYTHON][R] Use Snake naming rule across the function APIs ### What changes were proposed in this pull request? This PR completes snake_case rule at functions APIs across the languages, see also SPARK-10621. In more details, this PR: - Adds `count_distinct` in Scala Python, and R, and document that `count_distinct` is encouraged. This was not deprecated because `countDistinct` is pretty commonly used. We could deprecate in the future releases. - (Scala-specific) adds `typedlit` but doesn't deprecate `typedLit` which is arguably commonly used. Likewise, we could deprecate in the future releases. - Deprecates and renames: - `sumDistinct` -> `sum_distinct` - `bitwiseNOT` -> `bitwise_not` - `shiftLeft` -> `shiftleft` (matched with SQL name in `FunctionRegistry`) - `shiftRight` -> `shiftright` (matched with SQL name in `FunctionRegistry`) - `shiftRightUnsigned` -> `shiftrightunsigned` (matched with SQL name in `FunctionRegistry`) - (Scala-specific) `callUDF` -> `call_udf` ### Why are the changes needed? To keep the consistent naming in APIs. ### Does this PR introduce _any_ user-facing change? Yes, it deprecates some APIs and add new renamed APIs as described above. ### How was this patch tested? Unittests were added. Closes #31408 from HyukjinKwon/SPARK-34306. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-02-02 09:29:40 +09:00
Ruifeng Zheng	2c4e4f8412	[SPARK-34189][ML] w2v findSynonyms optimization ### What changes were proposed in this pull request? 1. Use Guava `Ordering` instead of `BoundedPriorityQueue`; 2. use local variables; 3. avoid the conversion ml.vector -> mllib.vector. ### Why are the changes needed? This PR is about 30% faster than the existing implementation. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing test suites. Closes #31276 from zhengruifeng/w2v_findSynonyms_opt. Authored-by: Ruifeng Zheng <ruifengz@foxmail.com> Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>	2021-01-27 10:08:53 +08:00
Takuya UESHIN	43fdd1271e	[SPARK-33489][PYSPARK] Add NullType support for Arrow executions ### What changes were proposed in this pull request? Adds `NullType` support for Arrow executions. ### Why are the changes needed? As Arrow supports null type, we can convert `NullType` between PySpark and pandas with Arrow enabled. ### Does this PR introduce _any_ user-facing change? Yes, if a user has a DataFrame including `NullType`, it will be able to convert with Arrow enabled. ### How was this patch tested? Added tests. Closes #31285 from ueshin/issues/SPARK-33489/arrow_nulltype. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-25 11:34:47 +09:00
pgrz	121eb0130e	[SPARK-34191][PYTHON][SQL] Add typing for udf overload ### What changes were proposed in this pull request? Added typing for the keyword-only, single-argument udf overload. ### Why are the changes needed? The intended use case is: ``` @udf(returnType="string") def f(x): ... ``` ### Does this PR introduce _any_ user-facing change? Yes - a new typing for udf is considered valid. ### How was this patch tested? Existing tests. Closes #31282 from pgrz/patch-1. Authored-by: pgrz <grzegorski.piotr@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-22 21:19:20 +09:00
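The keyword-only overload described in SPARK-34191 can be illustrated with a small stand-alone sketch. This is not PySpark's actual stub: `my_udf` and the `returnType` attribute it sets are hypothetical stand-ins, assumed here only to show how `typing.overload` can describe a decorator that is usable both directly and with a single keyword argument.

```python
from functools import partial
from typing import Any, Callable, overload

Func = Callable[..., Any]


@overload
def my_udf(f: Func, returnType: str = ...) -> Func: ...
@overload
def my_udf(f: None = ..., *, returnType: str = ...) -> Callable[[Func], Func]: ...


def my_udf(f=None, returnType="string"):
    # Hypothetical stand-in for a udf-style decorator (not the PySpark API).
    if f is None:
        # Called as @my_udf(returnType=...): return a decorator awaiting f.
        return partial(my_udf, returnType=returnType)
    # Called as @my_udf or my_udf(f): tag the function and return it.
    f.returnType = returnType
    return f


@my_udf(returnType="int")
def plus_one(x):
    return x + 1


@my_udf
def identity(x):
    return x
```

The second overload is the one that makes the keyword-only form type-check: with `f` omitted, the call returns a decorator rather than a decorated function.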
itholic	28131a7794	[SPARK-34190][DOCS] Supplement the description for Python Package Management ### What changes were proposed in this pull request? This PR supplements the contents in the "Python Package Management". If Python is not installed locally on all nodes when using `venv-pack`, the job fails as below. ```python >>> from pyspark.sql.functions import pandas_udf >>> @pandas_udf('double') ... def pandas_plus_one(v: pd.Series) -> pd.Series: ... return v + 1 ... >>> spark.range(10).select(pandas_plus_one("id")).show() ... Cannot run program "./environment/bin/python": error=2, No such file or directory ... ``` This is because the Python in the [packed environment via `venv-pack` has a symbolic link](https://github.com/jcrist/venv-pack/issues/5) that connects Python to the local one. To avoid this confusion, it seems better to have an additional explanation for this. ### Why are the changes needed? To provide more detailed information to users so that they don’t get confused. ### Does this PR introduce _any_ user-facing change? Yes, this PR fixes the part of "Python Package Management" in the "User Guide" documents. ### How was this patch tested? Manually built the doc. ![Screen Shot 2021-01-21 at 7 10 38 PM](https://user-images.githubusercontent.com/44108233/105336258-5e8bec00-5c1c-11eb-870c-86acfc77c082.png) Closes #31280 from itholic/SPARK-34190. Authored-by: itholic <haejoon309@naver.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-21 22:15:42 +09:00
HyukjinKwon	0130a3813a	[SPARK-33730][PYTHON][FOLLOW-UP] Consider the case when the current frame is None ### What changes were proposed in this pull request? This PR proposes to consider the case when [`inspect.currentframe()`](https://docs.python.org/3/library/inspect.html#inspect.currentframe) returns `None` because the underlying Python implementation does not support frames. ### Why are the changes needed? To be safer and potentially for the official support of other Python implementations in the future. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually tested via: When frame is available: ``` vi tmp.py ``` ```python from inspect import * lineno = getframeinfo(currentframe()).lineno + 1 if currentframe() is not None else 0 print(warnings.formatwarning( "Failed to set memory limit: {0}".format(Exception("argh!")), ResourceWarning, __file__, lineno), file=sys.stderr) ``` ``` python tmp.py ``` ``` /.../tmp.py:3: ResourceWarning: Failed to set memory limit: argh! print(warnings.formatwarning( ``` When frame is not available: ``` vi tmp.py ``` ```python from inspect import * lineno = getframeinfo(currentframe()).lineno + 1 if None is not None else 0 print(warnings.formatwarning( "Failed to set memory limit: {0}".format(Exception("argh!")), ResourceWarning, __file__, lineno), file=sys.stderr) ``` ``` python tmp.py ``` ``` /.../tmp.py:0: ResourceWarning: Failed to set memory limit: argh! ``` Closes #31239 from HyukjinKwon/SPARK-33730-followup. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-19 15:30:42 +09:00
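The `None` check from the SPARK-33730 follow-up can be factored into a helper, roughly as sketched below. The helper names `lineno_or_default` and `format_limit_warning`, and the fixed file name `"tmp.py"`, are assumptions made for illustration; they are not taken from the PySpark source.

```python
import inspect
import warnings
from types import FrameType
from typing import Optional


def lineno_or_default(frame: Optional[FrameType], default: int = 0) -> int:
    # inspect.currentframe() may return None on implementations without
    # frame support, so fall back to a default line number in that case.
    if frame is None:
        return default
    return inspect.getframeinfo(frame).lineno


def format_limit_warning(frame: Optional[FrameType]) -> str:
    # Build the warning text without emitting it, as in the manual test.
    return warnings.formatwarning(
        "Failed to set memory limit: argh!",
        ResourceWarning,
        "tmp.py",  # hypothetical file name
        lineno_or_default(frame),
    )


with_frame = format_limit_warning(inspect.currentframe())
without_frame = format_limit_warning(None)
```

With no frame available, the formatted message reports line 0, which matches the second manual test in the commit message.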
Ruifeng Zheng	d8cbef1abf	[SPARK-34093][ML] param maxDepth should check upper bound ### What changes were proposed in this pull request? update the ParamValidators of `maxDepth` ### Why are the changes needed? current impl of tree models only support maxDepth<=30 ### Does this PR introduce _any_ user-facing change? If `maxDepth`>30, fail quickly ### How was this patch tested? existing testsuites Closes #31163 from zhengruifeng/param_maxDepth_upbound. Authored-by: Ruifeng Zheng <ruifengz@foxmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-01-18 11:36:10 -06:00
zero323	098f2268e4	[SPARK-33730][PYTHON] Standardize warning types ### What changes were proposed in this pull request? This PR: - Adds as small hierarchy of warnings to be used in PySpark applications. These extend built-in classes and top level `PySparkWarning`. - Replaces `DeprecationWarnings` (intended for developers) with PySpark specific subclasses of `FutureWarning` (intended for end users). ### Why are the changes needed? - To be more precise and add users additional control (in addition to standard module level filters) over PySpark warnings handling. - Correct semantics (at the moment we use `DeprecationWarning` in user-facing API, but it is intended "for warnings about deprecated features when those warnings are intended for other Python developers"). ### Does this PR introduce _any_ user-facing change? Yes. Code can raise different type of warning than before. ### How was this patch tested? Existing tests. Closes #30985 from zero323/SPARK-33730. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-18 09:32:55 +09:00
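A warning hierarchy of the kind SPARK-33730 describes might look like the sketch below. Only the `PySparkWarning` name comes from the commit message; `PySparkFutureWarning` and the `old_api` function are hypothetical names assumed for this example, and the actual classes in the PR may differ.

```python
import warnings


class PySparkWarning(Warning):
    """Top-level category, so users can filter all such warnings at once."""


class PySparkFutureWarning(PySparkWarning, FutureWarning):
    """Hypothetical subclass: a deprecation notice aimed at end users."""


def old_api() -> int:
    # FutureWarning subclasses are shown by default, unlike
    # DeprecationWarning, which is intended for other developers.
    warnings.warn(
        "old_api is deprecated; use new_api instead.",
        PySparkFutureWarning,
        stacklevel=2,
    )
    return 42


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = old_api()

categories = [w.category for w in caught]
```

Because the subclass inherits from both bases, a filter on either `PySparkWarning` or `FutureWarning` matches it, which is the extra control the commit message refers to.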
Huaxin Gao	f3548837c6	[SPARK-34080][ML][PYTHON] Add UnivariateFeatureSelector ### What changes were proposed in this pull request? Add UnivariateFeatureSelector ### Why are the changes needed? Have one UnivariateFeatureSelector, so we don't need to have three Feature Selectors. ### Does this PR introduce _any_ user-facing change? Yes ``` selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], labelCol=["target"], featureType="categorical", labelType="continuous", selectorType="numTopFeatures", numTopFeatures=100) ``` Or numTopFeatures ``` selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], labelCol=["target"], scoreFunction="f_classif", selectorType="numTopFeatures", numTopFeatures=100) ``` ### How was this patch tested? Add Unit test Closes #31160 from huaxingao/UnivariateSelector. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Weichen Xu <weichen.xu@databricks.com>	2021-01-16 11:09:23 +08:00
ulysses-you	92e5cfd58d	[SPARK-33989][SQL] Strip auto-generated cast when using Cast.sql ### What changes were proposed in this pull request? This PR aims to strip auto-generated cast. The main logic is: 1. Add tag if Cast is specified by user. 2. Wrap `PrettyAttribute` in usePrettyExpression. ### Why are the changes needed? Make sql consistent with dsl. Here is an inconsistent example before this PR: ``` -- output field name: FLOOR(1) spark.emptyDataFrame.select(floor(lit(1))) -- output field name: FLOOR(CAST(1 AS DOUBLE)) spark.sql("select floor(1)") ``` Note that, we don't remove the `Cast` so the auto-generated `Cast` can still work. The only changed place is `usePrettyExpression`, we use `PrettyAttribute` replace `Cast` to give a better sql string. ### Does this PR introduce _any_ user-facing change? Yes, the default field name may change. ### How was this patch tested? Add test and pass exists test. Closes #31034 from ulysses-you/SPARK-33989. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-14 15:27:14 +00:00
Takuya UESHIN	ad8e40e2ab	[SPARK-32338][SQL][PYSPARK][FOLLOW-UP][TEST] Add more tests for slice function ### What changes were proposed in this pull request? This PR is a follow-up of #29138 and #29195 to add more tests for `slice` function. ### Why are the changes needed? The original PRs are missing tests with column-based arguments instead of literals. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added tests and existing tests. Closes #31159 from ueshin/issues/SPARK-32338/slice_tests. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-13 09:56:38 +09:00
HyukjinKwon	aa388cf3d0	[SPARK-34041][PYTHON][DOCS] Miscellaneous cleanup for new PySpark documentation ### What changes were proposed in this pull request? This PR proposes to: - Add a link of quick start in PySpark docs into "Programming Guides" in Spark main docs - `ML` / `MLlib` -> `MLlib (DataFrame-based)` / `MLlib (RDD-based)` in API reference page - Mention other user guides as well because the guide such as [ML](http://spark.apache.org/docs/latest/ml-guide.html) and [SQL](http://spark.apache.org/docs/latest/sql-programming-guide.html). - Mention other migration guides as well because PySpark can get affected by it. ### Why are the changes needed? For better documentation. ### Does this PR introduce _any_ user-facing change? It fixes user-facing docs. However, it's not released out yet. ### How was this patch tested? Manually tested by running: ```bash cd docs SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve --watch ``` Closes #31082 from HyukjinKwon/SPARK-34041. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-08 09:28:31 +09:00
HyukjinKwon	ff284fb6ac	[SPARK-30681][PYTHON][FOLLOW-UP] Keep the name similar with Scala side in higher order functions ### What changes were proposed in this pull request? This PR is a followup of https://github.com/apache/spark/pull/27406. It fixes the naming to match with Scala side. Note that there are a bit of inconsistency already e.g.) `col`, `e`, `expr` and `column`. This part I did not change but other names like `zero` vs `initialValue` or `col1`/`col2` vs `left`/`right` looks unnecessary. ### Why are the changes needed? To make the usage similar with Scala side, and for consistency. ### Does this PR introduce _any_ user-facing change? No, this is not released yet. ### How was this patch tested? GitHub Actions and Jenkins build will test it out. Closes #31062 from HyukjinKwon/SPARK-30681. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-06 18:46:20 +09:00
HyukjinKwon	329850c667	[SPARK-32017][PYTHON][FOLLOW-UP] Rename HADOOP_VERSION to PYSPARK_HADOOP_VERSION in pip installation option ### What changes were proposed in this pull request? This PR is a followup of https://github.com/apache/spark/pull/29703. It renames the `HADOOP_VERSION` environment variable to `PYSPARK_HADOOP_VERSION` in case `HADOOP_VERSION` is already being used somewhere. Arguably `HADOOP_VERSION` is a pretty common name. I see it here and there: - https://www.ibm.com/support/knowledgecenter/SSZUMP_7.2.1/install_grid_sym/understanding_advanced_edition.html - https://cwiki.apache.org/confluence/display/ARROW/HDFS+Filesystem+Support - http://crs4.github.io/pydoop/_pydoop1/installation.html ### Why are the changes needed? To avoid unexpected conflicts between environment variables. ### Does this PR introduce _any_ user-facing change? It renames the environment variable but it's not released yet. ### How was this patch tested? Existing unittests will test. Closes #31028 from HyukjinKwon/SPARK-32017-followup. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-05 17:21:32 +09:00
HyukjinKwon	d6322bf70c	[SPARK-33983][PYTHON] Update cloudpickle to v1.6.0 ### What changes were proposed in this pull request? This PR proposes to upgrade cloudpickle from 1.5.0 to 1.6.0. It virtually contains one fix: `4510be850d` From a cursory look, this isn't a regression, and not even properly supported in Python: ```python >>> import pickle >>> pickle.dumps({}.keys()) Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: cannot pickle 'dict_keys' object ``` So it seems fine not to backport. ### Why are the changes needed? To leverage bug fixes from the cloudpickle upstream. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Jenkins build and GitHub actions build will test it out. Closes #31007 from HyukjinKwon/cloudpickle-upgrade. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-04 10:36:31 -08:00
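The `dict_keys` behaviour quoted in the cloudpickle upgrade message can be checked with the standard library alone. The sketch below assumes a small hypothetical helper, `can_pickle`, and only exercises the built-in `pickle` module, not cloudpickle itself.

```python
import pickle


def can_pickle(obj: object) -> bool:
    # Hypothetical helper: report whether the standard pickle module
    # accepts the object, instead of letting the TypeError propagate.
    try:
        pickle.dumps(obj)
        return True
    except TypeError:
        return False


keys_view = {"a": 1}.keys()

view_ok = can_pickle(keys_view)        # dict_keys views are rejected
list_ok = can_pickle(list(keys_view))  # a materialized list is accepted
```

Materializing the view into a list before serializing is the usual workaround when only the standard `pickle` module is available.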
HyukjinKwon	6b86aa0b52	[SPARK-33984][PYTHON] Upgrade to Py4J 0.10.9.1 ### What changes were proposed in this pull request? This PR upgrade Py4J from 0.10.9 to 0.10.9.1 that contains some bug fixes and improvements. It contains one bug fix (`4152353ac1`). ### Why are the changes needed? To leverage fixes from the upstream in Py4J. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Jenkins build and GitHub Actions will test it out. Closes #31009 from HyukjinKwon/SPARK-33984. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-04 10:23:38 -08:00
Dongjoon Hyun	271c4f6e00	[SPARK-33978][SQL] Support ZSTD compression in ORC data source ### What changes were proposed in this pull request? This PR aims to support ZSTD compression in ORC data source. ### Why are the changes needed? Apache ORC 1.6 supports ZSTD compression to generate more compact files and save the storage cost. - https://issues.apache.org/jira/browse/ORC-363 BEFORE ```scala scala> spark.range(10).write.option("compression", "zstd").orc("/tmp/zstd") java.lang.IllegalArgumentException: Codec [zstd] is not available. Available codecs are uncompressed, lzo, snappy, zlib, none. ``` AFTER ```scala scala> spark.range(10).write.option("compression", "zstd").orc("/tmp/zstd") ``` ```bash $ orc-tools meta /tmp/zstd Processing data file file:/tmp/zstd/part-00011-a63d9a17-456f-42d3-87a1-d922112ed28c-c000.orc [length: 230] Structure for file:/tmp/zstd/part-00011-a63d9a17-456f-42d3-87a1-d922112ed28c-c000.orc File Version: 0.12 with ORC_14 Rows: 1 Compression: ZSTD Compression size: 262144 Calendar: Julian/Gregorian Type: struct<id:bigint> Stripe Statistics: Stripe 1: Column 0: count: 1 hasNull: false Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 9 max: 9 sum: 9 File Statistics: Column 0: count: 1 hasNull: false Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 9 max: 9 sum: 9 Stripes: Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35 Stream: column 0 section ROW_INDEX start: 3 length 11 Stream: column 1 section ROW_INDEX start: 14 length 24 Stream: column 1 section DATA start: 38 length 6 Encoding column 0: DIRECT Encoding column 1: DIRECT_V2 File length: 230 bytes Padding length: 0 bytes Padding ratio: 0% User Metadata: org.apache.spark.version=3.2.0 ``` ### Does this PR introduce _any_ user-facing change? Yes, this is a new feature. ### How was this patch tested? Pass the newly added test case. Closes #31002 from dongjoon-hyun/SPARK-33978. 
Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-04 00:54:47 -08:00
Gabor Somogyi	678294ddc2	[SPARK-33824][PYTHON][DOCS][FOLLOW-UP] Clarify about PYSPARK_DRIVER_PYTHON and spark.yarn.appMasterEnv.PYSPARK_PYTHON ### What changes were proposed in this pull request? This PR proposes to clarify: - `PYSPARK_DRIVER_PYTHON` should not be set for cluster modes in YARN and Kubernetes. - `spark.yarn.appMasterEnv.PYSPARK_PYTHON` is not required in YARN. This is just another way to set `PYSPARK_PYTHON` that is specific to a Spark application. ### Why are the changes needed? To clarify what is required and what is not. ### Does this PR introduce _any_ user-facing change? Yes, this is a user-facing doc change. ### How was this patch tested? Manually tested. Note that credit goes to gaborgsomogyi, who actually tested this and raised a doubt about it to me offline. I also manually tested everything again to double check. Closes #30938 from HyukjinKwon/SPARK-33824-followup. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-28 09:52:42 +09:00
Yuanjian Li	86c1cfc579	[SPARK-33659][SS] Document the current behavior for DataStreamWriter.toTable API ### What changes were proposed in this pull request? Follow up work for #30521, document the following behaviors in the API doc: - Figure out the effects when configurations are (provider/partitionBy) conflicting with the existing table. - Document the lack of functionality on creating a v2 table, and guide that the users should ensure a table is created in prior to avoid the behavior unintended/insufficient table is being created. ### Why are the changes needed? We didn't have full support for the V2 table created in the API now. (TODO SPARK-33638) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Document only. Closes #30885 from xuanyuanking/SPARK-33659. Authored-by: Yuanjian Li <yuanjian.li@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-24 12:44:37 +09:00
Kyle Krueger	0bf3828ac4	[MINOR] update dstream.py with more accurate exceptions ### What changes were proposed in this pull request? Reopened from https://github.com/apache/spark/pull/27525. The exception messages for dstream.py when using windows were improved to be specific about which sliding duration is important. ### Why are the changes needed? The batch interval of dstreams is improperly named as a sliding window. The term sliding window is also used to reference the new window of a dstream collected over a window of rdds in a parent dstream. We should probably fix the naming convention of sliding window used in the dstream class, but for now this more explicit exception message may reduce confusion. ### Does this PR introduce any user-facing change? No ### How was this patch tested? It wasn't, since this is only a change of the exception message. Closes #30871 from kykrueger/kykrueger-patch-1. Authored-by: Kyle Krueger <kyle.s.krueger@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-21 14:17:09 -08:00
HyukjinKwon	4106731fdd	[SPARK-33836][SS][PYTHON][FOLLOW-UP] Use test utils and clean up doctests in table and toTable ### What changes were proposed in this pull request? This PR proposes to: - Make doctests simpler to show the usage (since we're not running them now). - Use the test utils to drop the tables if exists. ### Why are the changes needed? Better docs and code readability. ### Does this PR introduce _any_ user-facing change? No, dev-only. It includes some doc changes in unreleased branches. ### How was this patch tested? Manually tested. ```bash cd python ./run-tests --python-executable=python3.9,python3.8 --testnames "pyspark.sql.tests.test_streaming StreamingTests" ``` Closes #30873 from HyukjinKwon/SPARK-33836. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>	2020-12-22 06:27:27 +09:00
HyukjinKwon	38bbccab75	[SPARK-33869][PYTHON][SQL][TESTS] Have a separate metastore directory for each PySpark test job ### What changes were proposed in this pull request? This PR proposes to have its own metastore directory to avoid potential conflict in catalog operations. ### Why are the changes needed? To make PySpark tests less flaky. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? Manually tested by trying some sleeps in https://github.com/apache/spark/pull/30873. Closes #30875 from HyukjinKwon/SPARK-33869. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-21 11:11:25 -08:00
Jungtaek Lim	8d4d433191	[SPARK-33836][SS][PYTHON] Expose DataStreamReader.table and DataStreamWriter.toTable ### What changes were proposed in this pull request? This PR proposes to expose `DataStreamReader.table` (SPARK-32885) and `DataStreamWriter.toTable` (SPARK-32896) to PySpark, which are the only way to read and write with table in Structured Streaming. ### Why are the changes needed? Please refer SPARK-32885 and SPARK-32896 for rationalizations of these public APIs. This PR only exposes them to PySpark. ### Does this PR introduce _any_ user-facing change? Yes, PySpark users will be able to read and write with table in Structured Streaming query. ### How was this patch tested? Manually tested. > v1 table >> create table A and ingest to the table A ``` spark.sql(""" create table table_pyspark_parquet ( value long, `timestamp` timestamp ) USING parquet """) df = spark.readStream.format('rate').option('rowsPerSecond', 100).load() query = df.writeStream.toTable('table_pyspark_parquet', checkpointLocation='/tmp/checkpoint5') query.lastProgress query.stop() ``` >> read table A and ingest to the table B which doesn't exist ``` df2 = spark.readStream.table('table_pyspark_parquet') query2 = df2.writeStream.toTable('table_pyspark_parquet_nonexist', format='parquet', checkpointLocation='/tmp/checkpoint2') query2.lastProgress query2.stop() ``` >> select tables ``` spark.sql("DESCRIBE TABLE table_pyspark_parquet").show() spark.sql("SELECT * FROM table_pyspark_parquet").show() spark.sql("DESCRIBE TABLE table_pyspark_parquet_nonexist").show() spark.sql("SELECT * FROM table_pyspark_parquet_nonexist").show() ``` > v2 table (leveraging Apache Iceberg as it provides V2 table and custom catalog as well) >> create table A and ingest to the table A ``` spark.sql(""" create table iceberg_catalog.default.table_pyspark_v2table ( value long, `timestamp` timestamp ) USING iceberg """) df = spark.readStream.format('rate').option('rowsPerSecond', 100).load() query = df.select('value', 
'timestamp').writeStream.toTable('iceberg_catalog.default.table_pyspark_v2table', checkpointLocation='/tmp/checkpoint_v2table_1') query.lastProgress query.stop() ``` >> ingest to the non-exist table B ``` df2 = spark.readStream.format('rate').option('rowsPerSecond', 100).load() query2 = df2.select('value', 'timestamp').writeStream.toTable('iceberg_catalog.default.table_pyspark_v2table_nonexist', checkpointLocation='/tmp/checkpoint_v2table_2') query2.lastProgress query2.stop() ``` >> ingest to the non-exist table C partitioned by `value % 10` ``` df3 = spark.readStream.format('rate').option('rowsPerSecond', 100).load() df3a = df3.selectExpr('value', 'timestamp', 'value % 10 AS partition').repartition('partition') query3 = df3a.writeStream.partitionBy('partition').toTable('iceberg_catalog.default.table_pyspark_v2table_nonexist_partitioned', checkpointLocation='/tmp/checkpoint_v2table_3') query3.lastProgress query3.stop() ``` >> select tables ``` spark.sql("DESCRIBE TABLE iceberg_catalog.default.table_pyspark_v2table").show() spark.sql("SELECT * FROM iceberg_catalog.default.table_pyspark_v2table").show() spark.sql("DESCRIBE TABLE iceberg_catalog.default.table_pyspark_v2table_nonexist").show() spark.sql("SELECT * FROM iceberg_catalog.default.table_pyspark_v2table_nonexist").show() spark.sql("DESCRIBE TABLE iceberg_catalog.default.table_pyspark_v2table_nonexist_partitioned").show() spark.sql("SELECT * FROM iceberg_catalog.default.table_pyspark_v2table_nonexist_partitioned").show() ``` Closes #30835 from HeartSaVioR/SPARK-33836. Lead-authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com> Co-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-21 19:42:59 +09:00
HyukjinKwon	6315118676	[SPARK-33824][PYTHON][DOCS] Restructure and improve Python package management page ### What changes were proposed in this pull request? This PR proposes to restructure and refine the Python dependency management page. I lately wrote a blog post which will be published soon, and decided to contribute some of the contents back to the PySpark documentation. FWIW, it has been reviewed by some tech writers and engineers. I built the site to make the review easier: https://hyukjin-spark.readthedocs.io/en/stable/user_guide/python_packaging.html ### Why are the changes needed? For better documentation. ### Does this PR introduce _any_ user-facing change? It's a doc change, but only in unreleased branches for now. ### How was this patch tested? I manually built the docs as: ```bash cd python/docs make clean html open ``` Closes #30822 from HyukjinKwon/SPARK-33824. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-18 10:03:07 +09:00
HyukjinKwon	e2cdfcebd9	[SPARK-32447][CORE][PYTHON][FOLLOW-UP] Fix other occurrences of 'python' to 'python3' ### What changes were proposed in this pull request? This PR proposes to change `python` to `python3` in several places that were missed. ### Why are the changes needed? To use Python 3 by default safely. ### Does this PR introduce _any_ user-facing change? Yes, it will use `python3` as its default Python interpreter. ### How was this patch tested? It was tested together in https://github.com/apache/spark/pull/30735. The test cases there will verify this change together. Closes #30750 from HyukjinKwon/SPARK-32447. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-13 10:41:47 +09:00
Fokko Driesprong	e4d1c10760	[SPARK-32320][PYSPARK] Remove mutable default arguments This is bad practice, and might lead to unexpected behaviour: https://florimond.dev/blog/articles/2018/08/python-mutable-defaults-are-the-source-of-all-evil/ ``` fokkodriesprongFan spark % grep -R "={}" python \| grep def python/pyspark/resource/profile.py: def __init__(self, _java_resource_profile=None, _exec_req={}, _task_req={}): python/pyspark/sql/functions.py:def from_json(col, schema, options={}): python/pyspark/sql/functions.py:def to_json(col, options={}): python/pyspark/sql/functions.py:def schema_of_json(json, options={}): python/pyspark/sql/functions.py:def schema_of_csv(csv, options={}): python/pyspark/sql/functions.py:def to_csv(col, options={}): python/pyspark/sql/functions.py:def from_csv(col, schema, options={}): python/pyspark/sql/avro/functions.py:def from_avro(data, jsonFormatSchema, options={}): ``` ``` fokkodriesprongFan spark % grep -R "=\[\]" python \| grep def python/pyspark/ml/tuning.py: def __init__(self, bestModel, avgMetrics=[], subModels=None): python/pyspark/ml/tuning.py: def __init__(self, bestModel, validationMetrics=[], subModels=None): ``` ### What changes were proposed in this pull request? Removing the mutable default arguments. ### Why are the changes needed? Removing the mutable default arguments, and changing the signature to `Optional[...]`. ### Does this PR introduce _any_ user-facing change? No 👍 ### How was this patch tested? Using the Flake8 bugbear code analysis plugin. Closes #29122 from Fokko/SPARK-32320. Authored-by: Fokko Driesprong <fokko@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>	2020-12-08 09:35:36 +08:00
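The pitfall this commit removes can be reproduced in a few lines. The sketch below is illustrative only: `build_options_shared` and `build_options_safe` are hypothetical helpers, not PySpark functions, and they simply contrast a mutable `{}` default with the `Optional[...] = None` signature style the commit message describes.

```python
from typing import Dict, Optional


def build_options_shared(col: str, options: Dict[str, str] = {}) -> Dict[str, str]:
    # Anti-pattern: the default dict is created once, at function definition
    # time, so every call that omits `options` mutates the same object.
    options.setdefault("col", col)
    return options


def build_options_safe(col: str, options: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    # Safe pattern: default to None and build a fresh dict on each call.
    # Copying also avoids mutating a dict the caller passed in.
    options = dict(options) if options is not None else {}
    options.setdefault("col", col)
    return options


first_shared = build_options_shared("a")
second_shared = build_options_shared("b")  # state leaks in from the first call

first_safe = build_options_safe("a")
second_safe = build_options_safe("b")  # independent of the first call
```

With the shared default, the second call still reports the column name from the first call, which is the kind of surprise the Flake8 bugbear plugin flags.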
HyukjinKwon	5250841537	[SPARK-33256][PYTHON][DOCS] Clarify PySpark follows NumPy documentation style ### What changes were proposed in this pull request? This PR adds few lines about docstring style to document that PySpark follows [NumPy documentation style](https://numpydoc.readthedocs.io/en/latest/format.html). We all completed the migration to NumPy documentation style at SPARK-32085. Ideally we should have a page like https://pandas.pydata.org/docs/development/contributing_docstring.html but I would like to leave it as a future work. ### Why are the changes needed? To tell developers that PySpark now follows NumPy documentation style. ### Does this PR introduce _any_ user-facing change? No, it's a change in unreleased branches yet. ### How was this patch tested? Manually tested via `make clean html` under `python/docs`: ![Screen Shot 2020-12-06 at 1 34 50 PM](https://user-images.githubusercontent.com/6477701/101271623-d5ce0380-37c7-11eb-93ac-da73caa50c37.png) Closes #30622 from HyukjinKwon/SPARK-33256. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-06 01:22:24 -08:00
Dongjoon Hyun	de9818f043	[SPARK-33662][BUILD] Setting version to 3.2.0-SNAPSHOT ### What changes were proposed in this pull request? This PR aims to update `master` branch version to 3.2.0-SNAPSHOT. ### Why are the changes needed? Start to prepare Apache Spark 3.2.0. ### Does this PR introduce _any_ user-facing change? N/A. ### How was this patch tested? Pass the CIs. Closes #30606 from dongjoon-hyun/SPARK-3.2. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-04 14:10:42 -08:00
Weichen Xu	7e759b2d95	[SPARK-33520][ML][PYSPARK] make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/evaluator ### What changes were proposed in this pull request? Make the CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimators/models. ### Why are the changes needed? Currently, PySpark allows third-party libraries to define Python backend estimators/evaluators, i.e., estimators that inherit from `Estimator` instead of `JavaEstimator` and can only be used in PySpark. CrossValidator and TrainValidateSplit support tuning these Python backend estimators but cannot save or load them, because their writer implementation uses JavaMLWriter, which requires converting the nested estimator and evaluator into Java instances. OneVsRest saving/loading currently supports only Java backend classifiers due to a similar issue. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. Closes #30471 from WeichenXu123/support_pyio_tuning. Authored-by: Weichen Xu <weichen.xu@databricks.com> Signed-off-by: Weichen Xu <weichen.xu@databricks.com>	2020-12-04 08:35:50 +08:00
Gabor Somogyi	bd711863fd	[SPARK-33629][PYTHON] Make spark.buffer.size configuration visible on driver side ### What changes were proposed in this pull request? `spark.buffer.size` not applied in driver from pyspark. In this PR I've fixed this issue. ### Why are the changes needed? Apply the mentioned config on driver side. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing unit tests + manually. Added the following code temporarily: ``` def local_connect_and_auth(port, auth_secret): ... sock.connect(sa) print("SPARK_BUFFER_SIZE: %d" % int(os.environ.get("SPARK_BUFFER_SIZE", 65536))) <- This is the addition sockfile = sock.makefile("rwb", int(os.environ.get("SPARK_BUFFER_SIZE", 65536))) ... ``` Test: ``` #Compile Spark echo "spark.buffer.size 10000" >> conf/spark-defaults.conf $ ./bin/pyspark Python 3.8.5 (default, Jul 21 2020, 10:48:26) [Clang 11.0.3 (clang-1103.0.32.62)] on darwin Type "help", "copyright", "credits" or "license" for more information. 20/12/03 13:38:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 20/12/03 13:38:14 WARN SparkEnv: I/O encryption enabled without RPC encryption: keys will be visible on the wire. Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 3.1.0-SNAPSHOT /_/ Using Python version 3.8.5 (default, Jul 21 2020 10:48:26) Spark context Web UI available at http://192.168.0.189:4040 Spark context available as 'sc' (master = local[*], app id = local-1606999094506). SparkSession available as 'spark'. >>> sc.setLogLevel("TRACE") >>> sc.parallelize([0, 2, 3, 4, 6], 5).glom().collect() ... SPARK_BUFFER_SIZE: 10000 ... [[0], [2], [3], [4], [6]] >>> ``` Closes #30592 from gaborgsomogyi/SPARK-33629. 
Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-04 01:37:44 +09:00
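The temporary debugging snippet in this commit reads the buffer size from the `SPARK_BUFFER_SIZE` environment variable with a fallback of 65536. A minimal standalone sketch of that lookup follows; `read_buffer_size` and `roundtrip` are hypothetical names introduced for illustration, and the local socket pair stands in for the real driver connection, which this sketch does not reproduce.

```python
import os
import socket
from typing import Mapping, Optional

DEFAULT_BUFFER_SIZE = 65536  # fallback value used in the commit's snippet


def read_buffer_size(environ: Optional[Mapping[str, str]] = None) -> int:
    # Fall back to the process environment when no mapping is supplied.
    env = os.environ if environ is None else environ
    return int(env.get("SPARK_BUFFER_SIZE", DEFAULT_BUFFER_SIZE))


def roundtrip(payload: bytes, environ: Optional[Mapping[str, str]] = None) -> bytes:
    # Open a connected local socket pair and wrap one end in a buffered file,
    # mirroring the `sock.makefile("rwb", <buffer size>)` call from the snippet.
    left, right = socket.socketpair()
    with left, right:
        with left.makefile("rwb", read_buffer_size(environ)) as sockfile:
            sockfile.write(payload)
            sockfile.flush()  # push the buffered bytes onto the socket
            return right.recv(len(payload))
```

Passing a plain dict such as `{"SPARK_BUFFER_SIZE": "10000"}` makes the lookup easy to exercise without touching the real environment.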
Liang-Chi Hsieh	3b2ff16ee6	[SPARK-33636][PYTHON][ML][FOLLOWUP] Update since tag of labelsArray in StringIndexer ### What changes were proposed in this pull request? This is to update `labelsArray`'s since tag. ### Why are the changes needed? The original change was backported to branch-3.0 for 3.0.2 version. So it is better to update the since tag to reflect the fact. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? N/A. Just tag change. Closes #30582 from viirya/SPARK-33636-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-03 14:34:44 +09:00
Liang-Chi Hsieh	0880989755	[SPARK-22798][PYTHON][ML][FOLLOWUP] Add labelsArray to PySpark StringIndexer ### What changes were proposed in this pull request? This is a followup to add missing `labelsArray` to PySpark `StringIndexer`. ### Why are the changes needed? `labelsArray` is for multi-column case for `StringIndexer`. We should provide this accessor at PySpark side too. ### Does this PR introduce _any_ user-facing change? Yes, `labelsArray` was missing in PySpark `StringIndexer` in Spark 3.0. ### How was this patch tested? Unit test. Closes #30579 from viirya/SPARK-22798-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-03 10:57:14 +09:00
HyukjinKwon	1a042cc414	[SPARK-33530][CORE] Support --archives and spark.archives option natively ### What changes were proposed in this pull request? TL;DR: - This PR completes the support of archives in Spark itself instead of Yarn-only - It makes `--archives` option work in other cluster modes too and adds `spark.archives` configuration. - After this PR, PySpark users can leverage Conda to ship Python packages together as below: ```python conda create -y -n pyspark_env -c conda-forge pyarrow==2.0.0 pandas==1.1.4 conda-pack==0.5.0 conda activate pyspark_env conda pack -f -o pyspark_env.tar.gz PYSPARK_DRIVER_PYTHON=python PYSPARK_PYTHON=./environment/bin/python pyspark --archives pyspark_env.tar.gz#environment ``` - Issue a warning that undocumented and hidden behavior of partial archive handling in `spark.files` / `SparkContext.addFile` will be deprecated, and users can use `spark.archives` and `SparkContext.addArchive`. This PR proposes to add Spark's native `--archives` in Spark submit, and `spark.archives` configuration. Currently, both are supported only in Yarn mode: ```bash ./bin/spark-submit --help ``` ``` Options: ... Spark on YARN only: --queue QUEUE_NAME The YARN queue to submit to (Default: "default"). --archives ARCHIVES Comma separated list of archives to be extracted into the working directory of each executor. ``` This `archives` feature is useful often when you have to ship a directory and unpack into executors. One example is native libraries to use e.g. JNI. Another example is to ship Python packages together by Conda environment. Especially for Conda, PySpark currently does not have a nice way to ship a package that works in general, please see also https://hyukjin-spark.readthedocs.io/en/stable/user_guide/python_packaging.html#using-zipped-virtual-environment (PySpark new documentation demo for 3.1.0). 
The neatest way is arguably to use Conda environment by shipping zipped Conda environment but this is currently dependent on this archive feature. NOTE that we are able to use `spark.files` by relying on its undocumented behaviour that untars `tar.gz` but I don't think we should document such ways and promote people to more rely on it. Also, note that this PR does not target to add the feature parity of `spark.files.overwrite`, `spark.files.useFetchCache`, etc. yet. I documented that this is an experimental feature as well. ### Why are the changes needed? To complete the feature parity, and to provide a better support of shipping Python libraries together with Conda env. ### Does this PR introduce _any_ user-facing change? Yes, this makes `--archives` works in Spark instead of Yarn-only, and adds a new configuration `spark.archives`. ### How was this patch tested? I added unittests. Also, manually tested in standalone cluster, local-cluster, and local modes. Closes #30486 from HyukjinKwon/native-archive. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-01 13:43:02 +09:00
Weichen Xu	80161238fe	[SPARK-33592] Fix: Pyspark ML Validator params in estimatorParamMaps may be lost after saving and reloading ### What changes were proposed in this pull request? Fix: Pyspark ML Validator params in estimatorParamMaps may be lost after saving and reloading When saving validator estimatorParamMaps, will check all nested stages in tuned estimator to get correct param parent. Two typical cases to manually test: ~~~python tokenizer = Tokenizer(inputCol="text", outputCol="words") hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features") lr = LogisticRegression() pipeline = Pipeline(stages=[tokenizer, hashingTF, lr]) paramGrid = ParamGridBuilder() \ .addGrid(hashingTF.numFeatures, [10, 100]) \ .addGrid(lr.maxIter, [100, 200]) \ .build() tvs = TrainValidationSplit(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=MulticlassClassificationEvaluator()) tvs.save(tvsPath) loadedTvs = TrainValidationSplit.load(tvsPath) # check `loadedTvs.getEstimatorParamMaps()` restored correctly. ~~~ ~~~python lr = LogisticRegression() ova = OneVsRest(classifier=lr) grid = ParamGridBuilder().addGrid(lr.maxIter, [100, 200]).build() evaluator = MulticlassClassificationEvaluator() tvs = TrainValidationSplit(estimator=ova, estimatorParamMaps=grid, evaluator=evaluator) tvs.save(tvsPath) loadedTvs = TrainValidationSplit.load(tvsPath) # check `loadedTvs.getEstimatorParamMaps()` restored correctly. ~~~ ### Why are the changes needed? Bug fix. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. Closes #30539 from WeichenXu123/fix_tuning_param_maps_io. Authored-by: Weichen Xu <weichen.xu@databricks.com> Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>	2020-12-01 09:36:42 +08:00
Bryan Cutler	aeb3649fb9	[SPARK-33613][PYTHON][TESTS] Replace deprecated APIs in pyspark tests ### What changes were proposed in this pull request? This replaces deprecated API usage in PySpark tests with the preferred APIs. These have been deprecated for some time and usage is not consistent within tests. - https://docs.python.org/3/library/unittest.html#deprecated-aliases ### Why are the changes needed? For consistency and eventual removal of deprecated APIs. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests Closes #30557 from BryanCutler/replace-deprecated-apis-in-tests. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-01 10:34:40 +09:00
Weichen Xu	596fbc1d29	[SPARK-33556][ML] Add array_to_vector function for dataframe column ### What changes were proposed in this pull request? Add array_to_vector function for dataframe column ### Why are the changes needed? Utility function for array to vector conversion. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? scala unit test & doctest. Closes #30498 from WeichenXu123/array_to_vec. Lead-authored-by: Weichen Xu <weichen.xu@databricks.com> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-01 09:52:19 +09:00
Josh Soref	13fd272cd3	Spelling r common dev mlib external project streaming resource managers python ### What changes were proposed in this pull request? This PR intends to fix typos in the sub-modules: * `R` * `common` * `dev` * `mlib` * `external` * `project` * `streaming` * `resource-managers` * `python` Split per srowen https://github.com/apache/spark/pull/30323#issuecomment-728981618 NOTE: The misspellings have been reported at `706a726f87 (commitcomment-44064356)` ### Why are the changes needed? Misspelled words make it harder to read / understand content. ### Does this PR introduce _any_ user-facing change? There are various fixes to documentation, etc... ### How was this patch tested? No testing was performed Closes #30402 from jsoref/spelling-R_common_dev_mlib_external_project_streaming_resource-managers_python. Authored-by: Josh Soref <jsoref@users.noreply.github.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-11-27 10:22:45 -06:00
yangjie01	433ae9064f	[SPARK-33566][CORE][SQL][SS][PYTHON] Make unescapedQuoteHandling option configurable when read CSV ### What changes were proposed in this pull request? There are some differences between Spark CSV, opencsv, and commons-csv; the typical cases are described in SPARK-33566. When a value contains both unescaped quotes and an unescaped qualifier, the parsing results differ. The reason is that Spark uses `STOP_AT_DELIMITER` as the default `UnescapedQuoteHandling` to build `CsvParser`, and it is not configurable. On the other hand, opencsv and commons-csv use a parsing mechanism similar to `STOP_AT_CLOSING_QUOTE` by default. So this PR makes the `unescapedQuoteHandling` option configurable to get the same parsing result as opencsv and commons-csv. ### Why are the changes needed? Make the `unescapedQuoteHandling` option configurable when reading CSV to make parsing more flexible. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass the Jenkins or GitHub Action - Add a new case similar to that described in SPARK-33566 Closes #30518 from LuciferYang/SPARK-33566. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-27 15:47:39 +09:00
zero323	d082ad0abf	[SPARK-33563][PYTHON][R][SQL] Expose inverse hyperbolic trig functions in PySpark and SparkR ### What changes were proposed in this pull request? This PR adds the following functions (introduced in Scala API with SPARK-33061): - `acosh` - `asinh` - `atanh` to Python and R. ### Why are the changes needed? Feature parity. ### Does this PR introduce _any_ user-facing change? New functions. ### How was this patch tested? New unit tests. Closes #30501 from zero323/SPARK-33563. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-27 11:00:09 +09:00
shane knapp	c529426d87	[SPARK-33565][BUILD][PYTHON] remove python3.8 and fix breakage ### What changes were proposed in this pull request? remove python 3.8 from python/run-tests.py and stop build breaks ### Why are the changes needed? the python tests are running against the bare-bones system install of python3, rather than an anaconda environment. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? via jenkins Closes #30506 from shaneknapp/remove-py38. Authored-by: shane knapp <incomplete@gmail.com> Signed-off-by: shane knapp <incomplete@gmail.com>	2020-11-25 15:15:50 -08:00
zero323	01321bc0fe	[SPARK-33252][PYTHON][DOCS] Migration to NumPy documentation style in MLlib (pyspark.mllib.*) ### What changes were proposed in this pull request? This PR proposes migration of `pyspark.mllib` to NumPy documentation style. ### Why are the changes needed? To improve documentation style. Before: ![old](https://user-images.githubusercontent.com/1554276/100097941-90234980-2e5d-11eb-8b4d-c25d98d85191.png) After: ![new](https://user-images.githubusercontent.com/1554276/100097966-987b8480-2e5d-11eb-9e02-07b18c327624.png) ### Does this PR introduce _any_ user-facing change? Yes, this changes both rendered HTML docs and console representation (SPARK-33243). ### How was this patch tested? `dev/lint-python` and manual inspection. Closes #30413 from zero323/SPARK-33252. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-25 10:24:41 +09:00
zero323	665817bd4f	[SPARK-33457][PYTHON] Adjust mypy configuration ### What changes were proposed in this pull request? This pull request: - Adds the following flags to the main mypy configuration: - [`strict_optional`](https://mypy.readthedocs.io/en/stable/config_file.html#confval-strict_optional) - [`no_implicit_optional`](https://mypy.readthedocs.io/en/stable/config_file.html#confval-no_implicit_optional) - [`disallow_untyped_defs`](https://mypy.readthedocs.io/en/stable/config_file.html#confval-disallow_untyped_calls) These flags are enabled only for the public API and disabled for tests and internal modules. Additionally, this PR fixes missing annotations. ### Why are the changes needed? The primary reason to propose these changes is to use the standard configuration used by the typeshed project. This will allow us to be more strict, especially when interacting with JVM code. See for example https://github.com/apache/spark/pull/29122#pullrequestreview-513112882 Additionally, it will allow us to detect cases where annotations have been unintentionally omitted. ### Does this PR introduce _any_ user-facing change? Annotations only. ### How was this patch tested? `dev/lint-python`. Closes #30382 from zero323/SPARK-33457. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-25 09:27:04 +09:00
Gabor Somogyi	0bb911d979	[SPARK-33143][PYTHON] Add configurable timeout to python server and client ### What changes were proposed in this pull request? Spark creates local server to serialize several type of data for python. The python code tries to connect to the server, immediately after it's created but there are several system calls in between (this may change in each Spark version): * getaddrinfo * socket * settimeout * connect Under some circumstances in heavy user environments these calls can be super slow (more than 15 seconds). These issues must be analyzed one-by-one but since these are system calls the underlying OS and/or DNS servers must be debugged and fixed. This is not trivial task and at the same time data processing must work somehow. In this PR I'm only intended to add a configuration possibility to increase the mentioned timeouts in order to be able to provide temporary workaround. The rootcause analysis is ongoing but I think this can vary in each case. Because the server part doesn't contain huge amount of log entries to with one can measure time, I've added some. ### Why are the changes needed? Provide workaround when localhost python server connection timeout appears. ### Does this PR introduce _any_ user-facing change? Yes, new configuration added. ### How was this patch tested? Existing unit tests + manual test. ``` #Compile Spark echo "spark.io.encryption.enabled true" >> conf/spark-defaults.conf echo "spark.python.authenticate.socketTimeout 10" >> conf/spark-defaults.conf $ ./bin/pyspark Python 3.8.5 (default, Jul 21 2020, 10:48:26) [Clang 11.0.3 (clang-1103.0.32.62)] on darwin Type "help", "copyright", "credits" or "license" for more information. 20/11/20 10:17:03 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 
20/11/20 10:17:03 WARN SparkEnv: I/O encryption enabled without RPC encryption: keys will be visible on the wire. Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 3.1.0-SNAPSHOT /_/ Using Python version 3.8.5 (default, Jul 21 2020 10:48:26) Spark context Web UI available at http://192.168.0.189:4040 Spark context available as 'sc' (master = local[*], app id = local-1605863824276). SparkSession available as 'spark'. >>> sc.setLogLevel("TRACE") >>> sc.parallelize([0, 2, 3, 4, 6], 5).glom().collect() 20/11/20 10:17:09 TRACE PythonParallelizeServer: Creating listening socket 20/11/20 10:17:09 TRACE PythonParallelizeServer: Setting timeout to 10 sec 20/11/20 10:17:09 TRACE PythonParallelizeServer: Waiting for connection on port 59726 20/11/20 10:17:09 TRACE PythonParallelizeServer: Connection accepted from address /127.0.0.1:59727 20/11/20 10:17:09 TRACE PythonParallelizeServer: Client authenticated 20/11/20 10:17:09 TRACE PythonParallelizeServer: Closing server ... 20/11/20 10:17:10 TRACE SocketFuncServer: Creating listening socket 20/11/20 10:17:10 TRACE SocketFuncServer: Setting timeout to 10 sec 20/11/20 10:17:10 TRACE SocketFuncServer: Waiting for connection on port 59735 20/11/20 10:17:10 TRACE SocketFuncServer: Connection accepted from address /127.0.0.1:59736 20/11/20 10:17:10 TRACE SocketFuncServer: Client authenticated 20/11/20 10:17:10 TRACE SocketFuncServer: Closing server [[0], [2], [3], [4], [6]] >>> ``` Closes #30389 from gaborgsomogyi/SPARK-33143. Lead-authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Co-authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-23 15:19:34 +09:00
CC Highman	d338af3101	[SPARK-31962][SQL] Provide modifiedAfter and modifiedBefore options when filtering from a batch-based file data source ### What changes were proposed in this pull request? Two new options, _modifiedBefore_ and _modifiedAfter_, are provided, each expecting a value in 'YYYY-MM-DDTHH:mm:ss' format. _PartitioningAwareFileIndex_ considers these options while checking for files, just before considering applied _PathFilters_ such as `pathGlobFilter`. To filter file results, a new PathFilter class was derived for this purpose. General housekeeping around classes extending PathFilter was performed for neatness. It became apparent that support was needed to handle multiple potential path filters. Logic was introduced for this purpose and the associated tests were written. ### Why are the changes needed? When loading files from a data source, there can often be thousands of files within a given file path. In many cases I've seen, we want to start loading from a folder path and ideally be able to begin loading files with modification dates past a certain point. This would mean that, out of thousands of potential files, only the ones with modification dates greater than the specified timestamp would be considered. This saves a great deal of time automatically and significantly reduces the complexity of managing this in code. ### Does this PR introduce _any_ user-facing change? This PR introduces an option that can be used with batch-based Spark file data sources. A documentation update was made to reflect an example and usage of the new data source option.
Example Usages _Load all CSV files modified after date:_ `spark.read.format("csv").option("modifiedAfter","2020-06-15T05:00:00").load()` _Load all CSV files modified before date:_ `spark.read.format("csv").option("modifiedBefore","2020-06-15T05:00:00").load()` _Load all CSV files modified between two dates:_ `spark.read.format("csv").option("modifiedAfter","2019-01-15T05:00:00").option("modifiedBefore","2020-06-15T05:00:00").load() ` ### How was this patch tested? A handful of unit tests were added to support the positive, negative, and edge case code paths. It's also live in a handful of our Databricks dev environments. (quoted from cchighman) Closes #30411 from HeartSaVioR/SPARK-31962. Lead-authored-by: CC Highman <christopher.highman@microsoft.com> Co-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>	2020-11-23 08:30:41 +09:00
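The filtering semantics described above can be sketched in plain Python. This is not Spark's implementation: `filter_by_modification_time` is a hypothetical helper, the exclusive upper bound for `modified_before` is an assumption (the description only states "greater than" for the lower bound), and time zones are ignored.

```python
from datetime import datetime
from typing import Iterable, List, Optional, Tuple

# Python equivalent of the 'YYYY-MM-DDTHH:mm:ss' format named in the PR.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S"


def filter_by_modification_time(
    files: Iterable[Tuple[str, datetime]],
    modified_after: Optional[str] = None,
    modified_before: Optional[str] = None,
) -> List[str]:
    # Parse each bound only when the corresponding option was supplied.
    after = datetime.strptime(modified_after, TIMESTAMP_FORMAT) if modified_after else None
    before = datetime.strptime(modified_before, TIMESTAMP_FORMAT) if modified_before else None

    kept = []
    for path, mtime in files:
        if after is not None and not mtime > after:
            continue  # not strictly newer than the lower bound
        if before is not None and not mtime < before:
            continue  # assumption: the upper bound is exclusive as well
        kept.append(path)
    return kept


sample_files = [
    ("a.csv", datetime(2019, 1, 1)),
    ("b.csv", datetime(2020, 1, 1)),
    ("c.csv", datetime(2020, 7, 1)),
]
between = filter_by_modification_time(
    sample_files,
    modified_after="2019-01-15T05:00:00",
    modified_before="2020-06-15T05:00:00",
)
```

Supplying both bounds, as in the third usage example of the commit message, keeps only files modified inside the window.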
Ruifeng Zheng	116b7b72a1	[SPARK-33466][ML][PYTHON] Imputer support mode(most_frequent) strategy ### What changes were proposed in this pull request? impl a new strategy `mode`: replace missing using the most frequent value along each column. ### Why are the changes needed? it is highly scalable, and had been a function in [sklearn.impute.SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer) for a long time. ### Does this PR introduce _any_ user-facing change? Yes, a new strategy is added ### How was this patch tested? updated testsuites Closes #30397 from zhengruifeng/imputer_max_freq. Lead-authored-by: Ruifeng Zheng <ruifengz@foxmail.com> Co-authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-11-20 11:35:34 -06:00
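To illustrate what the `mode` strategy means, here is a plain-Python sketch for a single column. It is not the Spark ML `Imputer`: `impute_mode` is a hypothetical function, `None` is assumed to be the only missing-value marker, and ties are broken by first occurrence, which may differ from Spark's behavior.

```python
from collections import Counter
from typing import List, Optional


def impute_mode(values: List[Optional[float]]) -> List[float]:
    # Count only the observed (non-missing) entries of the column.
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")

    # most_common(1) returns the single most frequent value and its count.
    mode_value = Counter(observed).most_common(1)[0][0]

    # Replace every missing entry with the most frequent value.
    return [mode_value if v is None else v for v in values]


filled = impute_mode([1.0, None, 2.0, 2.0, None])
```

Unlike the mean, the mode is always a value that actually occurs in the column.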
HyukjinKwon	02d410a18c	[MINOR][DOCS] Document 'without' value for HADOOP_VERSION in pip installation ### What changes were proposed in this pull request? I believe it's self-descriptive. ### Why are the changes needed? To document supported features. ### Does this PR introduce _any_ user-facing change? Yes, the docs are updated. It's master only. ### How was this patch tested? Manually built the docs via `cd python/docs` and `make clean html`: ![Screen Shot 2020-11-20 at 10 59 07 AM](https://user-images.githubusercontent.com/6477701/99748225-7ad9b280-2b1f-11eb-86fd-165012b1bb7c.png) Closes #30436 from HyukjinKwon/minor-doc-fix. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-20 13:14:20 +09:00
zhengruifeng	689c294102	[SPARK-32907][ML][PYTHON] Adaptively blockify instances - AFT,LiR,LoR ### What changes were proposed in this pull request? use `maxBlockSizeInMB` instead of `blockSize` (#rows) to control the stacking of vectors; ### Why are the changes needed? the performance gain is mainly related to the nnz of block. ### Does this PR introduce _any_ user-facing change? yes, param blockSize -> blockSizeInMB in master ### How was this patch tested? updated testsuites Closes #30355 from zhengruifeng/adaptively_blockify_aft_lir_lor. Lead-authored-by: zhengruifeng <ruifengz@foxmail.com> Co-authored-by: Ruifeng Zheng <ruifengz@foxmail.com> Signed-off-by: Weichen Xu <weichen.xu@databricks.com>	2020-11-18 23:02:31 +08:00
Bryan Cutler	8e2a0bdce7	[SPARK-24554][PYTHON][SQL] Add MapType support for PySpark with Arrow ### What changes were proposed in this pull request? This change adds MapType support for PySpark with Arrow, if using pyarrow >= 2.0.0. ### Why are the changes needed? MapType was previously unsupported with Arrow. ### Does this PR introduce _any_ user-facing change? Users can now enable MapType for `createDataFrame()`, `toPandas()` with Arrow optimization, and with Pandas UDFs. ### How was this patch tested? Added new PySpark tests for createDataFrame(), toPandas() and Scalar Pandas UDFs. Closes #30393 from BryanCutler/arrow-add-MapType-SPARK-24554. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-18 21:18:19 +09:00
Liang-Chi Hsieh	7f3d99a8a5	[MINOR][SQL][DOCS] Update schema_of_csv and schema_of_json doc ### What changes were proposed in this pull request? This minor PR updates the docs of `schema_of_csv` and `schema_of_json`. They allow foldable string column instead of a string literal now. ### Why are the changes needed? The function doc of `schema_of_csv` and `schema_of_json` are not updated accordingly with previous PRs. ### Does this PR introduce _any_ user-facing change? Yes, update user-facing doc. ### How was this patch tested? Unit test. Closes #30396 from viirya/minor-json-csv. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-18 11:32:27 +09:00
HyukjinKwon	e2c7bfce40	[SPARK-33407][PYTHON] Simplify the exception message from Python UDFs (disabled by default) ### What changes were proposed in this pull request? This PR proposes to simplify the exception messages from Python UDFs. Currently, the exception message from Python UDFs is as below: ```python from pyspark.sql.functions import udf; spark.range(10).select(udf(lambda x: x/0)("id")).collect() ``` ```python Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../python/pyspark/sql/dataframe.py", line 427, in show print(self._jdf.showString(n, 20, vertical)) File "/.../python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__ File "/.../python/pyspark/sql/utils.py", line 127, in deco raise_from(converted) File "<string>", line 3, in raise_from pyspark.sql.utils.PythonException: An exception was thrown from Python worker in the executor: Traceback (most recent call last): File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 605, in main process() File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 597, in process serializer.dump_stream(out_iter, outfile) File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream for obj in iterator: File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched for item in iterator: File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper result = tuple(f([a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr> result = tuple(f([a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda> return lambda a: f(a) File "/.../python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper return f(*args, **kwargs) File "<stdin>", line 1, in <lambda> ZeroDivisionError: division by zero ``` Actually, in almost all cases, users only care about `ZeroDivisionError: division by zero`. We don't really have to show the internal stuff in 99% of cases. This PR adds a configuration `spark.sql.execution.pyspark.udf.simplifiedException.enabled` (disabled by default) that hides the internal tracebacks related to the Python worker, (de)serialization, etc. ```python Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../python/pyspark/sql/dataframe.py", line 427, in show print(self._jdf.showString(n, 20, vertical)) File "/.../python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__ File "/.../python/pyspark/sql/utils.py", line 127, in deco raise_from(converted) File "<string>", line 3, in raise_from pyspark.sql.utils.PythonException: An exception was thrown from Python worker in the executor: Traceback (most recent call last): File "<stdin>", line 1, in <lambda> ZeroDivisionError: division by zero ``` The traceback will be shown from the point when any non-PySpark file is seen in the traceback. ### Why are the changes needed? Without this configuration, such internal tracebacks are exposed to users directly, especially for shell or notebook users in PySpark. In 99% of cases, people don't care about the internal Python worker, (de)serialization, and related tracebacks. They just make the exception more difficult to read. For example, one statement of `x/0` above shows a very long traceback, and most of it is unnecessary. This configuration enables the ability to show simplified tracebacks, which users will likely be most interested in. ### Does this PR introduce _any_ user-facing change? By default, no. It adds one configuration that simplifies the exception message. See the example above. ### How was this patch tested?
Manually tested: ```bash $ pyspark --conf spark.sql.execution.pyspark.udf.simplifiedException.enabled=true ``` ```python from pyspark.sql.functions import udf; spark.sparkContext.setLogLevel("FATAL"); spark.range(10).select(udf(lambda x: x/0)("id")).collect() ``` and unittests were also added. Closes #30309 from HyukjinKwon/SPARK-33407. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-17 14:15:31 +09:00
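The rule described in SPARK-33407 above (show the traceback only from the first non-PySpark file onward) can be illustrated with a minimal sketch. This is not the code from the patch: `simplify_frames` and the `internal_marker` substring check are hypothetical names and assumptions made for illustration, and the sketch works on `traceback.FrameSummary` objects rather than on live traceback objects.

```python
import traceback


def simplify_frames(frames, internal_marker="pyspark"):
    """Return the frames starting at the first one outside the internal package.

    `internal_marker` is an illustrative assumption: a frame counts as
    internal when its filename contains this substring.
    """
    for index, frame in enumerate(frames):
        if internal_marker not in frame.filename:
            return frames[index:]
    # Only internal frames were found, so keep the traceback unchanged.
    return frames


# Frames modelled loosely on the worker-side traceback quoted above.
frames = [
    traceback.FrameSummary("/x/pyspark/worker.py", 605, "main", lookup_line=False),
    traceback.FrameSummary("/x/pyspark/serializers.py", 223, "dump_stream", lookup_line=False),
    traceback.FrameSummary("<stdin>", 1, "<lambda>", lookup_line=False),
]
kept = simplify_frames(frames)
```

With the sample frames, only the `<stdin>` frame that raised `ZeroDivisionError` would remain, which mirrors the shortened output shown in the commit message.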
itholic	236c6c9f7c	[SPARK-33253][PYTHON][DOCS] Migration to NumPy documentation style in Streaming (pyspark.streaming.*) ### What changes were proposed in this pull request? This PR proposes to migrate to [NumPy documentation style](https://numpydoc.readthedocs.io/en/latest/format.html), see also [SPARK-33243](https://issues.apache.org/jira/browse/SPARK-33243). ### Why are the changes needed? For better documentation as text itself, and generated HTMLs ### Does this PR introduce _any_ user-facing change? Yes, they will see a better format of HTMLs, and better text format. See [SPARK-33243](https://issues.apache.org/jira/browse/SPARK-33243). ### How was this patch tested? Manually tested via running ./dev/lint-python. Closes #30346 from itholic/SPARK-32085. Lead-authored-by: itholic <haejoon309@naver.com> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-16 10:44:57 +09:00
zero323	52073ef8ac	[SPARK-33254][PYTHON][DOCS] Migration to NumPy documentation style in Core (pyspark., pyspark.resource., etc.) ### What changes were proposed in this pull request? This PR proposes migration of Core to NumPy documentation style. ### Why are the changes needed? To improve documentation style. ### Does this PR introduce _any_ user-facing change? Yes, this changes both rendered HTML docs and console representation (SPARK-33243). ### How was this patch tested? dev/lint-python and manual inspection. Closes #30320 from zero323/SPARK-33254. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-16 10:21:50 +09:00
xuewei.linxuewei	234711a328	Revert "[SPARK-33139][SQL] protect setActionSession and clearActiveSession" ### What changes were proposed in this pull request? In [SPARK-33139] we marked `setActiveSession` and `clearActiveSession` as deprecated APIs. It turns out they are widely used, and after discussion we concluded that the unify view feature should work even without that PR; there is only a risk if users really abuse these two APIs. So the PR needs to be reverted. [SPARK-33139] has two commits, including a follow-up. Revert them both. ### Why are the changes needed? Revert. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing UT. Closes #30367 from leanken/leanken-revert-SPARK-33139. Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-13 13:35:45 +00:00
zhengruifeng	a2887164bc	[SPARK-32907][ML][PYTHON] adaptively blockify instances - LinearSVC ### What changes were proposed in this pull request? 1, use `maxBlockSizeInMB` instead of `blockSize`(#rows) to control the stacking of vectors; 2, infer an appropriate `maxBlockSizeInMB` if set 0; ### Why are the changes needed? the performance gain is mainly related to the nnz of block. f2jBLAS \| \| \| \| \| \| \| \| \| \| \| \| \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- Duration(millisecond) \| branch 3.0 Impl \| blockSizeInMB=0.0625 \| blockSizeInMB=0.125 \| blockSizeInMB=0.25 \| blockSizeInMB=0.5 \| blockSizeInMB=1 \| blockSizeInMB=2 \| blockSizeInMB=4 \| blockSizeInMB=8 \| blockSizeInMB=16 \| blockSizeInMB=32 \| blockSizeInMB=64 \| blockSizeInMB=128 epsilon(100%) \| 326481 \| 26143 \| 25710 \| 24726 \| 25395 \| 25840 \| 26846 \| 25927 \| 27431 \| 26190 \| 26056 \| 26347 \| 27204 epsilon3000(67%) \| 455247 \| 35893 \| 34366 \| 34985 \| 38387 \| 38901 \| 40426 \| 40044 \| 39161 \| 38767 \| 39965 \| 39523 \| 39108 epsilon4000(50%) \| 306390 \| 42256 \| 41164 \| 43748 \| 48638 \| 50892 \| 50986 \| 51091 \| 51072 \| 51289 \| 51652 \| 53312 \| 52146 epsilon5000(40%) \| 307619 \| 43639 \| 42992 \| 44743 \| 50800 \| 51939 \| 51871 \| 52190 \| 53850 \| 52607 \| 51062 \| 52509 \| 51570 epsilon10000(20%) \| 310070 \| 58371 \| 55921 \| 56317 \| 56618 \| 53694 \| 52131 \| 51768 \| 51728 \| 52233 \| 51881 \| 51653 \| 52440 epsilon20000(10%) \| 316565 \| 109193 \| 95121 \| 82764 \| 69653 \| 60764 \| 56066 \| 53371 \| 52822 \| 52872 \| 52769 \| 52527 \| 53508 epsilon200000(1%) \| 336181 \| 1569721 \| 1069355 \| 673718 \| 375043 \| 218230 \| 145393 \| 110926 \| 94327 \| 87039 \| 83926 \| 81890 \| 81787 \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| Speedup \| \| \| \| \| \| \| \| \| \| \| \| epsilon(100%) \| 1 \| 12.48827602 \| 12.69859977 \| 13.20395535 \| 12.85611341 \| 12.63471362 \| 12.16125307 \| 
12.59231689 \| 11.90189931 \| 12.46586483 \| 12.5299739 \| 12.39158158 \| 12.00121306 epsilon3000(67%) \| 1 \| 12.68344803 \| 13.2470174 \| 13.01263399 \| 11.85940553 \| 11.70270687 \| 11.26124276 \| 11.36866946 \| 11.62500958 \| 11.74315784 \| 11.39114225 \| 11.51853351 \| 11.64076404 epsilon4000(50%) \| 1 \| 7.250804619 \| 7.443154212 \| 7.003520161 \| 6.299395534 \| 6.020396133 \| 6.00929667 \| 5.996946625 \| 5.999177632 \| 5.973795551 \| 5.931812902 \| 5.747111345 \| 5.875618456 epsilon5000(40%) \| 1 \| 7.049176196 \| 7.155261444 \| 6.875243055 \| 6.055492126 \| 5.92269778 \| 5.930462108 \| 5.894213451 \| 5.712516249 \| 5.847491779 \| 6.024421292 \| 5.858405226 \| 5.965076595 epsilon10000(20%) \| 1 \| 5.312055644 \| 5.544786395 \| 5.505797539 \| 5.4765269 \| 5.774760681 \| 5.947900481 \| 5.98960748 \| 5.994239097 \| 5.93628549 \| 5.976561747 \| 6.002942714 \| 5.912852784 epsilon20000(10%) \| 1 \| 2.899132728 \| 3.328024306 \| 3.824911797 \| 4.544886796 \| 5.209745902 \| 5.64629187 \| 5.931404695 \| 5.993052137 \| 5.987384627 \| 5.999071425 \| 6.026710073 \| 5.916218136 epsilon200000(1%) \| 1 \| 0.214166084 \| 0.314377358 \| 0.498993644 \| 0.896379882 \| 1.540489392 \| 2.312222734 \| 3.03067811 \| 3.563995463 \| 3.862417997 \| 4.005683578 \| 4.105275369 \| 4.110445425 OpenBLAS \| \| \| \| \| \| \| \| \| \| \| \| \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- Duration(millisecond) \| branch 3.0 Impl \| blockSizeInMB=0.0625 \| blockSizeInMB=0.125 \| blockSizeInMB=0.25 \| blockSizeInMB=0.5 \| blockSizeInMB=1 \| blockSizeInMB=2 \| blockSizeInMB=4 \| blockSizeInMB=8 \| blockSizeInMB=16 \| blockSizeInMB=32 \| blockSizeInMB=64 \| blockSizeInMB=128 epsilon(100%) \| 299119 \| 26047 \| 25049 \| 25239 \| 28001 \| 35138 \| 36438 \| 36279 \| 36114 \| 35111 \| 35428 \| 36295 \| 35197 epsilon3000(67%) \| 439798 \| 33321 \| 34423 \| 34336 \| 38906 \| 51756 \| 54138 \| 54085 \| 53412 \| 54766 \| 54425 \| 54221 \| 54842 epsilon4000(50%) \| 
302963 \| 42960 \| 40678 \| 43483 \| 48254 \| 50888 \| 54990 \| 52647 \| 51947 \| 51843 \| 52891 \| 53410 \| 52020 epsilon5000(40%) \| 303569 \| 44225 \| 44961 \| 45065 \| 51768 \| 52776 \| 51930 \| 53587 \| 53104 \| 51833 \| 52138 \| 52574 \| 53756 epsilon10000(20%) \| 307403 \| 58447 \| 55993 \| 56757 \| 56694 \| 54038 \| 52734 \| 52073 \| 52051 \| 52150 \| 51986 \| 52407 \| 52390 epsilon20000(10%) \| 313344 \| 107580 \| 94679 \| 83329 \| 70226 \| 60996 \| 57130 \| 55461 \| 54641 \| 52712 \| 52541 \| 53101 \| 53312 epsilon200000(1%) \| 334679 \| 1642726 \| 1073148 \| 654481 \| 364974 \| 213881 \| 140248 \| 107579 \| 91757 \| 85090 \| 81940 \| 80492 \| 80250 \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| Speedup \| \| \| \| \| \| \| \| \| \| \| \| epsilon(100%) \| 1 \| 11.48381771 \| 11.94135494 \| 11.85146004 \| 10.68243991 \| 8.512692811 \| 8.208985125 \| 8.244962651 \| 8.282632774 \| 8.519238985 \| 8.443011178 \| 8.241328007 \| 8.498423161 epsilon3000(67%) \| 1 \| 13.19882356 \| 12.7762833 \| 12.80865564 \| 11.30411762 \| 8.497526857 \| 8.123646976 \| 8.131607655 \| 8.234067251 \| 8.030493372 \| 8.080808452 \| 8.111211523 \| 8.01936472 epsilon4000(50%) \| 1 \| 7.052211359 \| 7.44783421 \| 6.967389555 \| 6.278505409 \| 5.953525389 \| 5.509419895 \| 5.754610899 \| 5.832155851 \| 5.843855487 \| 5.728063376 \| 5.672402172 \| 5.823971549 epsilon5000(40%) \| 1 \| 6.86419446 \| 6.751829363 \| 6.736247642 \| 5.864027971 \| 5.752027437 \| 5.845734643 \| 5.664974714 \| 5.716499699 \| 5.856674319 \| 5.822413595 \| 5.774127896 \| 5.647164968 epsilon10000(20%) \| 1 \| 5.259517169 \| 5.490025539 \| 5.416124883 \| 5.422143437 \| 5.688645028 \| 5.829313157 \| 5.903308816 \| 5.905803923 \| 5.894592522 \| 5.913188166 \| 5.865685882 \| 5.867589235 epsilon20000(10%) \| 1 \| 2.912660346 \| 3.309540658 \| 3.760323537 \| 4.461937174 \| 5.137123746 \| 5.48475407 \| 5.649807973 \| 5.734594901 \| 5.944452876 \| 5.963799699 \| 5.900905821 \| 5.87755102 
epsilon200000(1%) \| 1 \| 0.203733915 \| 0.311866583 \| 0.511365494 \| 0.916994087 \| 1.564790701 \| 2.38633706 \| 3.111006795 \| 3.647449241 \| 3.933235398 \| 4.084439834 \| 4.157916315 \| 4.170454829 ### Does this PR introduce _any_ user-facing change? yes, param `blockSize` -> `blockSizeInMB` in master ### How was this patch tested? added testsuites and performance test (result attached in [ticket](https://issues.apache.org/jira/browse/SPARK-32907)) Closes #30009 from zhengruifeng/adaptively_blockify_linear_svc_II. Lead-authored-by: zhengruifeng <ruifengz@foxmail.com> Co-authored-by: Weichen Xu <weichen.xu@databricks.com> Signed-off-by: Weichen Xu <weichen.xu@databricks.com>	2020-11-12 19:14:07 +08:00
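The idea in SPARK-32907 above, bounding a block by memory rather than by row count, can be sketched in plain Python. This is a hypothetical illustration, not Spark's implementation: `blockify` and the size estimate of 8 bytes per stored value are assumptions, and real sparse vectors would need a different estimate.

```python
def blockify(vectors, max_block_size_in_mb):
    """Group vectors into blocks whose estimated size stays within the limit.

    Assumption for illustration: each vector is a list of floats and
    costs 8 bytes per element.
    """
    max_bytes = int(max_block_size_in_mb * 1024 * 1024)
    block, used = [], 0
    for vec in vectors:
        size = 8 * len(vec)
        # Close the current block before it would exceed the budget.
        if block and used + size > max_bytes:
            yield block
            block, used = [], 0
        block.append(vec)
        used += size
    if block:
        yield block


# Five dense vectors of 1024 doubles (8 KiB each) with a 16 KiB budget.
vectors = [[0.0] * 1024 for _ in range(5)]
block_sizes = [len(b) for b in blockify(vectors, 0.015625)]
```

A vector larger than the budget still gets a block of its own in this sketch, so no input is dropped.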
Ruifeng Zheng	6244407ce6	Revert "[WIP] Test (#30327 )" This reverts commit `61ee5d8a4e`. ### What changes were proposed in this pull request? I need to merge https://github.com/apache/spark/pull/30327 to https://github.com/apache/spark/pull/30009, but I merged it to master by mistake. ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #30345 from zhengruifeng/revert-30327-adaptively_blockify_linear_svc_II. Authored-by: Ruifeng Zheng <ruifengz@foxmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-12 11:32:12 +09:00
WeichenXu	61ee5d8a4e	[WIP] Test (#30327 ) * resend * address comments * directly gen new Iter * directly gen new Iter * update blockify strategy * address comments * try to fix 2.13 * try to fix scala 2.13 * use 1.0 as the default value for gemv * update Co-authored-by: zhengruifeng <ruifengz@foxmail.com>	2020-11-12 10:20:33 +08:00
zero323	4b76a74f1c	[SPARK-33415][PYTHON][SQL] Don't encode JVM response in Column.__repr__ ### What changes were proposed in this pull request? Removes encoding of the JVM response in `pyspark.sql.column.Column.__repr__`. ### Why are the changes needed? API consistency and improved readability of the expressions. ### Does this PR introduce _any_ user-facing change? Before this change col("abc") col("wąż") result in Column<b'abc'> Column<b'w\xc4\x85\xc5\xbc'> After this change we'll get Column<'abc'> Column<'wąż'> ### How was this patch tested? Existing tests and manual inspection. Closes #30322 from zero323/SPARK-33415. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-12 00:13:17 +09:00
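The before/after output in SPARK-33415 above follows from how Python formats `bytes` versus `str`. The sketch below reproduces the two shapes without Spark; `old_repr` and `new_repr` are hypothetical helpers written for illustration and are not the actual `Column.__repr__` code.

```python
def old_repr(name):
    # Encoding first means the bytes repr is embedded, escapes included.
    return "Column<%r>" % name.encode("utf8")


def new_repr(name):
    # Formatting the text directly keeps non-ASCII characters readable.
    return "Column<'%s'>" % name


before = old_repr("wąż")
after = new_repr("wąż")
```

Here `before` holds the escaped bytes form and `after` holds the readable form, matching the strings quoted in the commit message.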
zero323	122c8999cb	[SPARK-33251][FOLLOWUP][PYTHON][DOCS][MINOR] Adjusts returns PrefixSpan.findFrequentSequentialPatterns ### What changes were proposed in this pull request? Changes pyspark.sql.dataframe.DataFrame to :py:class:`pyspark.sql.DataFrame` ### Why are the changes needed? Consistency (see https://github.com/apache/spark/pull/30285#pullrequestreview-526764104). ### Does this PR introduce _any_ user-facing change? User will see shorter reference with a link. ### How was this patch tested? `dev/lint-python` and manual check of the rendered docs. Closes #30313 from zero323/SPARK-33251-FOLLOW-UP. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: Huaxin Gao <huaxing@us.ibm.com>	2020-11-10 09:17:00 -08:00
lrz	27bb40b629	[SPARK-33339][PYTHON] Pyspark application will hang due to non Exception error ### What changes were proposed in this pull request? When a `SystemExit` exception occurs during processing, the Python worker exits abnormally while the executor task is still waiting to read from the socket, causing it to hang. The `SystemExit` exception may be caused by an error in the user's code, but Spark should at least throw an error to alert the user rather than get stuck. We can run a simple test to reproduce this case: ``` from pyspark.sql import SparkSession def err(line): raise SystemExit spark = SparkSession.builder.appName("test").getOrCreate() spark.sparkContext.parallelize(range(1,2), 2).map(err).collect() spark.stop() ``` ### Why are the changes needed? To make sure a PySpark application won't hang if there is a non-Exception error in the Python worker. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added a new test and also manually tested the case above. Closes #30248 from li36909/pyspark. Lead-authored-by: lrz <lrz@lrzdeMacBook-Pro.local> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-10 19:39:18 +09:00
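The "non Exception error" in SPARK-33339 above refers to the fact that `SystemExit` derives from `BaseException`, not `Exception`, so an `except Exception` handler never sees it. The sketch below shows that behaviour in isolation; `run_task` and `run_task_safely` are hypothetical helpers and do not reflect how the patch itself reports the error.

```python
def err():
    raise SystemExit


def run_task(func):
    """Handler that only covers Exception subclasses."""
    try:
        return ("ok", func())
    except Exception as exc:
        return ("error", type(exc).__name__)


def run_task_safely(func):
    """Handler that also covers SystemExit (note: KeyboardInterrupt too)."""
    try:
        return ("ok", func())
    except BaseException as exc:
        return ("error", type(exc).__name__)


result = run_task_safely(err)

try:
    run_task(err)
    escaped = False
except SystemExit:
    # SystemExit slipped past the `except Exception` clause.
    escaped = True
```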
neko	4360c6f12a	[SPARK-33363] Add prompt information related to the current task when pyspark/sparkR starts ### What changes were proposed in this pull request? Add prompt information about the current applicationId, the current URL, and master info when pyspark/sparkR starts. ### Why are the changes needed? The information printed when pyspark/sparkR starts does not include basic information about the current application, which is inconvenient when using pyspark/sparkR in DOS. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? Manual test; the results are shown below: ![pyspark new print](https://user-images.githubusercontent.com/52202080/98274268-2a663f00-1fce-11eb-88ce-964ce90b439e.png) ![sparkR](https://user-images.githubusercontent.com/52202080/98541235-1a01dd00-22ca-11eb-9304-09bcde87b05e.png) Closes #30266 from akiyamaneko/pyspark-hint-info. Authored-by: neko <echohlne@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-10 11:12:19 +09:00
zero323	090962cd42	[SPARK-33251][PYTHON][DOCS] Migration to NumPy documentation style in ML (pyspark.ml.*) ### What changes were proposed in this pull request? This PR proposes migration of `pyspark.ml` to NumPy documentation style. ### Why are the changes needed? To improve documentation style. ### Does this PR introduce _any_ user-facing change? Yes, this changes both rendered HTML docs and console representation (SPARK-33243). ### How was this patch tested? `dev/lint-python` and manual inspection. Closes #30285 from zero323/SPARK-33251. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-10 09:33:48 +09:00
HyukjinKwon	e11a24c1ba	[SPARK-33371][PYTHON] Update setup.py and tests for Python 3.9 ### What changes were proposed in this pull request? This PR proposes to fix PySpark to officially support Python 3.9. The main code already works. We should just note that we support Python 3.9. Also, this PR makes some minor fixes to the test code. - `Thread.isAlive` is removed in Python 3.9, and `Thread.is_alive` exists in Python 3.6+, see https://docs.python.org/3/whatsnew/3.9.html#removed - Fixed `TaskContextTestsWithWorkerReuse.test_barrier_with_python_worker_reuse` and `TaskContextTests.test_barrier` to be less flaky. These became flakier in Python 3.9 for some reason. NOTE that PyArrow does not support Python 3.9 yet. ### Why are the changes needed? To officially support Python 3.9. ### Does this PR introduce _any_ user-facing change? Yes, it officially supports Python 3.9. ### How was this patch tested? Manually ran the tests: ``` $ ./run-tests --python-executable=python Running PySpark tests. 
Output is in /.../spark/python/unit-tests.log Will test against the following Python executables: ['python'] Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-resource', 'pyspark-sql', 'pyspark-streaming'] python python_implementation is CPython python version is: Python 3.9.0 Starting test(python): pyspark.ml.tests.test_base Starting test(python): pyspark.ml.tests.test_evaluation Starting test(python): pyspark.ml.tests.test_algorithms Starting test(python): pyspark.ml.tests.test_feature Finished test(python): pyspark.ml.tests.test_base (12s) Starting test(python): pyspark.ml.tests.test_image Finished test(python): pyspark.ml.tests.test_evaluation (15s) Starting test(python): pyspark.ml.tests.test_linalg Finished test(python): pyspark.ml.tests.test_feature (25s) Starting test(python): pyspark.ml.tests.test_param Finished test(python): pyspark.ml.tests.test_image (17s) Starting test(python): pyspark.ml.tests.test_persistence Finished test(python): pyspark.ml.tests.test_param (17s) Starting test(python): pyspark.ml.tests.test_pipeline Finished test(python): pyspark.ml.tests.test_linalg (30s) Starting test(python): pyspark.ml.tests.test_stat Finished test(python): pyspark.ml.tests.test_pipeline (6s) Starting test(python): pyspark.ml.tests.test_training_summary Finished test(python): pyspark.ml.tests.test_stat (12s) Starting test(python): pyspark.ml.tests.test_tuning Finished test(python): pyspark.ml.tests.test_algorithms (68s) Starting test(python): pyspark.ml.tests.test_wrapper Finished test(python): pyspark.ml.tests.test_persistence (51s) Starting test(python): pyspark.mllib.tests.test_algorithms Finished test(python): pyspark.ml.tests.test_training_summary (33s) Starting test(python): pyspark.mllib.tests.test_feature Finished test(python): pyspark.ml.tests.test_wrapper (19s) Starting test(python): pyspark.mllib.tests.test_linalg Finished test(python): pyspark.mllib.tests.test_feature (26s) Starting test(python): 
pyspark.mllib.tests.test_stat Finished test(python): pyspark.mllib.tests.test_stat (22s) Starting test(python): pyspark.mllib.tests.test_streaming_algorithms Finished test(python): pyspark.mllib.tests.test_algorithms (53s) Starting test(python): pyspark.mllib.tests.test_util Finished test(python): pyspark.mllib.tests.test_linalg (54s) Starting test(python): pyspark.sql.tests.test_arrow Finished test(python): pyspark.sql.tests.test_arrow (0s) ... 61 tests were skipped Starting test(python): pyspark.sql.tests.test_catalog Finished test(python): pyspark.mllib.tests.test_util (11s) Starting test(python): pyspark.sql.tests.test_column Finished test(python): pyspark.sql.tests.test_catalog (16s) Starting test(python): pyspark.sql.tests.test_conf Finished test(python): pyspark.sql.tests.test_column (17s) Starting test(python): pyspark.sql.tests.test_context Finished test(python): pyspark.sql.tests.test_context (6s) ... 3 tests were skipped Starting test(python): pyspark.sql.tests.test_dataframe Finished test(python): pyspark.sql.tests.test_conf (11s) Starting test(python): pyspark.sql.tests.test_datasources Finished test(python): pyspark.sql.tests.test_datasources (19s) Starting test(python): pyspark.sql.tests.test_functions Finished test(python): pyspark.sql.tests.test_dataframe (35s) ... 3 tests were skipped Starting test(python): pyspark.sql.tests.test_group Finished test(python): pyspark.sql.tests.test_functions (32s) Starting test(python): pyspark.sql.tests.test_pandas_cogrouped_map Finished test(python): pyspark.sql.tests.test_pandas_cogrouped_map (1s) ... 15 tests were skipped Starting test(python): pyspark.sql.tests.test_pandas_grouped_map Finished test(python): pyspark.sql.tests.test_group (19s) Starting test(python): pyspark.sql.tests.test_pandas_map Finished test(python): pyspark.sql.tests.test_pandas_grouped_map (0s) ... 
21 tests were skipped Starting test(python): pyspark.sql.tests.test_pandas_udf Finished test(python): pyspark.sql.tests.test_pandas_map (0s) ... 6 tests were skipped Starting test(python): pyspark.sql.tests.test_pandas_udf_grouped_agg Finished test(python): pyspark.sql.tests.test_pandas_udf (0s) ... 6 tests were skipped Starting test(python): pyspark.sql.tests.test_pandas_udf_scalar Finished test(python): pyspark.sql.tests.test_pandas_udf_grouped_agg (0s) ... 13 tests were skipped Starting test(python): pyspark.sql.tests.test_pandas_udf_typehints Finished test(python): pyspark.sql.tests.test_pandas_udf_scalar (0s) ... 50 tests were skipped Starting test(python): pyspark.sql.tests.test_pandas_udf_window Finished test(python): pyspark.sql.tests.test_pandas_udf_typehints (0s) ... 10 tests were skipped Starting test(python): pyspark.sql.tests.test_readwriter Finished test(python): pyspark.sql.tests.test_pandas_udf_window (0s) ... 14 tests were skipped Starting test(python): pyspark.sql.tests.test_serde Finished test(python): pyspark.sql.tests.test_serde (19s) Starting test(python): pyspark.sql.tests.test_session Finished test(python): pyspark.mllib.tests.test_streaming_algorithms (120s) Starting test(python): pyspark.sql.tests.test_streaming Finished test(python): pyspark.sql.tests.test_readwriter (25s) Starting test(python): pyspark.sql.tests.test_types Finished test(python): pyspark.ml.tests.test_tuning (208s) Starting test(python): pyspark.sql.tests.test_udf Finished test(python): pyspark.sql.tests.test_session (31s) Starting test(python): pyspark.sql.tests.test_utils Finished test(python): pyspark.sql.tests.test_streaming (35s) Starting test(python): pyspark.streaming.tests.test_context Finished test(python): pyspark.sql.tests.test_types (34s) Starting test(python): pyspark.streaming.tests.test_dstream Finished test(python): pyspark.sql.tests.test_utils (14s) Starting test(python): pyspark.streaming.tests.test_kinesis Finished test(python): 
pyspark.streaming.tests.test_kinesis (0s) ... 2 tests were skipped Starting test(python): pyspark.streaming.tests.test_listener Finished test(python): pyspark.streaming.tests.test_listener (11s) Starting test(python): pyspark.tests.test_appsubmit Finished test(python): pyspark.sql.tests.test_udf (39s) Starting test(python): pyspark.tests.test_broadcast Finished test(python): pyspark.streaming.tests.test_context (23s) Starting test(python): pyspark.tests.test_conf Finished test(python): pyspark.tests.test_conf (15s) Starting test(python): pyspark.tests.test_context Finished test(python): pyspark.tests.test_broadcast (33s) Starting test(python): pyspark.tests.test_daemon Finished test(python): pyspark.tests.test_daemon (5s) Starting test(python): pyspark.tests.test_install_spark Finished test(python): pyspark.tests.test_context (44s) Starting test(python): pyspark.tests.test_join Finished test(python): pyspark.tests.test_appsubmit (68s) Starting test(python): pyspark.tests.test_profiler Finished test(python): pyspark.tests.test_join (7s) Starting test(python): pyspark.tests.test_rdd Finished test(python): pyspark.tests.test_profiler (9s) Starting test(python): pyspark.tests.test_rddbarrier Finished test(python): pyspark.tests.test_rddbarrier (7s) Starting test(python): pyspark.tests.test_readwrite Finished test(python): pyspark.streaming.tests.test_dstream (107s) Starting test(python): pyspark.tests.test_serializers Finished test(python): pyspark.tests.test_serializers (8s) Starting test(python): pyspark.tests.test_shuffle Finished test(python): pyspark.tests.test_readwrite (14s) Starting test(python): pyspark.tests.test_taskcontext Finished test(python): pyspark.tests.test_install_spark (65s) Starting test(python): pyspark.tests.test_util Finished test(python): pyspark.tests.test_shuffle (8s) Starting test(python): pyspark.tests.test_worker Finished test(python): pyspark.tests.test_util (5s) Starting test(python): pyspark.accumulators Finished test(python): 
pyspark.accumulators (5s) Starting test(python): pyspark.broadcast Finished test(python): pyspark.broadcast (6s) Starting test(python): pyspark.conf Finished test(python): pyspark.tests.test_worker (14s) Starting test(python): pyspark.context Finished test(python): pyspark.conf (4s) Starting test(python): pyspark.ml.classification Finished test(python): pyspark.tests.test_rdd (60s) Starting test(python): pyspark.ml.clustering Finished test(python): pyspark.context (21s) Starting test(python): pyspark.ml.evaluation Finished test(python): pyspark.tests.test_taskcontext (69s) Starting test(python): pyspark.ml.feature Finished test(python): pyspark.ml.evaluation (26s) Starting test(python): pyspark.ml.fpm Finished test(python): pyspark.ml.clustering (45s) Starting test(python): pyspark.ml.functions Finished test(python): pyspark.ml.fpm (24s) Starting test(python): pyspark.ml.image Finished test(python): pyspark.ml.functions (17s) Starting test(python): pyspark.ml.linalg.__init__ Finished test(python): pyspark.ml.linalg.__init__ (0s) Starting test(python): pyspark.ml.recommendation Finished test(python): pyspark.ml.classification (74s) Starting test(python): pyspark.ml.regression Finished test(python): pyspark.ml.image (8s) Starting test(python): pyspark.ml.stat Finished test(python): pyspark.ml.stat (29s) Starting test(python): pyspark.ml.tuning Finished test(python): pyspark.ml.regression (53s) Starting test(python): pyspark.mllib.classification Finished test(python): pyspark.ml.tuning (35s) Starting test(python): pyspark.mllib.clustering Finished test(python): pyspark.ml.feature (103s) Starting test(python): pyspark.mllib.evaluation Finished test(python): pyspark.mllib.classification (33s) Starting test(python): pyspark.mllib.feature Finished test(python): pyspark.mllib.evaluation (21s) Starting test(python): pyspark.mllib.fpm Finished test(python): pyspark.ml.recommendation (103s) Starting test(python): pyspark.mllib.linalg.__init__ Finished test(python): 
pyspark.mllib.linalg.__init__ (1s) Starting test(python): pyspark.mllib.linalg.distributed Finished test(python): pyspark.mllib.feature (26s) Starting test(python): pyspark.mllib.random Finished test(python): pyspark.mllib.fpm (23s) Starting test(python): pyspark.mllib.recommendation Finished test(python): pyspark.mllib.clustering (50s) Starting test(python): pyspark.mllib.regression Finished test(python): pyspark.mllib.random (13s) Starting test(python): pyspark.mllib.stat.KernelDensity Finished test(python): pyspark.mllib.stat.KernelDensity (1s) Starting test(python): pyspark.mllib.stat._statistics Finished test(python): pyspark.mllib.linalg.distributed (42s) Starting test(python): pyspark.mllib.tree Finished test(python): pyspark.mllib.stat._statistics (19s) Starting test(python): pyspark.mllib.util Finished test(python): pyspark.mllib.regression (33s) Starting test(python): pyspark.profiler Finished test(python): pyspark.mllib.recommendation (36s) Starting test(python): pyspark.rdd Finished test(python): pyspark.profiler (9s) Starting test(python): pyspark.resource.tests.test_resources Finished test(python): pyspark.mllib.tree (19s) Starting test(python): pyspark.serializers Finished test(python): pyspark.mllib.util (21s) Starting test(python): pyspark.shuffle Finished test(python): pyspark.resource.tests.test_resources (9s) Starting test(python): pyspark.sql.avro.functions Finished test(python): pyspark.shuffle (1s) Starting test(python): pyspark.sql.catalog Finished test(python): pyspark.rdd (22s) Starting test(python): pyspark.sql.column Finished test(python): pyspark.serializers (12s) Starting test(python): pyspark.sql.conf Finished test(python): pyspark.sql.conf (6s) Starting test(python): pyspark.sql.context Finished test(python): pyspark.sql.catalog (14s) Starting test(python): pyspark.sql.dataframe Finished test(python): pyspark.sql.avro.functions (15s) Starting test(python): pyspark.sql.functions Finished test(python): pyspark.sql.column (24s) Starting 
test(python): pyspark.sql.group Finished test(python): pyspark.sql.context (20s) Starting test(python): pyspark.sql.pandas.conversion Finished test(python): pyspark.sql.pandas.conversion (13s) Starting test(python): pyspark.sql.pandas.group_ops Finished test(python): pyspark.sql.group (36s) Starting test(python): pyspark.sql.pandas.map_ops Finished test(python): pyspark.sql.pandas.group_ops (21s) Starting test(python): pyspark.sql.pandas.serializers Finished test(python): pyspark.sql.pandas.serializers (0s) Starting test(python): pyspark.sql.pandas.typehints Finished test(python): pyspark.sql.pandas.typehints (0s) Starting test(python): pyspark.sql.pandas.types Finished test(python): pyspark.sql.pandas.types (0s) Starting test(python): pyspark.sql.pandas.utils Finished test(python): pyspark.sql.pandas.utils (0s) Starting test(python): pyspark.sql.readwriter Finished test(python): pyspark.sql.dataframe (56s) Starting test(python): pyspark.sql.session Finished test(python): pyspark.sql.functions (57s) Starting test(python): pyspark.sql.streaming Finished test(python): pyspark.sql.pandas.map_ops (12s) Starting test(python): pyspark.sql.types Finished test(python): pyspark.sql.types (10s) Starting test(python): pyspark.sql.udf Finished test(python): pyspark.sql.streaming (16s) Starting test(python): pyspark.sql.window Finished test(python): pyspark.sql.session (19s) Starting test(python): pyspark.streaming.util Finished test(python): pyspark.streaming.util (0s) Starting test(python): pyspark.util Finished test(python): pyspark.util (0s) Finished test(python): pyspark.sql.readwriter (24s) Finished test(python): pyspark.sql.udf (13s) Finished test(python): pyspark.sql.window (14s) Tests passed in 780 seconds ``` Closes #30277 from HyukjinKwon/SPARK-33371. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-11-06 15:05:37 -08:00
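The `Thread.isAlive` to `Thread.is_alive` change mentioned in SPARK-33371 above comes down to using the snake_case method, which is available on every Python 3 version the commit targets. A minimal sketch follows; the `worker` function is a placeholder introduced only for this example.

```python
import threading


def worker():
    # Placeholder body; a real test thread would do some work here.
    pass


thread = threading.Thread(target=worker)
thread.start()
thread.join()

# `is_alive` works before and after Python 3.9; `isAlive` does not exist in 3.9+.
still_running = thread.is_alive()
```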
HyukjinKwon	d530ed0ea8	Revert "[SPARK-33277][PYSPARK][SQL] Use ContextAwareIterator to stop consuming after the task ends" This reverts commit `b8a440f098`.	2020-11-05 16:15:17 +09:00
zero323	4c8ee8856c	[SPARK-33257][PYTHON][SQL] Support Column inputs in PySpark ordering functions (asc, desc) ### What changes were proposed in this pull request? This PR adds support for passing `Column`s as input to PySpark sorting functions. ### Why are the changes needed? According to SPARK-26979, PySpark functions should support both Column and str arguments, when possible. ### Does this PR introduce _any_ user-facing change? PySpark users can now provide both `Column` and `str` as an argument for `asc` and `desc` functions. ### How was this patch tested? New unit tests. Closes #30227 from zero323/SPARK-33257. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-03 22:50:59 +09:00
HyukjinKwon	3959f0d987	[SPARK-33250][PYTHON][DOCS] Migration to NumPy documentation style in SQL (pyspark.sql.*) ### What changes were proposed in this pull request? This PR proposes to migrate to [NumPy documentation style](https://numpydoc.readthedocs.io/en/latest/format.html), see also SPARK-33243. While I am migrating, I also fixed some Python type hints accordingly. ### Why are the changes needed? For better documentation as text itself, and generated HTMLs ### Does this PR introduce _any_ user-facing change? Yes, they will see a better format of HTMLs, and better text format. See SPARK-33243. ### How was this patch tested? Manually tested via running `./dev/lint-python`. Closes #30181 from HyukjinKwon/SPARK-33250. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-03 10:00:49 +09:00
Max Gekk	bdabf60fb4	[SPARK-33299][SQL][DOCS] Don't mention schemas in JSON format in docs for `from_json` ### What changes were proposed in this pull request? Remove the JSON formatted schema from comments for `from_json()` in Scala/Python APIs. Closes #30201 ### Why are the changes needed? Schemas in JSON format are internal (not documented), so they shouldn't be recommended for usage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By linters. Closes #30226 from MaxGekk/from_json-common-schema-parsing-2. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-11-02 10:10:24 -08:00
Takuya UESHIN	b8a440f098	[SPARK-33277][PYSPARK][SQL] Use ContextAwareIterator to stop consuming after the task ends ### What changes were proposed in this pull request? As the Python evaluation consumes the parent iterator in a separate thread, it could consume more data from the parent even after the task ends and the parent is closed. Thus, we should use `ContextAwareIterator` to stop consuming after the task ends. ### Why are the changes needed? Python/Pandas UDF right after off-heap vectorized reader could cause executor crash. E.g.,: ```py spark.range(0, 100000, 1, 1).write.parquet(path) spark.conf.set("spark.sql.columnVector.offheap.enabled", True) def f(x): return 0 fUdf = udf(f, LongType()) spark.read.parquet(path).select(fUdf('id')).head() ``` This is because, the Python evaluation consumes the parent iterator in a separate thread and it consumes more data from the parent even after the task ends and the parent is closed. If an off-heap column vector exists in the parent iterator, it could cause segmentation fault which crashes the executor. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added tests, and manually. Closes #30177 from ueshin/issues/SPARK-33277/python_pandas_udf. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-01 20:28:12 +09:00
Daniel Himmelstein	56587f076d	[SPARK-33310][PYTHON] Relax pyspark typing for sql str functions ### What changes were proposed in this pull request? Relax pyspark typing for sql str functions. These functions all pass the first argument through `_to_java_column`, such that a string or Column object is acceptable. ### Why are the changes needed? Convenience & ensuring the typing reflects the functionality ### Does this PR introduce _any_ user-facing change? Yes, a backwards-compatible increase in functionality. But I think typing support is unreleased, so possibly no change to released versions. ### How was this patch tested? Not tested. I am newish to Python typing with stubs, so someone should confirm this is the correct way to fix this. Closes #30209 from dhimmel/patch-1. Authored-by: Daniel Himmelstein <daniel.himmelstein@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-01 19:09:12 +09:00
Max Gekk	b409025641	[SPARK-33281][SQL] Return SQL schema instead of Catalog string from the `SchemaOfCsv` expression ### What changes were proposed in this pull request? Return schema in SQL format instead of Catalog string from the SchemaOfCsv expression. ### Why are the changes needed? To unify output of the `schema_of_json()` and `schema_of_csv()`. ### Does this PR introduce _any_ user-facing change? Yes, but `schema_of_csv()` is usually used in combination with `from_csv()`, so the format of the schema shouldn't matter much. Before: ``` > SELECT schema_of_csv('1,abc'); struct<_c0:int,_c1:string> ``` After: ``` > SELECT schema_of_csv('1,abc'); STRUCT<`_c0`: INT, `_c1`: STRING> ``` ### How was this patch tested? By existing test suites `CsvFunctionsSuite` and `CsvExpressionsSuite`. Closes #30180 from MaxGekk/schema_of_csv-sql-schema. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-29 21:02:10 +09:00
Max Gekk	9d5e48ea95	[SPARK-33270][SQL] Return SQL schema instead of Catalog string from the `SchemaOfJson` expression ### What changes were proposed in this pull request? Return schema in SQL format instead of Catalog string from the `SchemaOfJson` expression. ### Why are the changes needed? In some cases, `from_json()` cannot parse schemas returned by `schema_of_json`, for instance, when JSON fields have spaces (gaps). Such fields will be quoted after the changes, and can be parsed by `from_json()`. Here is the example: ```scala val in = Seq("""{"a b": 1}""").toDS() in.select(from_json('value, schema_of_json("""{"a b": 100}""")) as "parsed") ``` raises the exception: ``` == SQL == struct<a b:bigint> ------^^^ at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:263) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:130) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseTableSchema(ParseDriver.scala:76) at org.apache.spark.sql.types.DataType$.fromDDL(DataType.scala:131) at org.apache.spark.sql.catalyst.expressions.ExprUtils$.evalTypeExpr(ExprUtils.scala:33) at org.apache.spark.sql.catalyst.expressions.JsonToStructs.<init>(jsonExpressions.scala:537) at org.apache.spark.sql.functions$.from_json(functions.scala:4141) ``` ### Does this PR introduce _any_ user-facing change? Yes. For example, `schema_of_json` for the input `{"col":0}`. Before: `struct<col:bigint>` After: `STRUCT<`col`: BIGINT>` ### How was this patch tested? By existing test suites `JsonFunctionsSuite` and `JsonExpressionsSuite`. Closes #30172 from MaxGekk/schema_of_json-sql-schema. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-29 10:30:41 +09:00
Takeshi Yamamuro	a6216e2446	[SPARK-33268][SQL][PYTHON] Fix bugs for casting data from/to PythonUserDefinedType ### What changes were proposed in this pull request? This PR intends to fix bugs for casting data from/to PythonUserDefinedType. A sequence of queries to reproduce this issue is as follows: ``` >>> from pyspark.sql import Row >>> from pyspark.sql.functions import col >>> from pyspark.sql.types import * >>> from pyspark.testing.sqlutils import * >>> >>> row = Row(point=ExamplePoint(1.0, 2.0)) >>> df = spark.createDataFrame([row]) >>> df.select(col("point").cast(PythonOnlyUDT())) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/maropu/Repositories/spark/spark-master/python/pyspark/sql/dataframe.py", line 1402, in select jdf = self._jdf.select(self._jcols(cols)) File "/Users/maropu/Repositories/spark/spark-master/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__ File "/Users/maropu/Repositories/spark/spark-master/python/pyspark/sql/utils.py", line 111, in deco return f(a, **kw) File "/Users/maropu/Repositories/spark/spark-master/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o44.select. : java.lang.NullPointerException at org.apache.spark.sql.types.UserDefinedType.acceptsType(UserDefinedType.scala:84) at org.apache.spark.sql.catalyst.expressions.Cast$.canCast(Cast.scala:96) at org.apache.spark.sql.catalyst.expressions.CastBase.checkInputDataTypes(Cast.scala:267) at org.apache.spark.sql.catalyst.expressions.CastBase.resolved$lzycompute(Cast.scala:290) at org.apache.spark.sql.catalyst.expressions.CastBase.resolved(Cast.scala:290) ``` The root cause of this issue is that, since `PythonUserDefinedType#userClass` is always null, `isAssignableFrom` in `UserDefinedType#acceptsType` throws a `NullPointerException`. To fix it, this PR defines `acceptsType` in `PythonUserDefinedType` and filters out the null case in `UserDefinedType#acceptsType`. ### Why are the changes needed? Bug fixes. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added tests. Closes #30169 from maropu/FixPythonUDTCast. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-28 08:33:02 -07:00
HyukjinKwon	9818f079aa	[SPARK-33243][PYTHON][BUILD] Add numpydoc into documentation dependency ### What changes were proposed in this pull request? This PR proposes to initiate the migration to NumPy documentation style (from reST style) in PySpark docstrings. This PR also adds one migration example of `SparkContext`. - Before: ... ![Screen Shot 2020-10-26 at 7 02 05 PM](https://user-images.githubusercontent.com/6477701/97161090-a8ea0200-17c0-11eb-8204-0e70d18fc571.png) ... ![Screen Shot 2020-10-26 at 7 02 09 PM](https://user-images.githubusercontent.com/6477701/97161100-aab3c580-17c0-11eb-92ad-f5ad4441ce16.png) ... - After: ... ![Screen Shot 2020-10-26 at 7 24 08 PM](https://user-images.githubusercontent.com/6477701/97161219-d636b000-17c0-11eb-80ab-d17a570ecb4b.png) ... See also https://numpydoc.readthedocs.io/en/latest/format.html ### Why are the changes needed? There are many reasons for switching to NumPy documentation style. 1. Arguably reST style doesn't fit well when the docstring grows large because it provides (arguably) less structures and syntax. 2. NumPy documentation style provides a better human readable docstring format. For example, notebook users often just do `help(...)` by `pydoc`. 3. NumPy documentation style is pretty commonly used in data science libraries, for example, pandas, numpy, Dask, Koalas, matplotlib, ... Using NumPy documentation style can give users a consistent documentation style. ### Does this PR introduce _any_ user-facing change? The dependency itself doesn't change anything user-facing. The documentation change in `SparkContext` does, as shown above. ### How was this patch tested? Manually tested via running `cd python` and `make clean html`. Closes #30149 from HyukjinKwon/SPARK-33243. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-27 14:03:57 +09:00
zero323	4e6a310f80	[SPARK-32084][PYTHON][SQL] Expand dictionary functions ### What changes were proposed in this pull request? - [x] Expand dictionary definitions into standalone functions. - [x] Fix annotations for ordering functions. ### Why are the changes needed? To simplify further maintenance of docstrings. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #30143 from zero323/SPARK-32084. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-27 11:05:53 +09:00
HyukjinKwon	7cdc921bc0	[SPARK-32188][PYTHON][DOCS][FOLLOW-UP] Document Column APIs in API reference ### What changes were proposed in this pull request? This PR proposes to document the APIs in `Column` as well in API reference of PySpark documentation. ### Why are the changes needed? To document common APIs in PySpark. ### Does this PR introduce _any_ user-facing change? Yes, `Column.*` will be shown in API reference page. ### How was this patch tested? Manually tested via `cd python` and `make clean html`. Closes #30150 from HyukjinKwon/SPARK-32188. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-27 09:52:09 +09:00
zero323	d7f15b025b	[SPARK-33003][PYTHON][DOCS] Add type hints guidelines to the documentation ### What changes were proposed in this pull request? Add type hints guidelines to developer docs. ### Why are the changes needed? Since it is a new and still somewhat evolving feature, we should provide clear guidelines for potential contributors. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Closes #30094 from zero323/SPARK-33003. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-24 10:00:04 +09:00
Alessandro Patti	4a33cd928d	[SPARK-33203][PYTHON][TEST] Fix tests failing with rounding errors ### What changes were proposed in this pull request? Increase tolerance for two tests that pass in some environments and fail in others (flaky? Pass/fail is constant within the same environment) ### Why are the changes needed? The tests `pyspark.ml.recommendation` and `pyspark.ml.tests.test_algorithms` fail with ``` File "/home/jenkins/python/pyspark/ml/tests/test_algorithms.py", line 96, in test_raw_and_probability_prediction self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, atol=1)) AssertionError: False is not true ``` ``` File "/home/jenkins/python/pyspark/ml/recommendation.py", line 256, in _main_.ALS Failed example: predictions[0] Expected: Row(user=0, item=2, newPrediction=0.6929101347923279) Got: Row(user=0, item=2, newPrediction=0.6929104924201965) ... ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? This patch changes a test target. Just executed the tests to verify they pass. Closes #30104 from AlessandroPatti/apatti/rounding-errors. Authored-by: Alessandro Patti <ale812@yahoo.it> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-10-21 18:14:21 -07:00
HyukjinKwon	66005a3236	[SPARK-31964][PYTHON][FOLLOW-UP] Use is_categorical_dtype instead of deprecated is_categorical ### What changes were proposed in this pull request? This PR is a small followup of https://github.com/apache/spark/pull/28793 and proposes to use `is_categorical_dtype` instead of deprecated `is_categorical`. `is_categorical_dtype` exists from minimum pandas version we support (https://github.com/pandas-dev/pandas/blob/v0.23.2/pandas/core/dtypes/api.py), and `is_categorical` was deprecated from pandas 1.1.0 (`87a1cc21ca`). ### Why are the changes needed? To avoid using deprecated APIs, and remove warnings. ### Does this PR introduce _any_ user-facing change? Yes, it will remove warnings that says `is_categorical` is deprecated. ### How was this patch tested? By running any pandas UDF with pandas 1.1.0+: ```python import pandas as pd from pyspark.sql.functions import pandas_udf def func(x: pd.Series) -> pd.Series: return x spark.range(10).select(pandas_udf(func, "long")("id")).show() ``` Before: ``` /.../python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py:151: FutureWarning: is_categorical is deprecated and will be removed in a future version. Use is_categorical_dtype instead ... ``` After: ``` ... ``` Closes #30114 from HyukjinKwon/replace-deprecated-is_categorical. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Bryan Cutler <cutlerb@gmail.com>	2020-10-21 14:46:47 -07:00
Bryan Cutler	47a6568265	[SPARK-33189][PYTHON][TESTS] Add env var to tests for legacy nested timestamps in pyarrow ### What changes were proposed in this pull request? Add an environment variable `PYARROW_IGNORE_TIMEZONE` to pyspark tests in run-tests.py to use legacy nested timestamp behavior. This means that when converting arrow to pandas, nested timestamps with timezones will have the timezone localized during conversion. ### Why are the changes needed? The default behavior was changed in PyArrow 2.0.0 to propagate timezone information. Using the environment variable enables testing with newer versions of pyarrow until the issue can be fixed in SPARK-32285. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests Closes #30111 from BryanCutler/arrow-enable-legacy-nested-timestamps-SPARK-33189. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-21 09:13:33 +09:00
xuewei.linxuewei	388e067a90	[SPARK-33139][SQL][FOLLOW-UP] Avoid using reflect call on session.py ### What changes were proposed in this pull request? In [SPARK-33139](https://github.com/apache/spark/pull/30042), I used the reflection call "Class.forName" in Python code to invoke a method in SparkSession, which is not recommended. This PR uses getattr to access "SparkSession$.Module$" instead. ### Why are the changes needed? Code refinement. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests. Closes #30092 from leanken/leanken-SPARK-33139-followup. Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-19 16:40:48 +09:00
xuewei.linxuewei	306872eefa	[SPARK-33139][SQL] protect setActiveSession and clearActiveSession ### What changes were proposed in this pull request? This PR is a sub-task of [SPARK-33138](https://issues.apache.org/jira/browse/SPARK-33138). In order to make SQLConf.get reliable and stable, we need to make sure users can't pollute the SQLConf and SparkSession Context by calling setActiveSession and clearActiveSession. Changes in this PR: * add legacy config spark.sql.legacy.allowModifyActiveSession to fall back to the old behavior if users do need to call these two APIs. * by default, calling these two APIs throws an exception * add two extra internal and private APIs, setActiveSessionInternal and clearActiveSessionInternal, for current internal usage * change all internal references to the new internal APIs except for SQLContext.setActive and SQLContext.clearActive ### Why are the changes needed? Make SQLConf.get reliable and stable. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? * Add UT in SparkSessionBuilderSuite to test the legacy config * Existing test Closes #30042 from leanken/leanken-SPARK-33139. Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-10-16 06:05:17 +00:00
Chuliang Xiao	81d3a8eeca	[MINOR][PYTHON] Fix the typo in the docstring of method agg() ### What changes were proposed in this pull request? Change `df.groupBy.agg()` to `df.groupBy().agg()` in the docstring of `agg()` ### Why are the changes needed? Fix typo in a docstring ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? No Closes #30060 from ChuliangXiao/patch-1. Authored-by: Chuliang Xiao <ChuliangX@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-10-15 17:24:22 -07:00

2923 commits