### What changes were proposed in this pull request?
Fix dropping all columns of a DataFrame
### Why are the changes needed?
When dropping all columns of a pandas-on-Spark DataFrame, a ValueError is raised.
Whereas in pandas, an empty DataFrame preserving the index is returned.
We should follow pandas.
### Does this PR introduce _any_ user-facing change?
Yes.
From
```py
>>> psdf = ps.DataFrame({"x": [1, 2], "y": [3, 4], "z": [5, 6]})
>>> psdf
x y z
0 1 3 5
1 2 4 6
>>> psdf.drop(['x', 'y', 'z'])
Traceback (most recent call last):
...
ValueError: not enough values to unpack (expected 2, got 0)
```
To
```py
>>> psdf = ps.DataFrame({"x": [1, 2], "y": [3, 4], "z": [5, 6]})
>>> psdf
x y z
0 1 3 5
1 2 4 6
>>> psdf.drop(['x', 'y', 'z'])
Empty DataFrame
Columns: []
Index: [0, 1]
```
### How was this patch tested?
Unit tests.
Closes #33938 from xinrong-databricks/frame_drop_col.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 33bb7b39e9)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes improving test coverage for the pandas-on-Spark data types & GroupBy code base, which is implemented in `data_type_ops/*.py` and `groupby.py`, respectively.
This PR did the following to improve coverage:
- Add unittest for untested code
- Fix unittests that were not testing properly
- Remove unused code
**NOTE**: This PR does not only include test-only updates; for example, it also fixes `astype` for binary ops.
Given the following pandas-on-Spark Series:
```python
>>> psser
0 [49]
1 [50]
2 [51]
dtype: object
```
before:
```python
>>> psser.astype(bool)
Traceback (most recent call last):
...
pyspark.sql.utils.AnalysisException: cannot resolve 'CAST(`0` AS BOOLEAN)' due to data type mismatch: cannot cast binary to boolean;
...
```
after:
```python
>>> psser.astype(bool)
0 True
1 True
2 True
dtype: bool
```
### Why are the changes needed?
To make the project healthier by improving coverage.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unittest.
Closes #33850 from itholic/SPARK-36531.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 71dbd03fbe)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Update both `DataFrame.approxQuantile` and `DataFrameStatFunctions.approxQuantile` to support overloaded definitions when multiple columns are supplied.
### Why are the changes needed?
The current type hints don't support the multi-column signature, a form that was added in Spark 2.2 (see [the approxQuantile docs](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.approxQuantile.html)). This change was also introduced to pyspark-stubs (https://github.com/zero323/pyspark-stubs/pull/552). zero323 asked me to open a PR for the upstream change.
### Does this PR introduce _any_ user-facing change?
This change only affects type hints - it brings the `approxQuantile` type hints up to date with the actual code.
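A minimal, self-contained sketch of the two overloaded signatures being typed (simplified; the actual stubs live in pyspark and may differ, and the runtime body here only illustrates the return shapes):

```python
from typing import List, Union, overload

@overload
def approxQuantile(col: str, probabilities: List[float], relativeError: float) -> List[float]: ...
@overload
def approxQuantile(col: List[str], probabilities: List[float], relativeError: float) -> List[List[float]]: ...

def approxQuantile(col: Union[str, List[str]], probabilities: List[float], relativeError: float):
    # Illustrative runtime shape only: a single column yields a flat list,
    # multiple columns yield one list of quantiles per column.
    if isinstance(col, str):
        return [0.0 for _ in probabilities]
    return [[0.0 for _ in probabilities] for _ in col]
```

With the overloads in place, `mypy` can infer `List[float]` for the single-column call and `List[List[float]]` for the multi-column call.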
### How was this patch tested?
Ran `./dev/lint-python`.
Closes #33880 from carylee/master.
Authored-by: Cary Lee <cary@amperity.com>
Signed-off-by: zero323 <mszymkiewicz@gmail.com>
(cherry picked from commit 37f5ab07fa)
Signed-off-by: zero323 <mszymkiewicz@gmail.com>
### What changes were proposed in this pull request?
This PR is a follow-up for https://github.com/apache/spark/pull/33646 to add missing tests.
### Why are the changes needed?
Some tests were missing.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unittest.
Closes #33776 from itholic/SPARK-36388-followup.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit c91ae544fd)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Change API doc for `UnivariateFeatureSelector`
### Why are the changes needed?
Make the doc look better.
### Does this PR introduce _any_ user-facing change?
Yes, API doc change.
### How was this patch tested?
Manually checked
Closes #33855 from huaxingao/ml_doc.
Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 15e42b4442)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR proposes to enable the tests that were disabled due to different behavior in pandas 1.3.
- The `inplace` argument for `CategoricalDtype` functions is deprecated from pandas 1.3, and it seems they have a bug, so we manually create the expected results and test against them.
- Fixed `GroupBy.transform` since it doesn't work properly for `CategoricalDtype`.
### Why are the changes needed?
We should enable the tests as much as possible even if pandas has a bug, and we should follow the behavior of the latest pandas.
### Does this PR introduce _any_ user-facing change?
Yes, `GroupBy.transform` now follows the behavior of the latest pandas.
### How was this patch tested?
Unittests.
Closes #33817 from itholic/SPARK-36537.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit fe486185c4)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to fix the behavior of `astype` for `CategoricalDtype` to follow pandas 1.3.
**Before:**
```python
>>> pcat
0 a
1 b
2 c
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> pcat.astype(CategoricalDtype(["b", "c", "a"]))
0 a
1 b
2 c
dtype: category
Categories (3, object): ['b', 'c', 'a']
```
**After:**
```python
>>> pcat
0 a
1 b
2 c
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> pcat.astype(CategoricalDtype(["b", "c", "a"]))
0 a
1 b
2 c
dtype: category
Categories (3, object): ['a', 'b', 'c'] # CategoricalDtype is not updated if dtype is the same
```
`CategoricalDtype`s are treated as the same `dtype` if their unique values are the same.
```python
>>> pcat1 = pser.astype(CategoricalDtype(["b", "c", "a"]))
>>> pcat2 = pser.astype(CategoricalDtype(["a", "b", "c"]))
>>> pcat1.dtype == pcat2.dtype
True
```
### Why are the changes needed?
We should follow the latest pandas as much as possible.
### Does this PR introduce _any_ user-facing change?
Yes, the behavior is changed as shown in the examples above.
### How was this patch tested?
Unittest.
Closes #33757 from itholic/SPARK-36368.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit f2e593bcf1)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to fix `Series.astype` when converting datetime type to StringDtype, to match the behavior of pandas 1.3.
In pandas < 1.3,
```python
>>> pd.Series(["2020-10-27 00:00:01", None], name="datetime").astype("string")
0 2020-10-27 00:00:01
1 NaT
Name: datetime, dtype: string
```
This is changed to
```python
>>> pd.Series(["2020-10-27 00:00:01", None], name="datetime").astype("string")
0 2020-10-27 00:00:01
1 <NA>
Name: datetime, dtype: string
```
in pandas >= 1.3, so we follow the behavior of the latest pandas.
### Why are the changes needed?
Because pandas-on-Spark always follows the behavior of the latest pandas.
### Does this PR introduce _any_ user-facing change?
Yes, the behavior is changed to that of the latest pandas when converting datetime to nullable string (StringDtype).
### How was this patch tested?
Unittest passed.
Closes #33735 from itholic/SPARK-36387.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit c0441bb7e8)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to fix `RollingGroupBy` and `ExpandingGroupBy` to follow the latest pandas behavior.
From pandas 1.3, `RollingGroupBy` and `ExpandingGroupBy` no longer return the grouped-by column in the values.
Before:
```python
>>> df = pd.DataFrame({"A": [1, 1, 2, 3], "B": [0, 1, 2, 3]})
>>> df.groupby("A").rolling(2).sum()
A B
A
1 0 NaN NaN
1 2.0 1.0
2 2 NaN NaN
3 3 NaN NaN
```
After:
```python
>>> df = pd.DataFrame({"A": [1, 1, 2, 3], "B": [0, 1, 2, 3]})
>>> df.groupby("A").rolling(2).sum()
B
A
1 0 NaN
1 1.0
2 2 NaN
3 3 NaN
```
### Why are the changes needed?
We should follow the behavior of pandas as much as possible.
### Does this PR introduce _any_ user-facing change?
Yes, the result of `RollingGroupBy` and `ExpandingGroupBy` is changed as described above.
### How was this patch tested?
Unit tests.
Closes #33646 from itholic/SPARK-36388.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit b8508f4876)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Disable tests that fail due to the incompatible behavior of pandas 1.3.
### Why are the changes needed?
pandas 1.3 has been released. There are some behavior changes we should follow, but we are not ready yet.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Disabled some tests related to the behavior change.
Closes #33598 from ueshin/issues/SPARK-36367/disable_tests.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 8cb9cf39b6)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes improving test coverage for pandas-on-Spark DataFrame code base, which is written in `frame.py`.
This PR did the following to improve coverage:
- Add unittest for untested code
- Remove unused code
- Add arguments to some functions for testing
### Why are the changes needed?
To make the project healthier by improving coverage.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unittest.
Closes #33833 from itholic/SPARK-36505.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 97e7d6e667)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This fixed Python linter failure.
### Why are the changes needed?
```
flake8 checks failed:
./python/pyspark/ml/tests/test_tuning.py:21:1: F401 'numpy as np' imported but unused
import numpy as np
F401 'numpy as np' imported but unused
Error: Process completed with exit code 1.
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the GitHub Action Linter job.
Closes #33841 from dongjoon-hyun/unused_import.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to move distributed-sequence index implementation to SQL plan to leverage optimizations such as column pruning.
```python
import pyspark.pandas as ps
ps.set_option('compute.default_index_type', 'distributed-sequence')
ps.range(10).id.value_counts().to_frame().spark.explain()
```
**Before:**
```bash
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#51L DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(count#51L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#70]
+- HashAggregate(keys=[id#37L], functions=[count(1)], output=[__index_level_0__#48L, count#51L])
+- Exchange hashpartitioning(id#37L, 200), ENSURE_REQUIREMENTS, [id=#67]
+- HashAggregate(keys=[id#37L], functions=[partial_count(1)], output=[id#37L, count#63L])
+- Project [id#37L]
+- Filter atleastnnonnulls(1, id#37L)
+- Scan ExistingRDD[__index_level_0__#36L,id#37L]
# ^^^ Base DataFrame created by the output RDD from zipWithIndex (and checkpointed)
```
**After:**
```bash
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#275L DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(count#275L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#174]
+- HashAggregate(keys=[id#258L], functions=[count(1)])
+- HashAggregate(keys=[id#258L], functions=[partial_count(1)])
+- Filter atleastnnonnulls(1, id#258L)
+- Range (0, 10, step=1, splits=16)
# ^^^ Removed the Spark job execution for `zipWithIndex`
```
### Why are the changes needed?
To leverage optimization of SQL engine and avoid unnecessary shuffle to create default index.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unittests were added. Also, this PR will test all unittests in pandas API on Spark after switching the default index implementation to `distributed-sequence`.
Closes #33807 from HyukjinKwon/SPARK-36559.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 93cec49212)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Revert 397b843890 and 5a48eb8d00
### Why are the changes needed?
As discussed in https://github.com/apache/spark/pull/33800#issuecomment-904140869, there is a correctness issue in the current implementation. Let's revert the code changes from branch 3.2 and fix it on the master branch later.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI tests.
Closes #33819 from gengliangwang/revert-SPARK-34415.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit de932f51ce)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This is a follow-up of #33687.
Use `LooseVersion` instead of `pkg_resources.parse_version`.
### Why are the changes needed?
In the previous PR, `pkg_resources.parse_version` was used, but we should use `LooseVersion` instead to be consistent in the code base.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #33768 from ueshin/issues/SPARK-36370/LooseVersion.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 7fb8ea319e)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR refactors the way `_builtin_table` is defined in the `python/pyspark/pandas/groupby.py` module.
pandas has recently refactored the way `_builtin_table` is imported: it is now part of the `pandas.core.common` module instead of being an attribute of the `pandas.core.base.SelectionMixin` class.
### Why are the changes needed?
This change is not strictly needed, but the current implementation redefines this table within PySpark, so any changes to this table in the pandas library would need to be mirrored in the PySpark repository as well.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Ran the following command successfully:
```sh
python/run-tests --testnames 'pyspark.pandas.tests.test_groupby'
```
Tests passed in 327 seconds
Closes #33687 from Cedric-Magnan/_builtin_table_from_pandas.
Authored-by: Cedric-Magnan <cedric.magnan@artefact.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 964dfe254f)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Implement `Index.map`.
The PR is based on https://github.com/databricks/koalas/pull/2136. Thanks awdavidson for the prototype.
`map` of CategoricalIndex and DatetimeIndex will be implemented in separate PRs.
### Why are the changes needed?
Mapping values using input correspondence (a dict, Series, or function) is supported in pandas as [Index.map](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.map.html).
We shall also support that.
### Does this PR introduce _any_ user-facing change?
Yes. `Index.map` is available now.
```py
>>> psidx = ps.Index([1, 2, 3])
>>> psidx.map({1: "one", 2: "two", 3: "three"})
Index(['one', 'two', 'three'], dtype='object')
>>> psidx.map(lambda id: "{id} + 1".format(id=id))
Index(['1 + 1', '2 + 1', '3 + 1'], dtype='object')
>>> pser = pd.Series(["one", "two", "three"], index=[1, 2, 3])
>>> psidx.map(pser)
Index(['one', 'two', 'three'], dtype='object')
```
### How was this patch tested?
Unit tests.
Closes #33694 from xinrong-databricks/index_map.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 4dcd746025)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
This patch supports dynamic gap duration in session window.
### Why are the changes needed?
The gap duration used in session window is currently a static value. To support more complex usage, it is better to support a dynamic gap duration that determines the gap by looking at the current data. For example, in our use case, we may want a different gap depending on a certain column in the input rows.
### Does this PR introduce _any_ user-facing change?
Yes, users can specify dynamic gap duration.
### How was this patch tested?
Modified existing tests and new test.
Closes #33691 from viirya/dynamic-session-window-gap.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(cherry picked from commit 8b8d91cf64)
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
This PR is a follow-up of https://github.com/apache/spark/pull/32964 to improve the warning message.
### Why are the changes needed?
To improve the warning message.
### Does this PR introduce _any_ user-facing change?
The warning is changed from "Deprecated in 3.2, Use `spark.to_spark_io` instead." to "Deprecated in 3.2, Use `DataFrame.spark.to_spark_io` instead."
### How was this patch tested?
Manually run `dev/lint-python`
Closes #33631 from itholic/SPARK-35811-followup.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 3d72c20e64)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Better error messages for DataTypeOps against lists.
### Why are the changes needed?
Currently, DataTypeOps against lists throw a Py4JJavaError; we should throw a TypeError with a proper message instead.
### Does this PR introduce _any_ user-facing change?
Yes. A TypeError message will be shown rather than a Py4JJavaError.
From:
```py
>>> import pyspark.pandas as ps
>>> ps.Series([1, 2, 3]) > [3, 2, 1]
Traceback (most recent call last):
...
py4j.protocol.Py4JJavaError: An error occurred while calling o107.gt.
: java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList [3, 2, 1]
...
```
To:
```py
>>> import pyspark.pandas as ps
>>> ps.Series([1, 2, 3]) > [3, 2, 1]
Traceback (most recent call last):
...
TypeError: The operation can not be applied to list.
```
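A minimal sketch of the kind of guard involved (the function name is illustrative, not the actual pyspark source; only the error message matches the PR):

```python
def compare_gt(left, right):
    # Illustrative: reject list operands up front with a clear TypeError
    # instead of letting the failure surface later as a Py4JJavaError.
    if isinstance(right, list):
        raise TypeError("The operation can not be applied to list.")
    return left > right
```

Checking the operand type eagerly on the Python side keeps the error local and readable, rather than deferring to an unsupported-literal failure inside the JVM.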
### How was this patch tested?
Unit tests.
Closes #33581 from xinrong-databricks/data_type_ops_list.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 8ca11fe39f)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Partially backports #33598 to avoid unexpected errors caused by pandas 1.3.
### Why are the changes needed?
If users try to use pandas 1.3 as the underlying pandas, it will raise unexpected errors caused by removed APIs or behavior changes.
Note that pandas API on Spark 3.2 will still follow the pandas 1.2 behavior.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #33614 from ueshin/issues/SPARK-36367/3.2/partially_backport.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Change the `NullType.simpleString` to "void" to set "void" as the formal type name of `NullType`
### Why are the changes needed?
This PR is intended to address the type name discussion in PR #28833. Here are the reasons:
1. The type name of NullType is displayed everywhere, e.g. schema string, error message, document. Hence it's not possible to hide it from users; we have to choose a proper name
2. The "void" is widely used as the type name of "NULL", e.g. Hive, pgSQL
3. Changing to "void" can enable the round trip of `toDDL`/`fromDDL` for NullType (i.e. make `from_json(col, schema.toDDL)` work)
### Does this PR introduce _any_ user-facing change?
Yes, the type name of "NULL" is changed from "null" to "void". For example:
```
scala> sql("select null as a, 1 as b").schema.catalogString
res5: String = struct<a:void,b:int>
```
### How was this patch tested?
existing test cases
Closes #33437 from linhongliu-db/SPARK-36224-void-type-name.
Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 2f700773c2)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This patch set value to `<NA>` (pd.NA) in BooleanExtensionOps and StringExtensionOps.
### Why are the changes needed?
The pandas behavior:
```python
>>> pd.Series([True, False, None], dtype="boolean").astype(str).tolist()
['True', 'False', '<NA>']
>>> pd.Series(['s1', 's2', None], dtype="string").astype(str).tolist()
['s1', 's2', '<NA>']
```
In pandas-on-Spark:
```python
>>> import pandas as pd
>>> from pyspark import pandas as ps
# Before
>>> ps.from_pandas(pd.Series([True, False, None], dtype="boolean")).astype(str).tolist()
['True', 'False', 'None']
>>> ps.from_pandas(pd.Series(['s1', 's2', None], dtype="string")).astype(str).tolist()
['s1', 's2', 'None']
# After
>>> ps.from_pandas(pd.Series([True, False, None], dtype="boolean")).astype(str).tolist()
['True', 'False', '<NA>']
>>> ps.from_pandas(pd.Series(['s1', 's2', None], dtype="string")).astype(str).tolist()
['s1', 's2', '<NA>']
```
See more in [SPARK-35976](https://issues.apache.org/jira/browse/SPARK-35976)
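A tiny pure-Python analogue of the change (illustrative; the real implementation operates on Spark columns): missing values render as the string `<NA>` instead of `None` during the string cast.

```python
def extension_to_str(values):
    # Illustrative: missing entries become "<NA>", matching how pandas'
    # nullable extension dtypes stringify pd.NA; others use plain str().
    return ["<NA>" if v is None else str(v) for v in values]
```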
### Does this PR introduce _any_ user-facing change?
Yes, `<NA>` is returned for missing values, following the pandas behavior.
### How was this patch tested?
Change the ut to cover this scenario.
Closes #33585 from Yikun/SPARK-35976.
Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit f04e991e6a)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Remove old workarounds related to null ordering.
### Why are the changes needed?
In pandas-on-Spark, there are still some remaining places to call `Column._jc.(asc|desc)_nulls_(first|last)` as a workaround from Koalas to support Spark 2.3.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Modified a couple of tests and existing tests.
Closes #33597 from ueshin/issues/SPARK-36365/nulls_first_last.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 90d31dfcb7)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR is a followup of https://github.com/apache/spark/pull/33570, which mistakenly changed the default value of the default index
### Why are the changes needed?
It was mistakenly changed to check whether the tests actually pass, but I forgot to change it back.
### Does this PR introduce _any_ user-facing change?
No, it's not released yet. It fixes the default value that was mistakenly changed.
(The changed default value makes the tests flaky because the ordering is affected by an extra shuffle.)
### How was this patch tested?
Manually tested.
Closes #33596 from HyukjinKwon/SPARK-36338-followup.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 74a6b9d23b)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Move some logic related to `F.nanvl` to `DataTypeOps`.
### Why are the changes needed?
There are several places to branch by `FloatType` or `DoubleType` to use `F.nanvl` but `DataTypeOps` should handle it.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #33582 from ueshin/issues/SPARK-36350/nan_to_null.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 895e3f5e2a)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to implement `distributed-sequence` index in Scala side.
### Why are the changes needed?
- Avoid unnecessary (de)serialization
- Keep the nullability in the input DataFrame when `distributed-sequence` is enabled. During serialization, all fields are currently made nullable (see https://github.com/apache/spark/pull/32775#discussion_r645882104)
### Does this PR introduce _any_ user-facing change?
No to end users since pandas API on Spark is not released yet.
```python
import pyspark.pandas as ps
ps.set_option('compute.default_index_type', 'distributed-sequence')
ps.range(1).spark.print_schema()
```
Before:
```
root
|-- id: long (nullable = true)
```
After:
```
root
|-- id: long (nullable = false)
```
### How was this patch tested?
Manually tested, and existing tests should cover them.
Closes #33570 from HyukjinKwon/SPARK-36338.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit c6140d4d0a)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR is a partial revert of https://github.com/apache/spark/pull/33567 that keeps the logic to skip mlflow related tests if that's not installed.
### Why are the changes needed?
It's consistent with other libraries, e.g) PyArrow.
It also fixes up the potential dev breakage (see also https://github.com/apache/spark/pull/33567#issuecomment-889841829)
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
This is a partial revert. CI should test it out too.
Closes #33589 from HyukjinKwon/SPARK-36254.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit dd2ca0aee2)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes adding the Python packages `mlflow` and `sklearn` to enable the MLflow test in pandas API on Spark.
### Why are the changes needed?
To enable the MLflow test in pandas API on Spark.
### Does this PR introduce _any_ user-facing change?
No, it's test-only
### How was this patch tested?
Manually tested locally with `python/run-tests --testnames pyspark.pandas.mlflow`.
Closes #33567 from itholic/SPARK-36254.
Lead-authored-by: itholic <haejoon.lee@databricks.com>
Co-authored-by: Haejoon Lee <44108233+itholic@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit abce61f3fd)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR is a follow-up for https://github.com/apache/spark/pull/33414 to support more options for the `mode` argument in all APIs that have a `mode` argument, not only `DataFrame.to_csv`.
### Why are the changes needed?
To keep usage consistent for arguments that have the same name.
### Does this PR introduce _any_ user-facing change?
More options are available for all APIs that have a `mode` argument, the same as `DataFrame.to_csv`.
### How was this patch tested?
Manually tested locally.
Closes #33569 from itholic/SPARK-35085-followup.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 94cb2bbbc2)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Reuse `IndexOpsMixin.isnull()` where the null check is needed.
### Why are the changes needed?
There are some places where we can reuse `IndexOpsMixin.isnull()` instead of directly using Spark `Column`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #33562 from ueshin/issues/SPARK-36333/reuse_isnull.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 07ed82be0b)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Improve the error message for wrong type when calling dropDuplicates in pyspark.
### Why are the changes needed?
The current error message is cryptic and can be unclear to less experienced users.
### Does this PR introduce _any_ user-facing change?
Yes, it adds a type error for when a user gives the wrong type to dropDuplicates
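A hedged sketch of the shape of the check (the standalone function and exact message here are illustrative, not the pyspark implementation):

```python
def drop_duplicates(subset=None):
    # Illustrative: validate the argument type early and raise a clear
    # TypeError, instead of failing later with a cryptic internal error.
    if subset is not None and not isinstance(subset, (list, tuple)):
        raise TypeError(
            "Parameter 'subset' must be a list of columns, got %s"
            % type(subset).__name__
        )
    return "deduplicated"
```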
### How was this patch tested?
There is currently no testing for error messages in pyspark dataframe functions
Closes #33364 from sammyjmoseley/sm/add-type-checking-for-drop-duplicates.
Lead-authored-by: Samuel Moseley <smoseley@palantir.com>
Co-authored-by: Sammy Moseley <moseley.sammy@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit a07df1acc6)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Improve the rest of DataTypeOps tests by avoiding joins.
### Why are the changes needed?
bool, string, numeric DataTypeOps tests have been improved by avoiding joins.
We should improve the rest of the DataTypeOps tests in the same way.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests.
Closes #33546 from xinrong-databricks/test_no_join.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 9c5cb99d6e)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Adjust `astype` of fractional Series with missing values to follow pandas.
Non-goal: Adjust the issue of `astype` of Decimal Series with missing values to follow pandas.
### Why are the changes needed?
`astype` of a fractional Series with missing values doesn't behave the same as in pandas; for example, a float Series returns itself when cast to integer with `astype`, while pandas raises a ValueError.
We ought to follow pandas.
### Does this PR introduce _any_ user-facing change?
Yes.
From:
```py
>>> import numpy as np
>>> import pyspark.pandas as ps
>>> psser = ps.Series([1, 2, np.nan])
>>> psser.astype(int)
0 1.0
1 2.0
2 NaN
dtype: float64
```
To:
```py
>>> import numpy as np
>>> import pyspark.pandas as ps
>>> psser = ps.Series([1, 2, np.nan])
>>> psser.astype(int)
Traceback (most recent call last):
...
ValueError: Cannot convert fractions with missing values to integer
```
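A pure-Python sketch of the validation (illustrative names; the real implementation checks a Spark column for nulls and NaNs rather than a Python list):

```python
import math

def cast_fraction_to_int(values):
    # Illustrative: refuse the cast when missing values are present,
    # matching the pandas ValueError shown above.
    if any(v is None or (isinstance(v, float) and math.isnan(v)) for v in values):
        raise ValueError("Cannot convert fractions with missing values to integer")
    return [int(v) for v in values]
```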
### How was this patch tested?
Unit tests.
Closes #33466 from xinrong-databricks/extension_astype.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 01213095e2)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Fix `Series`/`Index.copy()` to drop extra columns.
### Why are the changes needed?
Currently `Series`/`Index.copy()` keeps the copy of the anchor DataFrame which holds unnecessary columns.
We can drop those when `Series`/`Index.copy()`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #33549 from ueshin/issues/SPARK-36320/index_ops_copy.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 3c76a924ce)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Fix `IndexOpsMixin.hasnans` to use `IndexOpsMixin.isnull().any()`.
### Why are the changes needed?
`IndexOpsMixin.hasnans` has a potential issue of causing a `a window function inside an aggregate function` error.
It also returns a wrong value when the `Series`/`Index` is empty.
```py
>>> ps.Series([]).hasnans
None
```
whereas:
```py
>>> pd.Series([]).hasnans
False
```
`IndexOpsMixin.isnull().any()` is safe for both cases.
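A pure-Python analogue of why an `isnull().any()` formulation is safe on empty input (illustrative; the real implementation works on Spark columns, not Python lists):

```python
import math

def hasnans(values):
    # any() over an empty iterable is False, so an empty Series
    # correctly reports no NaNs instead of returning None.
    return any(v is None or (isinstance(v, float) and math.isnan(v)) for v in values)
```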
### Does this PR introduce _any_ user-facing change?
`IndexOpsMixin.hasnans` will return `False` when empty.
### How was this patch tested?
Added some tests.
Closes #33547 from ueshin/issues/SPARK-36310/hasnan.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit bcc595c112)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
The following code should type-check:
```python3
import uuid
import pyspark.sql.functions as F
my_udf = F.udf(lambda: str(uuid.uuid4())).asNondeterministic()
```
### What changes were proposed in this pull request?
The `udf` function should return a more specific type.
### Why are the changes needed?
Right now, `mypy` will throw spurious errors, such as for the code given above.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
This was not tested. Sorry, I am not very familiar with this repo -- are there any typing tests?
Closes #33399 from luranhe/patch-1.
Lead-authored-by: Luran He <luranjhe@gmail.com>
Co-authored-by: Luran He <luran.he@compass.com>
Signed-off-by: zero323 <mszymkiewicz@gmail.com>
(cherry picked from commit ede1bc6b51)
Signed-off-by: zero323 <mszymkiewicz@gmail.com>
### What changes were proposed in this pull request?
Clean up `CategoricalAccessor` and `CategoricalIndex`.
- Clean up the classes
- Add deprecation warnings
- Clean up the docs
### Why are the changes needed?
To finalize the series of PRs for `CategoricalAccessor` and `CategoricalIndex`, we should clean up the classes.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #33528 from ueshin/issues/SPARK-36267/cleanup.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit c40d9d46f1)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>