Commit graph

2913 commits

Huaxin Gao 15e42b4442 [SPARK-36578][ML] UnivariateFeatureSelector API doc improvement
### What changes were proposed in this pull request?
Change API doc for `UnivariateFeatureSelector`

### Why are the changes needed?
make the doc look better

### Does this PR introduce _any_ user-facing change?
yes, API doc change

### How was this patch tested?
Manually checked

Closes #33855 from huaxingao/ml_doc.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-08-26 21:16:49 -07:00
Leona Yoda aeb3da2798 [SPARK-36541][DOCS][PYTHON] Replace the word Koalas to pandas-on-Spark
### What changes were proposed in this pull request?

Replace images in the pandas-on-Spark documentation because those images use the word Koalas

### Why are the changes needed?

Images in the Transform and apply a function documentation still use the word Koalas, although the word was replaced with pandas-on-Spark by this PR:
https://github.com/apache/spark/pull/32835

I think we have to match the wording in those images

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

`make html`

Screen shots
![130179112-8485fdde-b422-4834-8b23-fe69e7402118](https://user-images.githubusercontent.com/14937752/130186051-d6ff65f0-c121-40bd-b4f1-2fbc10e76f3e.png)
![130179239-8dae7812-4d81-4f8c-8558-b75e4eae3787](https://user-images.githubusercontent.com/14937752/130186063-17d4a95f-0b9d-49d3-85c7-13ea07e4b6bb.png)
![130179273-10f9fbc3-0a62-4e1a-ab6e-7049d75653a1](https://user-images.githubusercontent.com/14937752/130186074-7d684669-b9ef-4a4e-8a2d-c63bb9800ddb.png)
![130179311-616545af-dde2-4dec-807f-dde0a0d4bfbe](https://user-images.githubusercontent.com/14937752/130186095-20669673-b1d3-4552-97bf-86bbc1a5d43b.png)
Environment
- Windows 10
- Google Chrome 92.0.4515.159

[images.pptx](https://github.com/apache/spark/files/7029087/images.pptx)

Closes #33786 from yoda-mon/replace-pyspark-doc-images.

Authored-by: Leona Yoda <yodal@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-26 19:03:02 +09:00
itholic fe486185c4 [SPARK-36537][PYTHON] Revisit disabled tests for CategoricalDtype
### What changes were proposed in this pull request?

This PR proposes to re-enable the tests that were disabled due to behavior differences with pandas 1.3.

- The `inplace` argument for `CategoricalDtype` functions is deprecated as of pandas 1.3, and it seems to have a bug. So we manually created the expected results and test against them.
- Fixed `GroupBy.transform`, since it doesn't work properly for `CategoricalDtype`.

### Why are the changes needed?

We should enable the tests as much as possible even if pandas has a bug.

And we should follow the behavior of the latest pandas.

### Does this PR introduce _any_ user-facing change?

Yes, `GroupBy.transform` now follows the behavior of the latest pandas.

### How was this patch tested?

Unittests.

Closes #33817 from itholic/SPARK-36537.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-26 17:43:49 +09:00
itholic 97e7d6e667 [SPARK-36505][PYTHON] Improve test coverage for frame.py
### What changes were proposed in this pull request?

This PR proposes improving test coverage for pandas-on-Spark DataFrame code base, which is written in `frame.py`.

This PR did the following to improve coverage:
- Add unittest for untested code
- Remove unused code
- Add arguments to some functions for testing

### Why are the changes needed?

To make the project healthier by improving coverage.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unittest.

Closes #33833 from itholic/SPARK-36505.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-26 17:43:00 +09:00
Hyukjin Kwon 93cec49212 [SPARK-36559][SQL][PYTHON] Create plans dedicated to distributed-sequence index for optimization
### What changes were proposed in this pull request?

This PR proposes to move distributed-sequence index implementation to SQL plan to leverage optimizations such as column pruning.

```python
import pyspark.pandas as ps
ps.set_option('compute.default_index_type', 'distributed-sequence')
ps.range(10).id.value_counts().to_frame().spark.explain()
```

**Before:**

```bash
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#51L DESC NULLS LAST], true, 0
   +- Exchange rangepartitioning(count#51L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#70]
      +- HashAggregate(keys=[id#37L], functions=[count(1)], output=[__index_level_0__#48L, count#51L])
         +- Exchange hashpartitioning(id#37L, 200), ENSURE_REQUIREMENTS, [id=#67]
            +- HashAggregate(keys=[id#37L], functions=[partial_count(1)], output=[id#37L, count#63L])
               +- Project [id#37L]
                  +- Filter atleastnnonnulls(1, id#37L)
                     +- Scan ExistingRDD[__index_level_0__#36L,id#37L]
                        # ^^^ Base DataFrame created by the output RDD from zipWithIndex (and checkpointed)
```

**After:**

```bash
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#275L DESC NULLS LAST], true, 0
   +- Exchange rangepartitioning(count#275L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#174]
      +- HashAggregate(keys=[id#258L], functions=[count(1)])
         +- HashAggregate(keys=[id#258L], functions=[partial_count(1)])
            +- Filter atleastnnonnulls(1, id#258L)
               +- Range (0, 10, step=1, splits=16)
                  # ^^^ Removed the Spark job execution for `zipWithIndex`
```

### Why are the changes needed?

To leverage the optimizations of the SQL engine and avoid an unnecessary shuffle when creating the default index.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unittests were added. Also, this PR will test all unittests in pandas API on Spark after switching the default index implementation to `distributed-sequence`.

Closes #33807 from HyukjinKwon/SPARK-36559.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-25 10:02:53 +09:00
Gengliang Wang de932f51ce Revert "[SPARK-34415][ML] Randomization in hyperparameter optimization"
### What changes were proposed in this pull request?

Revert 397b843890 and 5a48eb8d00

### Why are the changes needed?

As discussed in https://github.com/apache/spark/pull/33800#issuecomment-904140869, there is a correctness issue in the current implementation. Let's revert the code changes from branch 3.2 and fix it on the master branch later

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

CI tests

Closes #33819 from gengliangwang/revert-SPARK-34415.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-24 13:38:14 -07:00
Hyukjin Kwon fa53aa06d1 [SPARK-36560][PYTHON][INFRA] Deflake PySpark coverage job
### What changes were proposed in this pull request?

This PR proposes to increase timeouts for:
- `pyspark.sql.tests.test_streaming.StreamingTests.test_parameter_accuracy`
- `pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests.test_parameter_accuracy`

to deflake PySpark coverage build:

- https://github.com/apache/spark/runs/3392972609?check_suite_focus=true
- https://github.com/apache/spark/runs/3388727798?check_suite_focus=true
- https://github.com/apache/spark/runs/3359880048?check_suite_focus=true
- https://github.com/apache/spark/runs/3338876122?check_suite_focus=true

### Why are the changes needed?

To have more stable PySpark coverage report: https://app.codecov.io/gh/apache/spark

### Does this PR introduce _any_ user-facing change?

Spark developers will be able to see more stable results in https://app.codecov.io/gh/apache/spark

### How was this patch tested?

GitHub Actions' scheduled jobs will test them out.

Closes #33808 from HyukjinKwon/SPARK-36560.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-24 11:08:43 +09:00
Xinrong Meng 0b6af464dc [SPARK-36470][PYTHON] Implement CategoricalIndex.map and DatetimeIndex.map
### What changes were proposed in this pull request?
Implement `CategoricalIndex.map` and `DatetimeIndex.map`

`MultiIndex.map` cannot be implemented in the same way as the `map` of other indexes. It should be taken care of separately if necessary.

### Why are the changes needed?
Mapping values using input correspondence is a common operation that is supported in pandas. We shall support that as well.

### Does this PR introduce _any_ user-facing change?
Yes. `CategoricalIndex.map` and `DatetimeIndex.map` can be used now.

- CategoricalIndex.map

```py
>>> idx = ps.CategoricalIndex(['a', 'b', 'c'])
>>> idx
CategoricalIndex(['a', 'b', 'c'], categories=['a', 'b', 'c'], ordered=False, dtype='category')

>>> idx.map(lambda x: x.upper())
CategoricalIndex(['A', 'B', 'C'],  categories=['A', 'B', 'C'], ordered=False, dtype='category')

>>> pser = pd.Series([1, 2, 3], index=pd.CategoricalIndex(['a', 'b', 'c'], ordered=True))
>>> idx.map(pser)
CategoricalIndex([1, 2, 3], categories=[1, 2, 3], ordered=True, dtype='category')

>>> idx.map({'a': 'first', 'b': 'second', 'c': 'third'})
CategoricalIndex(['first', 'second', 'third'], categories=['first', 'second', 'third'], ordered=False, dtype='category')
```

- DatetimeIndex.map

```py
>>> pidx = pd.date_range(start="2020-08-08", end="2020-08-10")
>>> psidx = ps.from_pandas(pidx)

>>> mapper_dict = {
...   datetime.datetime(2020, 8, 8): datetime.datetime(2021, 8, 8),
...   datetime.datetime(2020, 8, 9): datetime.datetime(2021, 8, 9),
... }
>>> psidx.map(mapper_dict)
DatetimeIndex(['2021-08-08', '2021-08-09', 'NaT'], dtype='datetime64[ns]', freq=None)

>>> mapper_pser = pd.Series([1, 2, 3], index=pidx)
>>> psidx.map(mapper_pser)
Int64Index([1, 2, 3], dtype='int64')
>>> psidx
DatetimeIndex(['2020-08-08', '2020-08-09', '2020-08-10'], dtype='datetime64[ns]', freq=None)

>>> psidx.map(lambda x: x.strftime("%B %d, %Y, %r"))
Index(['August 08, 2020, 12:00:00 AM', 'August 09, 2020, 12:00:00 AM',
       'August 10, 2020, 12:00:00 AM'],
      dtype='object')
```

### How was this patch tested?
Unit tests.

Closes #33756 from xinrong-databricks/other_indexes_map.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-23 10:08:40 +09:00
itholic f2e593bcf1 [SPARK-36368][PYTHON] Fix CategoricalOps.astype to follow pandas 1.3
### What changes were proposed in this pull request?

This PR proposes to fix the behavior of `astype` for `CategoricalDtype` to follow pandas 1.3.

**Before:**
```python
>>> pcat
0    a
1    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

>>> pcat.astype(CategoricalDtype(["b", "c", "a"]))
0    a
1    b
2    c
dtype: category
Categories (3, object): ['b', 'c', 'a']
```

**After:**
```python
>>> pcat
0    a
1    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

>>> pcat.astype(CategoricalDtype(["b", "c", "a"]))
0    a
1    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']  # CategoricalDtype is not updated if dtype is the same
```

`CategoricalDtype` is treated as the same `dtype` if the unique values are the same.

```python
>>> pcat1 = pser.astype(CategoricalDtype(["b", "c", "a"]))
>>> pcat2 = pser.astype(CategoricalDtype(["a", "b", "c"]))
>>> pcat1.dtype == pcat2.dtype
True
```

### Why are the changes needed?

We should follow the latest pandas as much as possible.

### Does this PR introduce _any_ user-facing change?

Yes, the behavior is changed as in the examples in the PR description.

### How was this patch tested?

Unittest

Closes #33757 from itholic/SPARK-36368.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-08-18 11:38:59 -07:00
itholic c91ae544fd [SPARK-36388][SPARK-36386][PYTHON][FOLLOWUP] Fix DataFrame groupby-rolling and groupby-expanding to follow pandas 1.3
### What changes were proposed in this pull request?

This PR is followup for https://github.com/apache/spark/pull/33646 to add missing tests.

### Why are the changes needed?

Some tests were missing

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unittest

Closes #33776 from itholic/SPARK-36388-followup.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-08-18 11:17:01 -07:00
Takuya UESHIN 7fb8ea319e [SPARK-36370][PYTHON][FOLLOWUP] Use LooseVersion instead of pkg_resources.parse_version
### What changes were proposed in this pull request?

This is a follow-up of #33687.

Use `LooseVersion` instead of `pkg_resources.parse_version`.

### Why are the changes needed?

In the previous PR, `pkg_resources.parse_version` was used, but we should use `LooseVersion` instead to be consistent in the code base.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33768 from ueshin/issues/SPARK-36370/LooseVersion.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-18 10:36:09 +09:00
Cedric-Magnan 964dfe254f [SPARK-36370][PYTHON] _builtin_table directly imported from pandas instead of being redefined
### What changes were proposed in this pull request?
This PR refactors the way `_builtin_table` is defined in the `python/pyspark/pandas/groupby.py` module.
pandas recently refactored the way `_builtin_table` is imported: it is now part of the `pandas.core.common` module instead of being an attribute of the `pandas.core.base.SelectionMixin` class.
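A minimal sketch of what the version-dependent import could look like (the module paths follow the description above; the fallback shape is an assumption, not the PR's exact code):

```python
# Import pandas' _builtin_table instead of redefining it in PySpark.
try:
    from pandas.core.common import _builtin_table  # newer pandas
except ImportError:
    from pandas.core.base import SelectionMixin

    _builtin_table = SelectionMixin._builtin_table  # older pandas
```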

### Why are the changes needed?
This change is not strictly needed, but the current implementation redefines this table within PySpark, so any change to this table in the pandas library would also have to be replicated in the PySpark repository.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Ran the following command successfully:
```sh
python/run-tests --testnames 'pyspark.pandas.tests.test_groupby'
```
Tests passed in 327 seconds

Closes #33687 from Cedric-Magnan/_builtin_table_from_pandas.

Authored-by: Cedric-Magnan <cedric.magnan@artefact.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-08-17 10:46:49 -07:00
itholic c0441bb7e8 [SPARK-36387][PYTHON] Fix Series.astype from datetime to nullable string
### What changes were proposed in this pull request?

This PR proposes to fix `Series.astype` when converting datetime type to StringDtype, to match the behavior of pandas 1.3.

In pandas < 1.3,
```python
>>> pd.Series(["2020-10-27 00:00:01", None], name="datetime").astype("string")
0    2020-10-27 00:00:01
1                    NaT
Name: datetime, dtype: string
```

This is changed to

```python
>>> pd.Series(["2020-10-27 00:00:01", None], name="datetime").astype("string")
0    2020-10-27 00:00:01
1                   <NA>
Name: datetime, dtype: string
```

in pandas >= 1.3, so we follow the behavior of the latest pandas.

### Why are the changes needed?

Because pandas-on-Spark always follows the behavior of the latest pandas.

### Does this PR introduce _any_ user-facing change?

Yes, the behavior is changed to match the latest pandas when converting datetime to nullable string (StringDtype)

### How was this patch tested?

Unittest passed

Closes #33735 from itholic/SPARK-36387.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-08-17 10:29:16 -07:00
Xinrong Meng 4dcd746025 [SPARK-36469][PYTHON] Implement Index.map
### What changes were proposed in this pull request?
Implement `Index.map`.

The PR is based on https://github.com/databricks/koalas/pull/2136. Thanks awdavidson for the prototype.

`map` of CategoricalIndex and DatetimeIndex will be implemented in separate PRs.

### Why are the changes needed?
Mapping values using input correspondence (a dict, Series, or function) is supported in pandas as [Index.map](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.map.html).
We shall also support that.

### Does this PR introduce _any_ user-facing change?
Yes. `Index.map` is available now.

```py
>>> psidx = ps.Index([1, 2, 3])

>>> psidx.map({1: "one", 2: "two", 3: "three"})
Index(['one', 'two', 'three'], dtype='object')

>>> psidx.map(lambda id: "{id} + 1".format(id=id))
Index(['1 + 1', '2 + 1', '3 + 1'], dtype='object')

>>> pser = pd.Series(["one", "two", "three"], index=[1, 2, 3])
>>> psidx.map(pser)
Index(['one', 'two', 'three'], dtype='object')
```

### How was this patch tested?
Unit tests.

Closes #33694 from xinrong-databricks/index_map.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-08-16 11:06:10 -07:00
Liang-Chi Hsieh 8b8d91cf64 [SPARK-36465][SS] Dynamic gap duration in session window
### What changes were proposed in this pull request?

This patch supports dynamic gap duration in session window.

### Why are the changes needed?

The gap duration used in session window is currently a static value. To support more complex usages, it is better to support a dynamic gap duration that is determined by looking at the current data. For example, in our use case, the gap may differ depending on a certain column in the input rows.
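A hedged PySpark sketch of what a dynamic gap could look like (`events` and all column names are illustrative, not from the PR):

```python
from pyspark.sql import functions as F

# The gap duration may now be a per-row Column expression instead of a
# single static interval string.
sessions = events.groupBy(
    F.session_window(
        F.col("eventTime"),
        F.when(F.col("eventType") == "interactive", F.lit("5 minutes"))
        .otherwise(F.lit("20 minutes")),
    ),
    F.col("userId"),
).count()
```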

### Does this PR introduce _any_ user-facing change?

Yes, users can specify dynamic gap duration.

### How was this patch tested?

Modified existing tests and new test.

Closes #33691 from viirya/dynamic-session-window-gap.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-08-16 11:06:00 +09:00
Hyukjin Kwon ccead315b3 [SPARK-36474][PYTHON][DOCS] Mention 'pandas API on Spark' in Spark overview pages
### What changes were proposed in this pull request?

This PR proposes to mention pandas API on Spark at Spark overview pages.

### Why are the changes needed?

To mention the new component.

### Does this PR introduce _any_ user-facing change?

Yes, it changes the documentation.

### How was this patch tested?

Manually tested by MD editor. For `docs/index.md`, I manually checked by building the docs by `SKIP_API=1 bundle exec jekyll serve --watch`.

Closes #33699 from HyukjinKwon/SPARK-36474.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-11 22:57:26 +09:00
itholic b8508f4876 [SPARK-36388][SPARK-36386][PYTHON] Fix DataFrame groupby-rolling and groupby-expanding to follow pandas 1.3
### What changes were proposed in this pull request?

This PR proposes to fix `RollingGroupBy` and `ExpandingGroupBy` to follow latest pandas behavior.

`RollingGroupBy` and `ExpandingGroupBy` no longer return the grouped-by column in values from pandas 1.3.

Before:
```python
>>> df = pd.DataFrame({"A": [1, 1, 2, 3], "B": [0, 1, 2, 3]})
>>> df.groupby("A").rolling(2).sum()
       A    B
A
1 0  NaN  NaN
  1  2.0  1.0
2 2  NaN  NaN
3 3  NaN  NaN
```

After:
```python
>>> df = pd.DataFrame({"A": [1, 1, 2, 3], "B": [0, 1, 2, 3]})
>>> df.groupby("A").rolling(2).sum()
       B
A
1 0  NaN
  1  1.0
2 2  NaN
3 3  NaN
```

### Why are the changes needed?

We should follow the behavior of pandas as much as possible.

### Does this PR introduce _any_ user-facing change?

Yes, the result of `RollingGroupBy` and `ExpandingGroupBy` is changed as described above.

### How was this patch tested?

Unit tests.

Closes #33646 from itholic/SPARK-36388.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-10 10:12:52 +09:00
itholic a9f371c247 [SPARK-36369][PYTHON] Fix Index.union to follow pandas 1.3
### What changes were proposed in this pull request?

This PR proposes fixing `Index.union` to follow the behavior of pandas 1.3.

Before:
```python
>>> ps_idx1 = ps.Index([1, 1, 1, 1, 1, 2, 2])
>>> ps_idx2 = ps.Index([1, 1, 2, 2, 2, 2, 2])
>>> ps_idx1.union(ps_idx2)
Int64Index([1, 1, 1, 1, 1, 2, 2], dtype='int64')
```

After:
```python
>>> ps_idx1 = ps.Index([1, 1, 1, 1, 1, 2, 2])
>>> ps_idx2 = ps.Index([1, 1, 2, 2, 2, 2, 2])
>>> ps_idx1.union(ps_idx2)
Int64Index([1, 1, 1, 1, 1, 2, 2, 2, 2, 2], dtype='int64')
```

This bug is fixed in https://github.com/pandas-dev/pandas/issues/36289.

### Why are the changes needed?

We should follow the behavior of pandas as much as possible.

### Does this PR introduce _any_ user-facing change?

Yes, the results for some cases that have duplicate values will change.

### How was this patch tested?

Unit test.

Closes #33634 from itholic/SPARK-36369.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-09 11:10:01 +09:00
Weichen Xu f9f6c0d350 [SPARK-36425][PYSPARK][ML] Support CrossValidatorModel get standard deviation of metrics for each paramMap
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

### What changes were proposed in this pull request?
Support CrossValidatorModel get standard deviation of metrics for each paramMap.

### Why are the changes needed?
So that in mlflow autologging, we can log standard deviation of metrics which is useful.

### Does this PR introduce _any_ user-facing change?
Yes.
`CrossValidatorModel` now has a public attribute `stdMetrics`, which holds the standard deviation of the metric for each paramMap.
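A hedged usage sketch (assuming `cv` is a configured `CrossValidator` and `train_df` a training DataFrame; names are illustrative):

```python
# stdMetrics parallels avgMetrics: one entry per paramMap, holding the
# standard deviation of the evaluation metric across folds.
cv_model = cv.fit(train_df)
for params, avg, std in zip(cv_model.getEstimatorParamMaps(),
                            cv_model.avgMetrics,
                            cv_model.stdMetrics):
    print(params, avg, std)
```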

### How was this patch tested?
Unit test.

Closes #33652 from WeichenXu123/add_std_metric.

Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-09 10:08:52 +09:00
Kousuke Saruta 856c9a58f8 [SPARK-36173][CORE][PYTHON][FOLLOWUP] Add type hint for TaskContext.cpus
### What changes were proposed in this pull request?

This PR adds a type hint for `TaskContext.cpus`, added in SPARK-36173 (#33385), as sketched below.
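A minimal sketch of the kind of stub this adds (the actual `taskcontext.pyi` content may differ):

```python
class TaskContext:
    def cpus(self) -> int: ...
```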

### Why are the changes needed?

To comply with Project Zen.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Confirmed the type hint works with IntelliJ IDEA.

Closes #33645 from sarutak/taskcontext-pyi.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-06 10:56:10 +09:00
Wu, Xiaochang f6e6d1157a [SPARK-36173][CORE] Support getting CPU number in TaskContext
In stage-level resource scheduling, the allocated 3rd-party resources can be obtained in TaskContext using the resources() interface; however, there is no API to get how many CPUs are allocated for the task. This adds a cpus() interface to TaskContext to complement resources(). Although the task CPU requests can be obtained from the resource profile, it's more convenient to get them inside the task code without needing to pass the profile from the driver side to the executor side.

### What changes were proposed in this pull request?
Add cpus() interface in TaskContext and modify relevant code.

### Why are the changes needed?
TaskContext has resources() to get the allocated 3rd-party resources, but there is no API to get the number of CPUs allocated for the task.

### Does this PR introduce _any_ user-facing change?
Yes, it adds a cpus() interface to TaskContext.
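A hedged usage sketch (`sc` is an assumed SparkContext; the function name is illustrative):

```python
from pyspark import TaskContext

def report_cpus(partition):
    # cpus() complements resources(): it returns the number of CPUs
    # allocated to the running task.
    ctx = TaskContext.get()
    yield (ctx.partitionId(), ctx.cpus(), sum(1 for _ in partition))

# sc.parallelize(range(100), 4).mapPartitions(report_cpus).collect()
```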

### How was this patch tested?
Unit tests

Closes #33385 from xwu99/taskcontext-cpus.

Lead-authored-by: Wu, Xiaochang <xiaochang.wu@intel.com>
Co-authored-by: Xiaochang Wu <xiaochang.wu@intel.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2021-08-04 21:14:01 -05:00
itholic 3d72c20e64 [SPARK-35811][PYTHON][FOLLOWUP] Deprecate DataFrame.to_spark_io
### What changes were proposed in this pull request?

This PR is followup for https://github.com/apache/spark/pull/32964, to improve the warning message.

### Why are the changes needed?

To improve the warning message.

### Does this PR introduce _any_ user-facing change?

The warning is changed from "Deprecated in 3.2, Use `spark.to_spark_io` instead." to "Deprecated in 3.2, Use `DataFrame.spark.to_spark_io` instead."

### How was this patch tested?

Manually run `dev/lint-python`

Closes #33631 from itholic/SPARK-35811-followup.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-04 16:20:29 +09:00
Xinrong Meng 8ca11fe39f [SPARK-36192][PYTHON] Better error messages for DataTypeOps against lists
### What changes were proposed in this pull request?
Better error messages for DataTypeOps against lists.

### Why are the changes needed?
Currently, DataTypeOps against lists throws a Py4JJavaError; we shall throw a TypeError with a proper message instead.

### Does this PR introduce _any_ user-facing change?
Yes. A TypeError message will be shown rather than a Py4JJavaError.

From:
```py
>>> import pyspark.pandas as ps
>>> ps.Series([1, 2, 3]) > [3, 2, 1]
Traceback (most recent call last):
...
py4j.protocol.Py4JJavaError: An error occurred while calling o107.gt.
: java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList [3, 2, 1]
...
```

To:
```py
>>> import pyspark.pandas as ps
>>> ps.Series([1, 2, 3]) > [3, 2, 1]
Traceback (most recent call last):
...
TypeError: The operation can not be applied to list.
```

### How was this patch tested?
Unit tests.

Closes #33581 from xinrong-databricks/data_type_ops_list.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-03 16:25:49 +09:00
Takuya UESHIN 8cb9cf39b6 [SPARK-36345][SPARK-36367][INFRA][PYTHON] Disable tests failed by the incompatible behavior of pandas 1.3
### What changes were proposed in this pull request?

Disable tests failed by the incompatible behavior of pandas 1.3.

### Why are the changes needed?

Pandas 1.3 has been released.
There are some behavior changes and we should follow them, but we're not ready yet.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Disabled some tests related to the behavior change.

Closes #33598 from ueshin/issues/SPARK-36367/disable_tests.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-03 14:02:18 +09:00
Linhong Liu 2f700773c2 [SPARK-36224][SQL] Use Void as the type name of NullType
### What changes were proposed in this pull request?
Change the `NullType.simpleString` to "void" to set "void" as the formal type name of `NullType`

### Why are the changes needed?
This PR is intended to address the type name discussion in PR #28833. Here are the reasons:
1. The type name of NullType is displayed everywhere, e.g. schema string, error message, document. Hence it's not possible to hide it from users, we have to choose a proper name
2. The "void" is widely used as the type name of "NULL", e.g. Hive, pgSQL
3. Changing to "void" enables the round trip of `toDDL`/`fromDDL` for NullType (i.e., it makes `from_json(col, schema.toDDL)` work); see the sketch after this list
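A hedged PySpark sketch of the round trip that point 3 enables (`spark` is an assumed active session; column names are illustrative):

```python
from pyspark.sql import functions as F

df = spark.createDataFrame([('{"a": null, "b": 1}',)], ["json"])
# With "void" as the formal name, a NULL-typed field can be written in a
# DDL schema string and parsed back:
parsed = df.select(F.from_json("json", "a void, b int").alias("s"))
```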

### Does this PR introduce _any_ user-facing change?
Yes, the type name of "NULL" is changed from "null" to "void". for example:
```
scala> sql("select null as a, 1 as b").schema.catalogString
res5: String = struct<a:void,b:int>
```

### How was this patch tested?
existing test cases

Closes #33437 from linhongliu-db/SPARK-36224-void-type-name.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-02 23:19:54 +08:00
Hyukjin Kwon c0d1860f25 [SPARK-36092][INFRA][BUILD][PYTHON] Migrate to GitHub Actions with Codecov from Jenkins
### What changes were proposed in this pull request?

This PR proposes to migrate the coverage report from Jenkins to GitHub Actions by setting up a daily cron job.

### Why are the changes needed?

For some background, currently PySpark code coverage is being reported in this specific Jenkins job: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/

Because of the security issue between [Codecov service](https://app.codecov.io/gh/) and Jenkins machines, we had to work around by manually hosting a coverage site via GitHub pages, see also https://spark-test.github.io/pyspark-coverage-site/ by spark-test account (which is shared to only subset of PMC members).

Since we now run the build via GitHub Actions, we can leverage [Codecov plugin](https://github.com/codecov/codecov-action), and remove the workaround we used.

### Does this PR introduce _any_ user-facing change?

Virtually no. Coverage site (UI) might change but the information it holds should be virtually the same.

### How was this patch tested?

I manually tested:
- Scheduled run: https://github.com/HyukjinKwon/spark/actions/runs/1082261484
- Coverage report: 73f0291a7d/python/pyspark
- Run against a PR: https://github.com/HyukjinKwon/spark/actions/runs/1082367175

Closes #33591 from HyukjinKwon/SPARK-36092.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-01 21:37:19 +09:00
Enrico Minack a65eb36bae [SPARK-36319][SQL][PYTHON] Make Observation return Map instead of Row
### What changes were proposed in this pull request?
The Observation API (Scala, Java, PySpark) now returns a `Map` / `Dict`. Before, it returned `Row` simply because the metrics are (internal to Observation) retrieved from the listener as rows. Since that is hidden from the user by the Observation API, there is no need to return `Row`.

While touching this code, this moves the unit tests from `DataFrameSuite.scala` to `DatasetSuite.scala` and from `JavaDataFrameSuite.java` to `JavaDatasetSuite.java`, which is a better place.

### Why are the changes needed?
This simplifies the API and accessing the metrics, especially in Java. There is no need for the concept `Row` when retrieving the observation result.

### Does this PR introduce _any_ user-facing change?
Yes, it changes the return type of `get` from `Row` to `Map` (Scala) / `Dict` (Python) and introduces `getAsJavaMap` (Java).

### How was this patch tested?
This is tested in `DatasetSuite.SPARK-34806: observation on datasets`, `JavaDatasetSuite.testObservation` and `test_dataframe.test_observe`.

Closes #33545 from EnricoMi/branch-observation-returns-map.

Authored-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-01 10:40:28 +09:00
Yikun Jiang f04e991e6a [SPARK-35976][PYTHON] Adjust astype method for ExtensionDtype in pandas API on Spark
### What changes were proposed in this pull request?
This patch sets the value to `<NA>` (`pd.NA`) in BooleanExtensionOps and StringExtensionOps.

### Why are the changes needed?
The pandas behavior:
```python
>>> pd.Series([True, False, None], dtype="boolean").astype(str).tolist()
['True', 'False', '<NA>']
>>> pd.Series(['s1', 's2', None], dtype="string").astype(str).tolist()
['s1', 's2', '<NA>']
```

pandas on spark
```python
>>> import pandas as pd
>>> from pyspark import pandas as ps

# Before
>>> ps.from_pandas(pd.Series([True, False, None], dtype="boolean")).astype(str).tolist()
['True', 'False', 'None']
>>> ps.from_pandas(pd.Series(['s1', 's2', None], dtype="string")).astype(str).tolist()
['s1', 's2', 'None']

# After
>>> ps.from_pandas(pd.Series([True, False, None], dtype="boolean")).astype(str).tolist()
['True', 'False', '<NA>']
>>> ps.from_pandas(pd.Series(['s1', 's2', None], dtype="string")).astype(str).tolist()
['s1', 's2', '<NA>']
```

See more in [SPARK-35976](https://issues.apache.org/jira/browse/SPARK-35976)

### Does this PR introduce _any_ user-facing change?
Yes, `<NA>` is returned for None, following the pandas behavior

### How was this patch tested?
Changed the unit tests to cover this scenario.

Closes #33585 from Yikun/SPARK-35976.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-01 10:37:25 +09:00
Takuya UESHIN 90d31dfcb7 [SPARK-36365][PYTHON] Remove old workarounds related to null ordering
### What changes were proposed in this pull request?

Remove old workarounds related to null ordering.

### Why are the changes needed?

In pandas-on-Spark, there are still some remaining places that call `Column._jc.(asc|desc)_nulls_(first|last)` as a workaround carried over from Koalas to support Spark 2.3.
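For illustration, the direct Column API that replaces the workaround (`sdf` is an assumed Spark DataFrame):

```python
from pyspark.sql import functions as F

# The Koalas-era workaround went through the JVM column:
#   Column(sdf["x"]._jc.asc_nulls_first())
# Spark >= 2.4 exposes null ordering on the Python Column API directly:
sdf_sorted = sdf.orderBy(F.col("x").asc_nulls_first())
```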

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Modified a couple of tests and existing tests.

Closes #33597 from ueshin/issues/SPARK-36365/nulls_first_last.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-01 10:33:25 +09:00
Hyukjin Kwon 74a6b9d23b [SPARK-36338][PYTHON][FOLLOW-UP] Keep the original default value as 'sequence' in default index in pandas on Spark
### What changes were proposed in this pull request?

This PR is a followup of https://github.com/apache/spark/pull/33570, which mistakenly changed the default value of the default index.

### Why are the changes needed?

It was changed to check whether the tests actually pass, but I forgot to change it back.

### Does this PR introduce _any_ user-facing change?

No, it's not related yet. It fixes up the default value that was mistakenly changed.
(The changed default value makes the test flaky because the ordering is affected by an extra shuffle.)

### How was this patch tested?

Manually tested.

Closes #33596 from HyukjinKwon/SPARK-36338-followup.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-31 08:31:10 +09:00
Takuya UESHIN 895e3f5e2a [SPARK-36350][PYTHON] Move some logic related to F.nanvl to DataTypeOps
### What changes were proposed in this pull request?

Move some logic related to `F.nanvl` to `DataTypeOps`.

### Why are the changes needed?

There are several places that branch on `FloatType` or `DoubleType` to use `F.nanvl`, but `DataTypeOps` should handle it, as sketched below.
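A sketch of the branching pattern being centralized (`sdf` is an assumed DataFrame; the real change lives inside `DataTypeOps`):

```python
from pyspark.sql import functions as F

# nanvl(col1, col2) returns col1 unless it is NaN, in which case col2;
# turning NaN into null for a double column looks like:
cleaned = sdf.withColumn("x", F.nanvl(F.col("x"), F.lit(None).cast("double")))
```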

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33582 from ueshin/issues/SPARK-36350/nan_to_null.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-30 11:19:49 -07:00
Hyukjin Kwon c6140d4d0a [SPARK-36338][PYTHON][SQL] Move distributed-sequence implementation to Scala side
### What changes were proposed in this pull request?

This PR proposes to implement `distributed-sequence` index in Scala side.

### Why are the changes needed?

- Avoid unnecessary (de)serialization
- Keep the nullability in the input DataFrame when `distributed-sequence` is enabled. During serialization, all fields currently become nullable (see https://github.com/apache/spark/pull/32775#discussion_r645882104)

### Does this PR introduce _any_ user-facing change?

No to end users since pandas API on Spark is not released yet.

```python
import pyspark.pandas as ps
ps.set_option('compute.default_index_type', 'distributed-sequence')
ps.range(1).spark.print_schema()
```

Before:

```
root
 |-- id: long (nullable = true)
```

After:

```
root
 |-- id: long (nullable = false)
```

### How was this patch tested?

Manually tested, and existing tests should cover them.

Closes #33570 from HyukjinKwon/SPARK-36338.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-30 22:29:23 +09:00
Hyukjin Kwon dd2ca0aee2 [SPARK-36254][PYTHON][FOLLOW-UP] Skip mlflow related tests in pandas on Spark
### What changes were proposed in this pull request?

This PR is a partial revert of https://github.com/apache/spark/pull/33567 that keeps the logic to skip mlflow related tests if that's not installed.

### Why are the changes needed?

It's consistent with other libraries, e.g., PyArrow.
It also fixes up the potential dev breakage (see also https://github.com/apache/spark/pull/33567#issuecomment-889841829)

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

This is a partial revert. CI should test it out too.

Closes #33589 from HyukjinKwon/SPARK-36254.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-30 22:28:19 +09:00
itholic abce61f3fd [SPARK-36254][INFRA][PYTHON] Install mlflow in Github Actions CI
### What changes were proposed in this pull request?

This PR proposes adding the Python packages `mlflow` and `sklearn` to enable the MLflow test in pandas API on Spark.

### Why are the changes needed?

To enable the MLflow test in pandas API on Spark.

### Does this PR introduce _any_ user-facing change?

No, it's test-only

### How was this patch tested?

Manually tested locally with `python/run-tests --testnames pyspark.pandas.mlflow`.

Closes #33567 from itholic/SPARK-36254.

Lead-authored-by: itholic <haejoon.lee@databricks.com>
Co-authored-by: Haejoon Lee <44108233+itholic@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-30 00:04:48 -07:00
itholic 94cb2bbbc2 [SPARK-35806][PYTHON][FOLLOW-UP] Mapping the mode argument to pandas in DataFrame.to_csv
### What changes were proposed in this pull request?

This PR is a follow-up of https://github.com/apache/spark/pull/32964 to support more options for the `mode` argument for all APIs that have a `mode` argument, not only `DataFrame.to_csv`.

### Why are the changes needed?

To keep the usage consistent across arguments that share the same name.

### Does this PR introduce _any_ user-facing change?

More options are available for all APIs that have a `mode` argument, same as `DataFrame.to_csv`

### How was this patch tested?

Manually tested locally

Closes #33569 from itholic/SPARK-35085-followup.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-30 12:48:24 +09:00
Takuya UESHIN 07ed82be0b [SPARK-36333][PYTHON] Reuse isnull where the null check is needed
### What changes were proposed in this pull request?

Reuse `IndexOpsMixin.isnull()` where the null check is needed.

### Why are the changes needed?

There are some places where we can reuse `IndexOpsMixin.isnull()` instead of directly using Spark `Column`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33562 from ueshin/issues/SPARK-36333/reuse_isnull.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-29 15:33:11 -07:00
Samuel Moseley a07df1acc6 [SPARK-36161][PYTHON] Add type check on dropDuplicates pyspark function
### What changes were proposed in this pull request?
Improve the error message for a wrong type when calling dropDuplicates in PySpark, as sketched below.
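A sketch of the kind of guard this adds (the condition and message approximate the real change rather than quoting it):

```python
def dropDuplicates(self, subset=None):
    if subset is not None and (
        not isinstance(subset, (list, tuple))
        or not all(isinstance(c, str) for c in subset)
    ):
        raise TypeError("Parameter 'subset' must be a list of columns")
    ...
```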

### Why are the changes needed?
The current error message is cryptic and can be unclear to less experienced users.

### Does this PR introduce _any_ user-facing change?
Yes, it raises a TypeError when a user gives the wrong type to dropDuplicates

### How was this patch tested?
There is currently no testing for error messages in pyspark dataframe functions

Closes #33364 from sammyjmoseley/sm/add-type-checking-for-drop-duplicates.

Lead-authored-by: Samuel Moseley <smoseley@palantir.com>
Co-authored-by: Sammy Moseley <moseley.sammy@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-29 19:11:48 +09:00
Xinrong Meng 9c5cb99d6e [SPARK-36190][PYTHON] Improve the rest of DataTypeOps tests by avoiding joins
### What changes were proposed in this pull request?
Improve the rest of DataTypeOps tests by avoiding joins.

### Why are the changes needed?
The bool, string, and numeric DataTypeOps tests have been improved by avoiding joins.
We should improve the rest of the DataTypeOps tests in the same way.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Unit tests.

Closes #33546 from xinrong-databricks/test_no_join.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-28 15:53:38 -07:00
Xinrong Meng 01213095e2 [SPARK-36143][PYTHON] Adjust astype of fractional Series with missing values to follow pandas
### What changes were proposed in this pull request?
Adjust `astype` of fractional Series with missing values to follow pandas.

Non-goal: Adjust the issue of `astype` of Decimal Series with missing values to follow pandas.

### Why are the changes needed?
`astype` of a fractional Series with missing values doesn't behave the same as pandas; for example, a float Series returns itself on `astype(int)`, while pandas raises a ValueError.

We ought to follow pandas.

### Does this PR introduce _any_ user-facing change?
Yes.

From:
```py
>>> import numpy as np
>>> import pyspark.pandas as ps
>>> psser = ps.Series([1, 2, np.nan])
>>> psser.astype(int)
0    1.0
1    2.0
2    NaN
dtype: float64

```

To:
```py
>>> import numpy as np
>>> import pyspark.pandas as ps
>>> psser = ps.Series([1, 2, np.nan])
>>> psser.astype(int)
Traceback (most recent call last):
...
ValueError: Cannot convert fractions with missing values to integer

```

### How was this patch tested?
Unit tests.

Closes #33466 from xinrong-databricks/extension_astype.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-28 11:26:48 -07:00
Takuya UESHIN 3c76a924ce [SPARK-36320][PYTHON] Fix Series/Index.copy() to drop extra columns
### What changes were proposed in this pull request?

Fix `Series`/`Index.copy()` to drop extra columns.

### Why are the changes needed?

Currently `Series`/`Index.copy()` keeps a copy of the anchor DataFrame, which holds unnecessary columns.
We can drop those when `Series`/`Index.copy()`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33549 from ueshin/issues/SPARK-36320/index_ops_copy.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-28 18:39:53 +09:00
Takuya UESHIN bcc595c112 [SPARK-36310][PYTHON] Fix IndexOpsMixin.hasnans to use isnull().any()
### What changes were proposed in this pull request?

Fix `IndexOpsMixin.hasnans` to use `IndexOpsMixin.isnull().any()`.
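A schematic sketch of the change (not the actual class body):

```python
class IndexOpsMixin:
    @property
    def hasnans(self) -> bool:
        # Reusing the existing null check avoids the window-function error
        # and returns False rather than None for an empty Series/Index.
        return self.isnull().any()
```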

### Why are the changes needed?

`IndexOpsMixin.hasnans` has a potential issue that can cause the error `a window function inside an aggregate function`.
It also returns a wrong value when the `Series`/`Index` is empty.

```py
>>> ps.Series([]).hasnans
None
```

whereas:

```py
>>> pd.Series([]).hasnans
False
```

`IndexOpsMixin.any()` is safe for both cases.

### Does this PR introduce _any_ user-facing change?

`IndexOpsMixin.hasnans` will return `False` when empty.

### How was this patch tested?

Added some tests.

Closes #33547 from ueshin/issues/SPARK-36310/hasnan.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-28 09:21:12 +09:00
Enrico Minack f90eb6a5db [SPARK-36263][SQL][PYTHON] Add Dataframe.observation to PySpark
### What changes were proposed in this pull request?
With SPARK-34806 we can now easily add an equivalent for `Dataset.observe(Observation, Column, Column*)` to PySpark's `DataFrame` API.

### Why are the changes needed?
This further aligns the Python DataFrame API with Scala Dataset API.

### Does this PR introduce _any_ user-facing change?
Yes, it adds the `Observation` class and the `DataFrame.observe` method.
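A minimal usage sketch (DataFrame contents are illustrative; `get` returns a dict once SPARK-36319, above, is in):

```python
from pyspark.sql import Observation, SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,), (3,)], ["x"])

obs = Observation("stats")
observed = df.observe(obs, F.count(F.lit(1)).alias("rows"), F.max("x").alias("max_x"))
observed.collect()  # metrics become available only after an action
print(obs.get)      # e.g. {'rows': 3, 'max_x': 3}
```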

### How was this patch tested?
Adds test `test_observe` to `pyspark.sql.test.test_dataframe`.

Closes #33484 from EnricoMi/branch-observation-python.

Authored-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-28 01:39:34 +08:00
Luran He ede1bc6b51 [SPARK-36211][PYTHON] Correct typing of udf return value
The following code should type-check:

```python3
import uuid

import pyspark.sql.functions as F

my_udf = F.udf(lambda: str(uuid.uuid4())).asNondeterministic()
```

### What changes were proposed in this pull request?

The `udf` function should return a more specific type.

### Why are the changes needed?

Right now, `mypy` will throw spurious errors, such as for the code given above.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

This was not tested. Sorry, I am not very familiar with this repo -- are there any typing tests?

Closes #33399 from luranhe/patch-1.

Lead-authored-by: Luran He <luranjhe@gmail.com>
Co-authored-by: Luran He <luran.he@compass.com>
Signed-off-by: zero323 <mszymkiewicz@gmail.com>
2021-07-27 09:07:22 +02:00
Leona 9a47483f74 [SPARK-36288][DOCS][PYTHON] Update API usage on pyspark pandas documents
### What changes were proposed in this pull request?

Update api usage examples on PySpark pandas API documents.

### Why are the changes needed?

If users try to use the PySpark pandas API from the documents, they will see some API deprecation warnings.
It is kind to users to update those documents to avoid confusion.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

```
make html
```

Closes #33519 from yoda-mon/update-pyspark-configurations.

Authored-by: Leona <yodal@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-27 12:30:52 +09:00
Takuya UESHIN c40d9d46f1 [SPARK-36267][PYTHON] Clean up CategoricalAccessor and CategoricalIndex
### What changes were proposed in this pull request?

Clean up `CategoricalAccessor` and `CategoricalIndex`.

- Clean up the classes
- Add deprecation warnings
- Clean up the docs

### Why are the changes needed?

To finalize the series of PRs for `CategoricalAccessor` and `CategoricalIndex`, we should clean up the classes.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33528 from ueshin/issues/SPARK-36267/cleanup.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-27 12:17:18 +09:00
Yikun Jiang d52c2de08b [SPARK-36142][PYTHON] Follow Pandas when pow between fractional series with Na and bool literal
### What changes were proposed in this pull request?

Set the result to 1 when the exponent is 0 (or False).

### Why are the changes needed?
Currently, exponentiation between fractional series and bools is not consistent with pandas' behavior.
```
>>> pser = pd.Series([1, 2, np.nan], dtype=float)
>>> psser = ps.from_pandas(pser)
>>> pser ** False
0    1.0
1    1.0
2    1.0
dtype: float64
>>> psser ** False
0    1.0
1    1.0
2    NaN
dtype: float64
```
We ought to adjust that.

See more in [SPARK-36142](https://issues.apache.org/jira/browse/SPARK-36142)

### Does this PR introduce _any_ user-facing change?
Yes, it introduces a user-facing change: pow between a fractional Series with missing values and a bool literal now returns a different result, following the pandas behavior.

### How was this patch tested?
- Added the `test_pow_with_float_nan` unit test
- Existing tests in `test_pow`

Closes #33521 from Yikun/SPARK-36142.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-27 12:06:05 +09:00
Xinrong Meng 55971b70fe [SPARK-36260][PYTHON] Add set_categories to CategoricalAccessor and CategoricalIndex
### What changes were proposed in this pull request?
Add set_categories to CategoricalAccessor and CategoricalIndex.

### Why are the changes needed?
set_categories is supported in pandas CategoricalAccessor and CategoricalIndex. We ought to follow pandas.

### Does this PR introduce _any_ user-facing change?
Yes, users will be able to use `set_categories`.
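A hedged usage sketch mirroring the pandas API (values are illustrative):

```python
import pandas as pd
import pyspark.pandas as ps

psser = ps.Series(pd.Categorical(["a", "b", "c"]))
# Reorder and extend the categories; values absent from the new categories
# would become missing, mirroring pandas semantics.
reordered = psser.cat.set_categories(["c", "b", "a", "d"], ordered=True)
```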

### How was this patch tested?
Unit tests.

Closes #33506 from xinrong-databricks/set_categories.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-26 17:12:33 -07:00
Dominik Gehl ae1c20ee0d [SPARK-36225][PYTHON][DOCS] Use DataFrame in python docstrings
### What changes were proposed in this pull request?
Changing references to Dataset in python docstrings to DataFrame

### Why are the changes needed?
no Dataset class in pyspark

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Doc change only

Closes #33438 from dominikgehl/feature/SPARK-36225.

Lead-authored-by: Dominik Gehl <dog@open.ch>
Co-authored-by: Dominik Gehl <gehl@fastmail.fm>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-24 16:58:10 +09:00
Takuya UESHIN 663cbdfbe5 [SPARK-36279][INFRA][PYTHON] Fix lint-python to work with Python 3.9
### What changes were proposed in this pull request?

Fix `lint-python` to pick `PYTHON_EXECUTABLE` from the environment variable first to switch the Python and explicitly specify `PYTHON_EXECUTABLE` to use `python3.9` in CI.

### Why are the changes needed?

Currently `lint-python` uses `python3`, but it's not the one we expect in CI.
As a result, the `black` check is not working.

```
The python3 -m black command was not found. Skipping black checks for now.
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The `black` check in `lint-python` should work.

Closes #33507 from ueshin/issues/SPARK-36279/lint-python.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-24 16:49:11 +09:00
Xinrong Meng 85adc2ff60 [SPARK-36274][PYTHON] Fix equality comparison of unordered Categoricals
### What changes were proposed in this pull request?
Fix equality comparison of unordered Categoricals.

### Why are the changes needed?
Codes of a Categorical Series are used for Series equality comparison. However, that doesn't apply to unordered Categoricals, where the same value can have different codes when the same categories appear in a different order.

So we should map codes to value respectively and then compare the equality of value.

### Does this PR introduce _any_ user-facing change?
Yes.
From:
```py
>>> psser1 = ps.Series(pd.Categorical(list("abca")))
>>> psser2 = ps.Series(pd.Categorical(list("bcaa"), categories=list("bca")))
>>> with ps.option_context("compute.ops_on_diff_frames", True):
...     (psser1 == psser2).sort_index()
...
0     True
1     True
2     True
3    False
dtype: bool
```

To:
```py
>>> psser1 = ps.Series(pd.Categorical(list("abca")))
>>> psser2 = ps.Series(pd.Categorical(list("bcaa"), categories=list("bca")))
>>> with ps.option_context("compute.ops_on_diff_frames", True):
...     (psser1 == psser2).sort_index()
...
0    False
1    False
2    False
3     True
dtype: bool
```

### How was this patch tested?
Unit tests.

Closes #33497 from xinrong-databricks/cat_bug.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-23 18:30:59 -07:00