Commit graph

118 commits

Author SHA1 Message Date
Xinrong Meng 999cf81653 [SPARK-36190][PYTHON] Improve the rest of DataTypeOps tests by avoiding joins
### What changes were proposed in this pull request?
Improve the rest of DataTypeOps tests by avoiding joins.

### Why are the changes needed?
bool, string, numeric DataTypeOps tests have been improved by avoiding joins.
We should improve the rest of the DataTypeOps tests in the same way.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Unit tests.

Closes #33546 from xinrong-databricks/test_no_join.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 9c5cb99d6e)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-28 15:53:51 -07:00
Xinrong Meng bdf1570911 [SPARK-36143][PYTHON] Adjust astype of fractional Series with missing values to follow pandas
### What changes were proposed in this pull request?
Adjust `astype` of fractional Series with missing values to follow pandas.

Non-goal: adjusting `astype` of Decimal Series with missing values to follow pandas.

### Why are the changes needed?
`astype` of a fractional Series with missing values doesn't behave the same as pandas: for example, a float Series returns itself unchanged when cast to integer via `astype`, whereas pandas raises a ValueError.

We ought to follow pandas.
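
For reference, pandas itself raises an error in this case; a quick check (message text as in recent pandas versions):

```py
>>> import numpy as np
>>> import pandas as pd
>>> pd.Series([1, 2, np.nan]).astype(int)
Traceback (most recent call last):
...
ValueError: Cannot convert non-finite values (NA or inf) to integer
```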

### Does this PR introduce _any_ user-facing change?
Yes.

From:
```py
>>> import numpy as np
>>> import pyspark.pandas as ps
>>> psser = ps.Series([1, 2, np.nan])
>>> psser.astype(int)
0    1.0
1    2.0
2    NaN
dtype: float64

```

To:
```py
>>> import numpy as np
>>> import pyspark.pandas as ps
>>> psser = ps.Series([1, 2, np.nan])
>>> psser.astype(int)
Traceback (most recent call last):
...
ValueError: Cannot convert fractions with missing values to integer

```

### How was this patch tested?
Unit tests.

Closes #33466 from xinrong-databricks/extension_astype.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 01213095e2)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-28 11:26:58 -07:00
Takuya UESHIN c9af94ecb4 [SPARK-36320][PYTHON] Fix Series/Index.copy() to drop extra columns
### What changes were proposed in this pull request?

Fix `Series`/`Index.copy()` to drop extra columns.

### Why are the changes needed?

Currently `Series`/`Index.copy()` keeps a copy of the anchor DataFrame, which holds unnecessary columns.
We can drop those columns when `Series`/`Index.copy()` is called.
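
A hypothetical probe of the anchor frame via the internal API (illustrative only; `_internal.spark_frame` is not a public interface):

```py
>>> import pyspark.pandas as ps
>>> psdf = ps.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> psser = psdf.a
>>> psser.copy()._internal.spark_frame.columns  # after the fix, "b" should be dropped
```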

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33549 from ueshin/issues/SPARK-36320/index_ops_copy.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 3c76a924ce)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-28 18:40:03 +09:00
Takuya UESHIN 0e9e737a84 [SPARK-36310][PYTHON] Fix IndexOpsMixin.hasnans to use isnull().any()
### What changes were proposed in this pull request?

Fix `IndexOpsMixin.hasnans` to use `IndexOpsMixin.isnull().any()`.

### Why are the changes needed?

`IndexOpsMixin.hasnans` can potentially cause `a window function inside an aggregate function` error.
It also returns a wrong value when the `Series`/`Index` is empty.

```py
>>> ps.Series([]).hasnans
None
```

whereas:

```py
>>> pd.Series([]).hasnans
False
```

`IndexOpsMixin.isnull().any()` is safe in both cases.
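
A rough sketch of the equivalent user-level expression that `hasnans` now delegates to:

```py
>>> import pyspark.pandas as ps
>>> ps.Series([1.0, None]).isnull().any()
True
>>> ps.Series([], dtype=float).isnull().any()
False
```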

### Does this PR introduce _any_ user-facing change?

`IndexOpsMixin.hasnans` will return `False` when empty.

### How was this patch tested?

Added some tests.

Closes #33547 from ueshin/issues/SPARK-36310/hasnan.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit bcc595c112)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-28 09:21:22 +09:00
Takuya UESHIN f278f771e6 [SPARK-36267][PYTHON] Clean up CategoricalAccessor and CategoricalIndex
### What changes were proposed in this pull request?

Clean up `CategoricalAccessor` and `CategoricalIndex`.

- Clean up the classes
- Add deprecation warnings
- Clean up the docs

### Why are the changes needed?

To finalize the series of PRs for `CategoricalAccessor` and `CategoricalIndex`, we should clean up the classes.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33528 from ueshin/issues/SPARK-36267/cleanup.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit c40d9d46f1)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-27 12:17:26 +09:00
Yikun Jiang 139536c3ed [SPARK-36142][PYTHON] Follow Pandas when pow between fractional series with Na and bool literal
### What changes were proposed in this pull request?

Set the result to 1 when the exponent is 0 (or False).

### Why are the changes needed?
Currently, exponentiation between fractional series and bools is not consistent with pandas' behavior.
```
>>> pser = pd.Series([1, 2, np.nan], dtype=float)
>>> psser = ps.from_pandas(pser)
>>> pser ** False
0    1.0
1    1.0
2    1.0
dtype: float64
>>> psser ** False
0    1.0
1    1.0
2    NaN
dtype: float64
```
We ought to adjust that.

See more in [SPARK-36142](https://issues.apache.org/jira/browse/SPARK-36142)
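
The convention follows Python/NumPy itself, where any base raised to the power zero is 1, even NaN:

```py
>>> import numpy as np
>>> np.nan ** 0
1.0
```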

### Does this PR introduce _any_ user-facing change?
Yes. `pow` between a fractional Series with missing values and a bool literal now returns a different result, following pandas' behavior.

### How was this patch tested?
- Added the `test_pow_with_float_nan` unit test
- Existing tests in `test_pow`

Closes #33521 from Yikun/SPARK-36142.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit d52c2de08b)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-27 12:07:10 +09:00
Xinrong Meng 1641812e97 [SPARK-36260][PYTHON] Add set_categories to CategoricalAccessor and CategoricalIndex
### What changes were proposed in this pull request?
Add set_categories to CategoricalAccessor and CategoricalIndex.

### Why are the changes needed?
set_categories is supported in pandas CategoricalAccessor and CategoricalIndex. We ought to follow pandas.

### Does this PR introduce _any_ user-facing change?
Yes, users will be able to use `set_categories`.
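
A brief sketch of the expected usage, mirroring pandas (output rendered as pandas would):

```py
>>> import pyspark.pandas as ps
>>> s = ps.Series(["a", "b", "a"], dtype="category")
>>> s.cat.set_categories(["b", "a", "c"])
0    a
1    b
2    a
dtype: category
Categories (3, object): ['b', 'a', 'c']
```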

### How was this patch tested?
Unit tests.

Closes #33506 from xinrong-databricks/set_categories.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 55971b70fe)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-26 17:12:45 -07:00
Takuya UESHIN c1434b1928 [SPARK-36279][INFRA][PYTHON] Fix lint-python to work with Python 3.9
### What changes were proposed in this pull request?

Fix `lint-python` to pick up `PYTHON_EXECUTABLE` from the environment first, so the Python executable can be switched, and explicitly set `PYTHON_EXECUTABLE` to `python3.9` in CI.
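
With this change, the linter's Python can presumably be switched via the environment variable, e.g. `PYTHON_EXECUTABLE=python3.9 ./dev/lint-python` (invocation shown for illustration).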

### Why are the changes needed?

Currently `lint-python` uses `python3`, but it's not the one we expect in CI.
As a result, `black` check is not working.

```
The python3 -m black command was not found. Skipping black checks for now.
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The `black` check in `lint-python` should work.

Closes #33507 from ueshin/issues/SPARK-36279/lint-python.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 663cbdfbe5)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-24 16:49:51 +09:00
Xinrong Meng 44cfce8548 [SPARK-36274][PYTHON] Fix equality comparison of unordered Categoricals
### What changes were proposed in this pull request?
Fix equality comparison of unordered Categoricals.

### Why are the changes needed?
Codes of a Categorical Series are used for Series equality comparison. However, that doesn't work for unordered Categoricals, where the same value can have different codes when the same categories appear in a different order.

So we should map codes back to their values and compare the values for equality.
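
For instance, in pandas itself the codes can coincide even though the values differ, which is exactly the buggy comparison shown below:

```py
>>> import pandas as pd
>>> pd.Categorical(list("abca")).codes
array([0, 1, 2, 0], dtype=int8)
>>> pd.Categorical(list("bcaa"), categories=list("bca")).codes
array([0, 1, 2, 2], dtype=int8)
```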

### Does this PR introduce _any_ user-facing change?
Yes.
From:
```py
>>> psser1 = ps.Series(pd.Categorical(list("abca")))
>>> psser2 = ps.Series(pd.Categorical(list("bcaa"), categories=list("bca")))
>>> with ps.option_context("compute.ops_on_diff_frames", True):
...     (psser1 == psser2).sort_index()
...
0     True
1     True
2     True
3    False
dtype: bool
```

To:
```py
>>> psser1 = ps.Series(pd.Categorical(list("abca")))
>>> psser2 = ps.Series(pd.Categorical(list("bcaa"), categories=list("bca")))
>>> with ps.option_context("compute.ops_on_diff_frames", True):
...     (psser1 == psser2).sort_index()
...
0    False
1    False
2    False
3     True
dtype: bool
```

### How was this patch tested?
Unit tests.

Closes #33497 from xinrong-databricks/cat_bug.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 85adc2ff60)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-23 18:31:18 -07:00
Takuya UESHIN ab5224c45b [SPARK-36264][PYTHON] Add reorder_categories to CategoricalAccessor and CategoricalIndex
### What changes were proposed in this pull request?

Add `reorder_categories` to `CategoricalAccessor` and `CategoricalIndex`.

### Why are the changes needed?

We should implement `reorder_categories` in `CategoricalAccessor` and `CategoricalIndex`.

### Does this PR introduce _any_ user-facing change?

Yes, users will be able to use `reorder_categories`.
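
A brief sketch of the expected usage, mirroring pandas:

```py
>>> import pyspark.pandas as ps
>>> s = ps.Series(["a", "a", "b"], dtype="category")
>>> s.cat.reorder_categories(["b", "a"])
0    a
1    a
2    b
dtype: category
Categories (2, object): ['b', 'a']
```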

### How was this patch tested?

Added some tests.

Closes #33499 from ueshin/issues/SPARK-36264/reorder_categories.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit e12bc4d31d)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-23 17:19:32 -07:00
Takuya UESHIN 4abc1d389e [SPARK-36261][PYTHON] Add remove_unused_categories to CategoricalAccessor and CategoricalIndex
### What changes were proposed in this pull request?

Add `remove_unused_categories` to `CategoricalAccessor` and `CategoricalIndex`.

### Why are the changes needed?

We should implement `remove_unused_categories` in `CategoricalAccessor` and `CategoricalIndex`.

### Does this PR introduce _any_ user-facing change?

Yes, users will be able to use `remove_unused_categories`.
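
A brief sketch of the expected usage, mirroring pandas:

```py
>>> import pyspark.pandas as ps
>>> from pandas.api.types import CategoricalDtype
>>> s = ps.Series(["a", "b", "a"], dtype=CategoricalDtype(["a", "b", "c"]))
>>> s.cat.remove_unused_categories()
0    a
1    b
2    a
dtype: category
Categories (2, object): ['a', 'b']
```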

### How was this patch tested?

Added some tests.

Closes #33485 from ueshin/issues/SPARK-36261/remove_unused_categories.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 2fe12a7520)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-23 14:05:09 +09:00
Xinrong Meng 37e5a10477 [SPARK-36248][PYTHON] Add rename_categories to CategoricalAccessor and CategoricalIndex
### What changes were proposed in this pull request?
Add rename_categories to CategoricalAccessor and CategoricalIndex.

### Why are the changes needed?
rename_categories is supported in pandas CategoricalAccessor and CategoricalIndex. We ought to follow pandas.

### Does this PR introduce _any_ user-facing change?
Yes. `rename_categories` is supported in pandas API on Spark now.

```py
# CategoricalIndex
>>> psser = ps.CategoricalIndex(["a", "a", "b"])
>>> psser.rename_categories([0, 1])
CategoricalIndex([0, 0, 1], categories=[0, 1], ordered=False, dtype='category')
>>> psser.rename_categories({'a': 'A', 'c': 'C'})
CategoricalIndex(['A', 'A', 'b'], categories=['A', 'b'], ordered=False, dtype='category')
>>> psser.rename_categories(lambda x: x.upper())
CategoricalIndex(['A', 'A', 'B'], categories=['A', 'B'], ordered=False, dtype='category')

# CategoricalAccessor
>>> s = ps.Series(["a", "a", "b"], dtype="category")
>>> s.cat.rename_categories([0, 1])
0    0
1    0
2    1
dtype: category
Categories (2, int64): [0, 1]
>>> s.cat.rename_categories({'a': 'A', 'c': 'C'})
0    A
1    A
2    b
dtype: category
Categories (2, object): ['A', 'b']
>>> s.cat.rename_categories(lambda x: x.upper())
0    A
1    A
2    B
dtype: category
Categories (2, object): ['A', 'B']
```

### How was this patch tested?
Unit tests.

Closes #33471 from xinrong-databricks/category_rename_categories.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 8b3d84bb7e)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-23 12:26:35 +09:00
Xinrong Meng aeab18edd7 [SPARK-36189][PYTHON] Improve bool, string, numeric DataTypeOps tests by avoiding joins
### What changes were proposed in this pull request?
Improve bool, string, numeric DataTypeOps tests by avoiding joins.

Previously, bool, string, and numeric DataTypeOps tests were conducted between two different Series.
After this PR, bool, string, and numeric DataTypeOps tests operate on a single DataFrame.
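
A rough sketch of the single-DataFrame pattern, inside a test case (names are illustrative; `assert_eq` is the usual pandas-on-Spark test helper, with `pd`/`ps` imported as in the test modules):

```py
# both operands come from one DataFrame, so no join between frames is triggered
pdf = pd.DataFrame({"this": [1, 2, 3], "that": [4, 5, 6]})
psdf = ps.from_pandas(pdf)
self.assert_eq(psdf["this"] + psdf["that"], pdf["this"] + pdf["that"])
```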

### Why are the changes needed?
A considerable number of DataTypeOps tests operate on different Series, so a join is needed, which takes a long time.
We should avoid joins to shorten the test duration.

The majority of joins happen in bool, string, numeric DataTypeOps tests, so we improve them first.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Unit tests.

Closes #33402 from xinrong-databricks/datatypeops_diffframe.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 75fd1f5b82)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-23 12:20:45 +09:00
Takuya UESHIN 9a004ae12d [SPARK-36265][PYTHON] Use __getitem__ instead of getItem to suppress warnings
### What changes were proposed in this pull request?

Use `Column.__getitem__` instead of `Column.getItem` to suppress warnings.

### Why are the changes needed?

In the pandas API on Spark code base, there are some places that use `Column.getItem` with a `Column` object, which shows a deprecation warning.
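
A minimal, self-contained illustration of the substitution (hypothetical column names):

```py
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1).select(
    F.create_map(F.lit("k"), F.lit("v")).alias("m"), F.lit("k").alias("key")
)

df.select(df.m.getItem(df.key)).show()  # emits the FutureWarning shown below
df.select(df.m[df.key]).show()          # same result via __getitem__, no warning
```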

### Does this PR introduce _any_ user-facing change?

Yes, users won't see the warnings anymore.

- before

```py
>>> s = ps.Series(list("abbccc"), dtype="category")
>>> s.astype(str)
/path/to/spark/python/pyspark/sql/column.py:322: FutureWarning: A column as 'key' in getItem is deprecated as of Spark 3.0, and will not be supported in the future release. Use `column[key]` or `column.key` syntax instead.
  warnings.warn(
0    a
1    b
2    b
3    c
4    c
5    c
dtype: object
```

- after

```py
>>> s = ps.Series(list("abbccc"), dtype="category")
>>> s.astype(str)
0    a
1    b
2    b
3    c
4    c
5    c
dtype: object
```

### How was this patch tested?

Existing tests.

Closes #33486 from ueshin/issues/SPARK-36265/getitem.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit a76a087f7f)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-23 11:27:42 +09:00
itholic 156ed24c52 [SPARK-35810][PYTHON][FOLLOWUP] Deprecate ps.broadcast API
### What changes were proposed in this pull request?

This PR follows up #33379 to fix a build error in Sphinx.

### Why are the changes needed?

The Sphinx build fails due to a missing newline in a docstring.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manually test the Sphinx build

Closes #33479 from itholic/SPARK-35810-followup.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit d1a037a27c)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-22 17:10:14 +09:00
itholic 3a18864c5f [SPARK-35809][PYTHON] Add index_col argument for ps.sql
### What changes were proposed in this pull request?

This PR proposes adding an `index_col` argument to the `ps.sql` function, to preserve the index when users want it.

Note that `reset_index()` has to be performed before using `ps.sql` with `index_col`.

```python
>>> psdf
   A  B
a  1  4
b  2  5
c  3  6
>>> psdf_reset_index = psdf.reset_index()
>>> ps.sql("SELECT * from {psdf_reset_index} WHERE A > 1", index_col="index")
       A  B
index
b      2  5
c      3  6
```

Otherwise, the index is always lost.

```python
>>> ps.sql("SELECT * from {psdf} WHERE A > 1")
   A  B
0  2  5
1  3  6
```

### Why are the changes needed?

The index is one of the key objects for existing pandas users, so we should provide a way to keep it after running `ps.sql`.

### Does this PR introduce _any_ user-facing change?

Yes, the new argument is added.

### How was this patch tested?

Added a unit test and manually checked that the build passes.

Closes #33450 from itholic/SPARK-35809.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 6578f0b135)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-22 17:08:42 +09:00
Takuya UESHIN 0e94e42cd3 [SPARK-36249][PYTHON] Add remove_categories to CategoricalAccessor and CategoricalIndex
### What changes were proposed in this pull request?

Add `remove_categories` to `CategoricalAccessor` and `CategoricalIndex`.

### Why are the changes needed?

We should implement `remove_categories` in `CategoricalAccessor` and `CategoricalIndex`.

### Does this PR introduce _any_ user-facing change?

Yes, users will be able to use `remove_categories`.
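
A brief sketch of the expected usage, mirroring pandas (removed values become missing):

```py
>>> import pyspark.pandas as ps
>>> s = ps.Series(["a", "b", "c"], dtype="category")
>>> s.cat.remove_categories(["b"])
0      a
1    NaN
2      c
dtype: category
Categories (2, object): ['a', 'c']
```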

### How was this patch tested?

Added some tests.

Closes #33474 from ueshin/issues/SPARK-36249/remove_categories.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit a3c7ae18e2)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-22 17:06:25 +09:00
Takuya UESHIN f83a9ec2fd [SPARK-36214][PYTHON] Add add_categories to CategoricalAccessor and CategoricalIndex
### What changes were proposed in this pull request?

Add `add_categories` to `CategoricalAccessor` and `CategoricalIndex`.

### Why are the changes needed?

We should implement `add_categories` in `CategoricalAccessor` and `CategoricalIndex`.

### Does this PR introduce _any_ user-facing change?

Yes, users will be able to use `add_categories`.
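
A brief sketch of the expected usage, mirroring pandas:

```py
>>> import pyspark.pandas as ps
>>> s = ps.Series(["a", "b", "a"], dtype="category")
>>> s.cat.add_categories(["c"])
0    a
1    b
2    a
dtype: category
Categories (3, object): ['a', 'b', 'c']
```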

### How was this patch tested?

Added some tests.

Closes #33470 from ueshin/issues/SPARK-36214/add_categories.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit dcc0aaa3ef)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-21 22:34:15 -07:00
Hyukjin Kwon c42866e627 [SPARK-36253][PYTHON][DOCS] Add versionadded to the top of pandas-on-Spark package
### What changes were proposed in this pull request?

This PR adds the version in which pandas API on Spark was added to the PySpark documentation.

### Why are the changes needed?

To document the version added.
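
The change is presumably a Sphinx directive of this shape at the top of the `pyspark.pandas` package docstring (illustrative only):

```py
"""pandas API on Spark

.. versionadded:: 3.2.0
"""
```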

### Does this PR introduce _any_ user-facing change?

No to end user. Spark 3.2 is not released yet.

### How was this patch tested?

Linter and documentation build.

Closes #33473 from HyukjinKwon/SPARK-36253.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit f3e29574d9)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-22 14:21:53 +09:00
Takuya UESHIN 24095bfb07 [SPARK-36188][PYTHON] Add categories setter to CategoricalAccessor and CategoricalIndex
### What changes were proposed in this pull request?

Add categories setter to `CategoricalAccessor` and `CategoricalIndex`.

### Why are the changes needed?

We should implement categories setter in `CategoricalAccessor` and `CategoricalIndex`.

### Does this PR introduce _any_ user-facing change?

Yes, users will be able to use categories setter.
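
A hedged sketch of the setter, mirroring the (since-deprecated) pandas semantics of relabeling in place:

```py
>>> import pyspark.pandas as ps
>>> s = ps.Series(["a", "b", "a"], dtype="category")
>>> s.cat.categories = ["x", "y"]
>>> s
0    x
1    y
2    x
dtype: category
Categories (2, object): ['x', 'y']
```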

### How was this patch tested?

Added some tests.

Closes #33448 from ueshin/issues/SPARK-36188/categories_setter.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit d506815a92)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-21 11:31:46 -07:00
Takuya UESHIN a3a13da26c [SPARK-36186][PYTHON] Add as_ordered/as_unordered to CategoricalAccessor and CategoricalIndex
### What changes were proposed in this pull request?

Add `as_ordered`/`as_unordered` to `CategoricalAccessor` and `CategoricalIndex`.

### Why are the changes needed?

We should implement `as_ordered`/`as_unordered` in `CategoricalAccessor` and `CategoricalIndex`.

### Does this PR introduce _any_ user-facing change?

Yes, users will be able to use `as_ordered`/`as_unordered`.
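
A brief sketch, mirroring pandas (ordered categories render with `<`):

```py
>>> import pyspark.pandas as ps
>>> s = ps.Series(["a", "b", "a"], dtype="category")
>>> s.cat.as_ordered()
0    a
1    b
2    a
dtype: category
Categories (2, object): ['a' < 'b']
```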

### How was this patch tested?

Added some tests.

Closes #33400 from ueshin/issues/SPARK-36186/as_ordered_unordered.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 376fadc89c)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-20 18:24:09 -07:00
Takuya UESHIN 55c9dbd4d2 [SPARK-36167][PYTHON][3.2] Revisit more InternalField managements
### What changes were proposed in this pull request?

This is a backport of #33377.

Revisit and manage `InternalField` in more places.

### Why are the changes needed?

There are other places we can manage `InternalField`, and we can keep extension dtypes or `CategoricalDtype`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added some tests.

Closes #33384 from ueshin/issues/SPARK-36167/3.2/internal_field.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-20 09:30:35 +09:00
Xinrong Meng 48fadee158 [SPARK-36127][PYTHON] Support comparison between a Categorical and a scalar
### What changes were proposed in this pull request?
Support comparison between a Categorical and a scalar.
There are 3 main changes:
- Modify `==` and `!=` to compare the **actual values** of the Categorical to the scalar rather than its **codes**.
- Support `<`, `<=`, `>`, `>=` between a Categorical and a scalar.
- TypeError message fix.

### Why are the changes needed?
pandas supports comparison between a Categorical and a scalar; we should follow pandas' behavior.

### Does this PR introduce _any_ user-facing change?
Yes.

Before:
```py
>>> import pyspark.pandas as ps
>>> import pandas as pd
>>> from pandas.api.types import CategoricalDtype
>>> pser = pd.Series(pd.Categorical([1, 2, 3], categories=[3, 2, 1], ordered=True))
>>> psser = ps.from_pandas(pser)
>>> psser == 2
0     True
1    False
2    False
dtype: bool
>>> psser <= 1
Traceback (most recent call last):
...
NotImplementedError: <= can not be applied to categoricals.
```

After:
```py
>>> import pyspark.pandas as ps
>>> import pandas as pd
>>> from pandas.api.types import CategoricalDtype
>>> pser = pd.Series(pd.Categorical([1, 2, 3], categories=[3, 2, 1], ordered=True))
>>> psser = ps.from_pandas(pser)
>>> psser == 2
0    False
1     True
2    False
dtype: bool
>>> psser <= 1
0    True
1    True
2    True
dtype: bool

```

### How was this patch tested?
Unit tests.

Closes #33373 from xinrong-databricks/categorical_eq.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 8dd43351d5)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-19 15:06:56 -07:00
itholic 8d58211b9d [SPARK-35806][PYTHON] Mapping the mode argument to pandas in DataFrame.to_csv
### What changes were proposed in this pull request?

`DataFrame.to_csv` has a `mode` argument both in pandas and in pandas API on Spark.

However, pandas allows the strings "w", "w+", "a", "a+", whereas pandas-on-Spark allows "append", "overwrite", "ignore", "error" or "errorifexists".

We should map them, while `mode` still accepts the existing parameters ("append", "overwrite", "ignore", "error" or "errorifexists") as well.

### Why are the changes needed?

APIs in pandas-on-Spark should follow the behavior of pandas to prevent existing pandas code from breaking.

### Does this PR introduce _any_ user-facing change?

`DataFrame.to_csv` can now accept "w", "w+", "a", "a+" as well, same as pandas.
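
Presumably the pandas-style strings map onto the Spark save modes, so calls like these become equivalent (mapping assumed, path hypothetical):

```py
import pyspark.pandas as ps

psdf = ps.DataFrame({"a": [1, 2, 3]})
psdf.to_csv("/tmp/out", mode="w")          # assumed equivalent to "overwrite"
psdf.to_csv("/tmp/out", mode="overwrite")  # existing Spark-style mode still works
```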

### How was this patch tested?

Add the unit test and manually write the file with the new acceptable strings.

Closes #33414 from itholic/SPARK-35806.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 2f42afc53a)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-19 19:58:19 +09:00
itholic 80a9644372 [SPARK-35810][PYTHON] Deprecate ps.broadcast API
### What changes were proposed in this pull request?

The `broadcast` function in `pyspark.pandas` duplicates `DataFrame.spark.hint` with `"broadcast"`.

```python
# The below 2 lines are the same
df.spark.hint("broadcast")
ps.broadcast(df)
```

So, we should remove `broadcast` in the future, and show a deprecation warning for now.

### Why are the changes needed?

For deduplication of functions

### Does this PR introduce _any_ user-facing change?

Users see a deprecation warning when using `broadcast` in `pyspark.pandas`.

```python
>>> ps.broadcast(df)
FutureWarning: `broadcast` has been deprecated and will be removed in a future version. use `DataFrame.spark.hint` with 'broadcast' for `name` parameter instead.
  warnings.warn(
```

### How was this patch tested?

Manually checked the warning message and saw that the build passed.

Closes #33379 from itholic/SPARK-35810.

Lead-authored-by: itholic <haejoon.lee@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Co-authored-by: Haejoon Lee <44108233+itholic@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 67e6120a85)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-19 10:45:16 +09:00
Hyukjin Kwon e9cc81151d [SPARK-36146][PYTHON][INFRA][TESTS] Upgrade Python version from 3.6 to 3.9 in GitHub Actions' linter/docs
### What changes were proposed in this pull request?

This PR proposes to use Python 3.9 in the documentation and linter jobs at GitHub Actions. It also contains fixes for the mypy check, surfaced by the Python 3.9 upgrade:

```
python/pyspark/sql/pandas/_typing/protocols/frame.pyi:64: error: Name "np.ndarray" is not defined
python/pyspark/sql/pandas/_typing/protocols/frame.pyi:91: error: Name "np.recarray" is not defined
python/pyspark/sql/pandas/_typing/protocols/frame.pyi:165: error: Name "np.ndarray" is not defined
python/pyspark/pandas/categorical.py:82: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "categories"
python/pyspark/pandas/categorical.py:109: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "ordered"
python/pyspark/ml/linalg/__init__.pyi:184: error: Return type "ndarray[Any, Any]" of "toArray" incompatible with return type "NoReturn" in supertype "Matrix"
python/pyspark/ml/linalg/__init__.pyi:217: error: Return type "ndarray[Any, Any]" of "toArray" incompatible with return type "NoReturn" in supertype "Matrix"
python/pyspark/pandas/typedef/typehints.py:163: error: Module has no attribute "bool"; maybe "bool_" or "bool8"?
python/pyspark/pandas/typedef/typehints.py:174: error: Module has no attribute "float"; maybe "float_", "cfloat", or "float96"?
python/pyspark/pandas/typedef/typehints.py:180: error: Module has no attribute "int"; maybe "uint", "rint", or "intp"?
python/pyspark/pandas/ml.py:81: error: Value of type variable "_DTypeScalar_co" of "dtype" cannot be "object"
python/pyspark/pandas/indexing.py:1649: error: Module has no attribute "int"; maybe "uint", "rint", or "intp"?
python/pyspark/pandas/indexing.py:1656: error: Module has no attribute "int"; maybe "uint", "rint", or "intp"?
python/pyspark/pandas/frame.py:4969: error: Function "numpy.array" is not valid as a type
python/pyspark/pandas/frame.py:4969: note: Perhaps you need "Callable[...]" or a callback protocol?
python/pyspark/pandas/frame.py:4970: error: Function "numpy.array" is not valid as a type
python/pyspark/pandas/frame.py:4970: note: Perhaps you need "Callable[...]" or a callback protocol?
python/pyspark/pandas/frame.py:7402: error: "List[Any]" has no attribute "tolist"
python/pyspark/pandas/series.py:1030: error: Module has no attribute "_NoValue"
python/pyspark/pandas/series.py:1031: error: Module has no attribute "_NoValue"
python/pyspark/pandas/indexes/category.py:159: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "categories"
python/pyspark/pandas/indexes/category.py:180: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "ordered"
python/pyspark/pandas/namespace.py:2036: error: Argument 1 to "column_name" has incompatible type "float"; expected "str"
python/pyspark/pandas/mlflow.py:59: error: Incompatible types in assignment (expression has type "Type[floating[Any]]", variable has type "str")
python/pyspark/pandas/data_type_ops/categorical_ops.py:43: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "categories"
python/pyspark/pandas/data_type_ops/categorical_ops.py:43: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "ordered"
python/pyspark/pandas/data_type_ops/categorical_ops.py:56: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "categories"
python/pyspark/pandas/tests/test_typedef.py:70: error: Name "np.float" is not defined
python/pyspark/pandas/tests/test_typedef.py:77: error: Name "np.float" is not defined
python/pyspark/pandas/tests/test_typedef.py:85: error: Name "np.float" is not defined
python/pyspark/pandas/tests/test_typedef.py:100: error: Name "np.float" is not defined
python/pyspark/pandas/tests/test_typedef.py:108: error: Name "np.float" is not defined
python/pyspark/mllib/clustering.pyi:152: error: Incompatible types in assignment (expression has type "ndarray[Any, Any]", base class "KMeansModel" defined the type as "List[ndarray[Any, Any]]")
python/pyspark/mllib/classification.pyi:93: error: Signature of "predict" incompatible with supertype "LinearClassificationModel"
Found 32 errors in 15 files (checked 315 source files)
1
```

### Why are the changes needed?

Python 3.6 was deprecated in SPARK-35938.

### Does this PR introduce _any_ user-facing change?

No. There may be minor effects on static analysis via type hints, but they are non-breaking.

### How was this patch tested?

Manually checked via a GitHub Actions build in a forked repository.

Closes #33356 from HyukjinKwon/SPARK-36146.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit a71dd6af2f)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-16 11:41:53 +09:00
Xinrong Meng ca8a3f2e23 [SPARK-36125][PYTHON] Implement non-equality comparison operators between two Categoricals
### What changes were proposed in this pull request?
Implement non-equality comparison operators between two Categoricals.
Non-goal: supporting Scalar input will be a follow-up task.

### Why are the changes needed?
pandas supports non-equality comparisons between two Categoricals. We should follow that.

### Does this PR introduce _any_ user-facing change?
Yes. No `NotImplementedError` for `<`, `<=`, `>`, `>=` operators between two Categoricals. An example is shown as below:

From:
```py
>>> import pyspark.pandas as ps
>>> from pandas.api.types import CategoricalDtype
>>> psser = ps.Series([1, 2, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))
>>> other_psser = ps.Series([2, 1, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))
>>> with ps.option_context("compute.ops_on_diff_frames", True):
...     psser <= other_psser
...
Traceback (most recent call last):
...
NotImplementedError: <= can not be applied to categoricals.
```

To:
```py
>>> import pyspark.pandas as ps
>>> from pandas.api.types import CategoricalDtype
>>> psser = ps.Series([1, 2, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))
>>> other_psser = ps.Series([2, 1, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))
>>> with ps.option_context("compute.ops_on_diff_frames", True):
...     psser <= other_psser
...
0    False
1     True
2     True
dtype: bool
```
### How was this patch tested?
Unit tests.

Closes #33331 from xinrong-databricks/categorical_compare.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 0cb120f390)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-14 14:01:23 -07:00
Kousuke Saruta 9e802f25aa [SPARK-36104][PYTHON][FOLLOWUP] Remove unused import "typing.cast"
### What changes were proposed in this pull request?

This is a followup PR for SPARK-36104 (#33307) and removes unused import `typing.cast`.
After that change, the Python linter fails:
```
   ./dev/lint-python
  shell: sh -e {0}
  env:
    LC_ALL: C.UTF-8
    LANG: C.UTF-8
    pythonLocation: /__t/Python/3.6.13/x64
    LD_LIBRARY_PATH: /__t/Python/3.6.13/x64/lib
starting python compilation test...
python compilation succeeded.

starting black test...
black checks passed.

starting pycodestyle test...
pycodestyle checks passed.

starting flake8 test...
flake8 checks failed:
./python/pyspark/pandas/data_type_ops/num_ops.py:19:1: F401 'typing.cast' imported but unused
from typing import cast, Any, Union
^
1     F401 'typing.cast' imported but unused
```

### Why are the changes needed?

To recover CI.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #33315 from sarutak/followup-SPARK-36104.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 47fd3173a5)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-13 13:13:44 +09:00
Xinrong Meng eae79dd31b [SPARK-36104][PYTHON] Manage InternalField in DataTypeOps.neg/abs
### What changes were proposed in this pull request?
Manage InternalField for DataTypeOps.neg/abs.

### Why are the changes needed?
The Spark data type and nullability must be the same as the original after `DataTypeOps.neg`/`abs`.
We should manage `InternalField` for this case.
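
A hedged probe of the invariant, using the existing `spark` accessor (both equalities are expected to hold after the fix):

```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([1, 2, 3])
>>> psser.spark.data_type == (-psser).spark.data_type  # Spark type preserved
True
>>> psser.spark.nullable == (-psser).spark.nullable    # nullability preserved
True
```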

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Unit tests.

Closes #33307 from xinrong-databricks/internalField.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 5afc27f899)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-13 12:07:14 +09:00
Takuya UESHIN 8d9758ee46 [SPARK-36103][PYTHON] Manage InternalField in DataTypeOps.invert
### What changes were proposed in this pull request?

Properly set `InternalField` for `DataTypeOps.invert`.

### Why are the changes needed?

The Spark data type and nullability must be the same as the original after `DataTypeOps.invert`.
We should manage `InternalField` for this case.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33306 from ueshin/issues/SPARK-36103/invert.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit e2021daafb)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-13 09:22:37 +09:00
Xinrong Meng 606a99c01e [SPARK-36003][PYTHON] Implement unary operator invert of integral ps.Series/Index
### What changes were proposed in this pull request?
Implement unary operator `invert` of integral ps.Series/Index.

### Why are the changes needed?
Currently, the unary operator `invert` of an integral ps.Series/Index is not supported. We ought to implement it, following pandas' behavior.

### Does this PR introduce _any_ user-facing change?
Yes.
Before:
```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([1, 2, 3])
>>> ~psser
Traceback (most recent call last):
...
NotImplementedError: Unary ~ can not be applied to integrals.
```

After:
```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([1, 2, 3])
>>> ~psser
0   -2
1   -3
2   -4
dtype: int64
```

### How was this patch tested?
Unit tests.

Closes #33285 from xinrong-databricks/numeric_invert.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit badb0393d4)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-12 15:10:37 +09:00
Takuya UESHIN 455c8922e2 [SPARK-36064][PYTHON] Manage InternalField more in DataTypeOps
### What changes were proposed in this pull request?

Properly set `InternalField` more in `DataTypeOps`.

### Why are the changes needed?

There are more places in `DataTypeOps` where we can manage `InternalField`.
We should manage `InternalField` for these cases.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33275 from ueshin/issues/SPARK-36064/fields.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 95e6c6e3e9)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-12 11:55:20 +09:00
Xinrong Meng 862178b2a0 [SPARK-36035][PYTHON] Adjust test_astype, test_neg for old pandas versions
### What changes were proposed in this pull request?
Adjust `test_astype`, `test_neg`  for old pandas versions.

### Why are the changes needed?
There are issues in old pandas versions that fail tests in pandas API on Spark. We ought to adjust `test_astype` and `test_neg` for old pandas versions.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Unit tests. Please refer to https://github.com/apache/spark/pull/33272 for test results with pandas 1.0.1.

Closes #33250 from xinrong-databricks/SPARK-36035.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 698c4ec16b)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-09 17:24:33 +09:00
Yikun Jiang fd277dc036 [SPARK-36002][PYTHON] Consolidate tests for data-type-based operations of decimal Series
### What changes were proposed in this pull request?
Merge test_decimal_ops into test_num_ops

- merge test_isnull() into test_num_ops.test_isnull()
- remove test_datatype_ops(), which is already covered in 11fcbc73cb/python/pyspark/pandas/tests/data_type_ops/test_base.py (L58-L59)

### Why are the changes needed?
Tests for data-type-based operations of decimal Series are in two places:

- python/pyspark/pandas/tests/data_type_ops/test_decimal_ops.py
- python/pyspark/pandas/tests/data_type_ops/test_num_ops.py

We'd better merge test_decimal_ops into test_num_ops.

See also [SPARK-36002](https://issues.apache.org/jira/browse/SPARK-36002) .

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
unittests passed

Closes #33206 from Yikun/SPARK-36002.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit fdc50f4452)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-09 14:08:23 +09:00
Xinrong Meng cb9bd5f455 [SPARK-36001][PYTHON] Assume result's index to be disordered in tests with operations on different Series
### What changes were proposed in this pull request?
For tests with operations on different Series, sort the index of the results before comparing them with pandas.

### Why are the changes needed?
We have many tests with operations on different Series in `spark/python/pyspark/pandas/tests/data_type_ops/` that assume the result's index is sorted and then compare to pandas' behavior.

The assumption about the result's index ordering is wrong, since a Spark DataFrame join is used internally and the order is not preserved if the data is in different partitions.

So we should assume the result to be disordered and sort the index of such results before comparing them with pandas.
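
A self-contained sketch of the adjusted pattern:

```py
import pandas as pd
import pyspark.pandas as ps

pser1, pser2 = pd.Series([1, 2, 3]), pd.Series([4, 5, 6])
psser1, psser2 = ps.from_pandas(pser1), ps.from_pandas(pser2)

with ps.option_context("compute.ops_on_diff_frames", True):
    # the internal join may shuffle rows, so sort before comparing
    result = (psser1 + psser2).sort_index()
assert result.to_pandas().equals(pser1 + pser2)
```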

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Unit tests.

Closes #33274 from xinrong-databricks/datatypeops_testdiffframe.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit af81ad0d7e)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-09 12:42:58 +09:00
Xinrong Meng b0cd00b062 [SPARK-35340][PYTHON] Standardize TypeError messages for unsupported basic operations
### What changes were proposed in this pull request?
The PR is proposed to standardize TypeError messages for unsupported basic operations by:
- Capitalize the first letter
- Leverage TypeError messages defined in `pyspark/pandas/data_type_ops/base.py`
- Take advantage of the utility `is_valid_operand_for_numeric_arithmetic` to save duplicated TypeError messages

Related unit tests should be adjusted as well.

### Why are the changes needed?
Inconsistent TypeError messages are shown for unsupported data-type-based basic operations.

Take addition's TypeError messages for example:
- addition can not be applied to given types.
- string addition can only be applied to string series or literals.

Standardizing TypeError messages would improve user experience and reduce maintenance costs.

### Does this PR introduce _any_ user-facing change?
No user-facing behavior change. Only TypeError messages are modified.

### How was this patch tested?

Unit tests.

Closes #33237 from xinrong-databricks/datatypeops_err.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 819c482498)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-08 12:28:00 -07:00
Xinrong Meng 61bfdf0c03 [SPARK-35615][PYTHON] Make unary and comparison operators data-type-based
### What changes were proposed in this pull request?
Make unary and comparison operators data-type-based. Refactored operators include:
- Unary operators: `__neg__`, `__abs__`, `__invert__`,
- Comparison operators: `>`, `>=`, `<`, `<=`, `==`, `!=`

Non-goal: Tasks below are inspired during the development of this PR.
[[SPARK-35997] Implement comparison operators for CategoricalDtype in pandas API on Spark](https://issues.apache.org/jira/browse/SPARK-35997)
[[SPARK-36000] Support creating a ps.Series/Index with `Decimal('NaN')` with Arrow disabled](https://issues.apache.org/jira/browse/SPARK-36000)
[[SPARK-36001] Assume result's index to be disordered in tests with operations on different Series](https://issues.apache.org/jira/browse/SPARK-36001)
[[SPARK-36002] Consolidate tests for data-type-based operations of decimal Series](https://issues.apache.org/jira/browse/SPARK-36002)
[[SPARK-36003] Implement unary operator `invert` of numeric ps.Series/Index](https://issues.apache.org/jira/browse/SPARK-36003)

### Why are the changes needed?

We have been refactoring basic operators to be data-type-based for readability, flexibility, and extensibility.
Unary and comparison operators are still not data-type-based yet. We should fill the gaps.

### Does this PR introduce _any_ user-facing change?

Yes.

- Better error messages. For example,

Before:
```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([b"2", b"3", b"4"])
>>> -psser
Traceback (most recent call last):
...
pyspark.sql.utils.AnalysisException: cannot resolve '(- `0`)' due to data type mismatch: ...
```
After:
```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([b"2", b"3", b"4"])
>>> -psser
Traceback (most recent call last):
...
TypeError: Unary - can not be applied to binaries.
>>>
```
- Support unary `-` of `bool` Series. For example,

Before:
```py
>>> psser = ps.Series([True, False, True])
>>> -psser
Traceback (most recent call last):
...
pyspark.sql.utils.AnalysisException: cannot resolve '(- `0`)' due to data type mismatch: ...
```

After:
```py
>>> psser = ps.Series([True, False, True])
>>> -psser
0    False
1     True
2    False
dtype: bool
```

### How was this patch tested?

Unit tests.

Closes #33162 from xinrong-databricks/datatypeops_refactor.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 6e4e04f2a1)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-07 13:47:04 -07:00
Hyukjin Kwon 9cf1db33c7 [SPARK-35684][INFRA][PYTHON] Bump up mypy version in GitHub Actions
### What changes were proposed in this pull request?

This PR proposes to bump up the mypy version to 0.910 which is the latest.

### Why are the changes needed?

To catch the type hint mistakes better in PySpark.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

GitHub Actions should test it out.

Closes #33223 from HyukjinKwon/SPARK-35684.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 16c195ccfb)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-07 13:26:41 +09:00
Takuya UESHIN fcc9e66c9b [SPARK-35981][PYTHON][TEST][3.2] Use check_exact=False to loosen the check precision
### What changes were proposed in this pull request?

This is a cherry-pick of #33179.

We should use `check_exact=False` because the value check in `StatsTest.test_cov_corr_meta` is too strict.
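
For reference, the flag comes from pandas' own testing helpers, where it switches from bitwise equality to tolerance-based comparison:

```py
import pandas as pd
from pandas.testing import assert_frame_equal

left = pd.DataFrame({"a": [0.0]})
right = pd.DataFrame({"a": [4.807e-17]})

# exact comparison would fail; check_exact=False compares within default rtol/atol
assert_frame_equal(left, right, check_exact=False)
```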

### Why are the changes needed?

In some environments, the precision of pandas' `DataFrame.corr` can differ, and the test `StatsTest.test_cov_corr_meta` fails.

```
AssertionError: DataFrame.iloc[:, 0] (column name="a") are different
DataFrame.iloc[:, 0] (column name="a") values are different (14.28571 %)
[index]: [a, b, c, d, e, f, g]
[left]:  [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0]
[right]: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 4.807406715958909e-17]
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Modified tests should still pass.

Closes #33193 from ueshin/issuse/SPARK-35981/3.2/corr.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-02 14:08:50 -07:00
Xinrong Meng 95d94948c5 [SPARK-35339][PYTHON] Improve unit tests for data-type-based basic operations
### What changes were proposed in this pull request?

Improve unit tests for data-type-based basic operations by:
- removing redundant test cases
- adding `astype` test for ExtensionDtypes

### Why are the changes needed?

Some test cases for basic operations are duplicated after introducing data-type-based basic operations. The PR is proposed to remove redundant test cases.
`astype` is not tested for ExtensionDtypes, which will be adjusted in this PR as well.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests.

Closes #33095 from xinrong-databricks/datatypeops_test.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-01 17:37:32 -07:00
Takuya UESHIN a98c8ae57d [SPARK-35944][PYTHON] Introduce Name and Label type aliases
### What changes were proposed in this pull request?

Introduce `Name` and `Label` type aliases to distinguish what is expected instead of `Any` or `Union[Any, Tuple]`.

- `Label`: `Tuple[Any, ...]`
  Internal expression for name-like metadata, like `index_names`, `column_labels`, and `column_label_names` in `InternalFrame`, and similar internal structures.
- `Name`: `Union[Any, Label]`
  External expression for user-facing names, which can be scalar values or tuples.
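
A sketch of the aliases as described above:

```py
from typing import Any, Tuple, Union

Label = Tuple[Any, ...]   # internal, name-like metadata
Name = Union[Any, Label]  # user-facing name: a scalar value or a tuple
```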

### Why are the changes needed?

Currently `Any` or `Union[Any, Tuple]` is used for name-like types, but type aliases should be used to distinguish what is expected clearly.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33159 from ueshin/issues/SPARK-35944/name_and_label.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-01 09:40:07 +09:00
Takuya UESHIN 0a838dcd71 [SPARK-35943][PYTHON] Introduce Axis type alias
### What changes were proposed in this pull request?

Introduces `Axis` type alias for `axis` argument to be consistent.

### Why are the changes needed?

There are many places that use the `axis` argument. We should define an `Axis` type alias and reuse it for consistency.
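
The alias presumably resembles the following, since pandas-style axes are `0`/`1` or `"index"`/`"columns"`:

```py
from typing import Union

Axis = Union[int, str]
```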

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33144 from ueshin/issues/SPARK-35943/axis.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-30 10:46:59 +09:00
itholic 28a201a442 [SPARK-35873][PYTHON] Cleanup the version logic from the pandas API on Spark
### What changes were proposed in this pull request?

This PR proposes removing the legacy Koalas version from pandas API on Spark package.

It also removes the Python version check logic, since pandas-on-Spark now follows PySpark's Python version.

### Why are the changes needed?

Since Koalas is ported into PySpark, we don't need to keep the version logic for Koalas.

### Does this PR introduce _any_ user-facing change?

Now, legacy Koalas users should follow the version from PySpark.

### How was this patch tested?

Manually built the package and confirmed it completes successfully.

Closes #33128 from itholic/SPARK-35873.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-30 10:01:51 +09:00
Takuya UESHIN 1f6e2f55d7 Revert "[SPARK-35721][PYTHON] Path level discover for python unittests"
This reverts commit 5db51efa1a.
2021-06-29 12:08:09 -07:00
Takuya UESHIN 2702fb9af0 [SPARK-35859][PYTHON] Cleanup type hints in pandas-on-Spark
### What changes were proposed in this pull request?

Cleaning up the type hints in pandas-on-Spark.

- Use a single file `_typing.py` for type variables or aliases
- Rename `IndexOpsLike` to `SeriesOrIndex`.
- Rename `T_Frame` and `T_IndexOps` to `FrameLike` and `IndexOpsLike` respectively
- Introduce `DataFrameOrSeries` for `Union[DataFrame, Series]`

### Why are the changes needed?

This is a cleanup for the mypy check stuff series.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33117 from ueshin/issues/SPARK-35859/cleanup.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-29 10:52:24 -07:00
Yikun Jiang 5db51efa1a [SPARK-35721][PYTHON] Path level discover for python unittests
### What changes were proposed in this pull request?
Add path level discover for python unittests.

### Why are the changes needed?
Currently we need to specify Python test cases manually when adding a new test case. Sometimes we forget to add a test case to the module list, and then it is never executed.

Such as:
- pyspark-core pyspark.tests.test_pin_thread

Thus we need an auto-discovery mechanism to find all test cases, rather than specifying every case manually.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add below code in end of `dev/sparktestsupport/modules.py`
```python
for m in sorted(all_modules):
    for g in sorted(m.python_test_goals):
        print(m.name, g)
```
Compare the result before and after:
https://www.diffchecker.com/iO3FvhKL

Closes #32867 from Yikun/SPARK_DISCOVER_TEST.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-29 17:56:13 +09:00
Xinrong Meng 5f0113e3a6 [SPARK-35344][PYTHON] Support creating a Column of numpy literals in pandas API on Spark
### What changes were proposed in this pull request?

The PR is proposed to support creating a Column out of a numpy literal value in pandas-on-Spark. It mainly consists of three changes:
- Enable the `lit` function defined in `pyspark.pandas.spark.functions` to support numpy literals input.

```py
>>> from pyspark.pandas.spark import functions as SF
>>> SF.lit(np.int64(1))
Column<'CAST(1 AS BIGINT)'>
>>> SF.lit(np.int32(1))
Column<'CAST(1 AS INT)'>
>>> SF.lit(np.int8(1))
Column<'CAST(1 AS TINYINT)'>
>>> SF.lit(np.byte(1))
Column<'CAST(1 AS TINYINT)'>
>>> SF.lit(np.float32(1))
Column<'CAST(1.0 AS FLOAT)'>
```
- Substitute `F.lit` by `SF.lit`, that is, use `lit` function defined in `pyspark.pandas.spark.functions` rather than `lit` function defined in `pyspark.sql.functions` to allow creating columns out of numpy literals.
- Enable numpy literals input in `isin` method

Non-goal:
- Some pandas-on-Spark APIs use PySpark column-related APIs internally, and these column-related APIs don't support numpy literals, thus numpy literals are disallowed as input (e.g. `to_replace` parameter in `replace` API). This PR doesn't aim to adjust all of them. This PR adjusts `isin` only, because the PR is inspired by that (as https://github.com/databricks/koalas/issues/2161).
- To complete mappings between all kinds of numpy literals and Spark data types should be a followup task.

### Why are the changes needed?

Spark (`lit` function defined in `pyspark.sql.functions`) doesn't support creating a Column out of numpy literal value.
So the `lit` function defined in `pyspark.pandas.spark.functions` is adjusted to support that in pandas-on-Spark.

### Does this PR introduce _any_ user-facing change?

Yes.
Before:
```py
>>> a = ps.DataFrame({'source': [1,2,3,4,5]})
>>> a.source.isin([np.int64(1), np.int64(2)])
Traceback (most recent call last):
...
AttributeError: 'numpy.int64' object has no attribute '_get_object_id'
```

After:
```py
>>> a = ps.DataFrame({'source': [1,2,3,4,5]})
>>> a.source.isin([np.int64(1), np.int64(2)])
0     True
1     True
2    False
3    False
4    False
Name: source, dtype: bool
```

### How was this patch tested?

Unit tests.

Closes #32955 from xinrong-databricks/datatypeops_literal.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-28 19:03:42 -07:00
Takuya UESHIN 8c401beb80 [SPARK-35901][PYTHON] Refine type hints in pyspark.pandas.window
### What changes were proposed in this pull request?

Refines type hints in `pyspark.pandas.window`.

Also, some refactoring is included to clean up the type hierarchy of `Rolling` and `Expanding`.

### Why are the changes needed?

We can use more strict type hints for functions in pyspark.pandas.window using the generic way.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33097 from ueshin/issues/SPARK-35901/window.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-28 12:23:32 +09:00
itholic 03e6de2abe [SPARK-35605][PYTHON] Move to_pandas_on_spark to the Spark DataFrame
### What changes were proposed in this pull request?

This PR proposes moving the `to_pandas_on_spark` function from `pyspark.pandas.frame` to `pyspark.sql.dataframe`, and adds the related tests to the PySpark DataFrame tests.
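
A brief sketch of the resulting usage (assuming an active `spark` session):

```py
>>> sdf = spark.range(3)       # a PySpark DataFrame
>>> sdf.to_pandas_on_spark()   # now defined on pyspark.sql.DataFrame
   id
0   0
1   1
2   2
```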

### Why are the changes needed?

Now that Koalas is ported into PySpark, we don't need the Spark auto-patching anymore.
Also, having `to_pandas_on_spark` belong to the pandas-on-Spark DataFrame doesn't make sense.

### Does this PR introduce _any_ user-facing change?

No, it's kinda internal refactoring stuff.

### How was this patch tested?

Added the related tests and manually checked that they pass.

Closes #33054 from itholic/SPARK-35605.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-28 11:47:09 +09:00
Takuya UESHIN a9ebfc5374 [SPARK-35466][PYTHON] Fix disallow_untyped_defs mypy checks for pyspark.pandas.data_type_ops.*
### What changes were proposed in this pull request?

Adds more type annotations in the files `python/pyspark/pandas/data_type_ops/*.py` and fixes the mypy check failures.

### Why are the changes needed?

We should enable more disallow_untyped_defs mypy checks.

### Does this PR introduce _any_ user-facing change?

Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.

### How was this patch tested?

The mypy check with a new configuration and existing tests should pass.

Closes #33094 from ueshin/issues/SPARK-35466/disallow_untyped_defs_data_ops.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-25 18:16:25 -07:00