ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Kousuke Saruta	8b2b6bb0d3	[SPARK-36865][PYTHON][DOCS] Add PySpark API document of session_window ### What changes were proposed in this pull request? This PR adds PySpark API document of `session_window`. The docstring of the function doesn't comply with numpydoc format so this PR also fix it. Further, the API document of `window` doesn't have `Parameters` section so it's also added in this PR. ### Why are the changes needed? To provide PySpark users with the API document of the newly added function. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? `make html` in `python/docs` and get the following docs. [window] ![time-window-python-doc-after](https://user-images.githubusercontent.com/4736016/134963797-ce25b268-20ca-48e3-ac8d-cbcbd85ebb3e.png) [session_window] ![session-window-python-doc-after](https://user-images.githubusercontent.com/4736016/134963853-dd9d8417-139b-41ee-9924-14544b1a91af.png) Closes #34118 from sarutak/python-session-window-doc. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> (cherry picked from commit `5a32e41e9c`) Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>	2021-09-30 16:51:27 +09:00
Leona Yoda	b81e9741cd	[SPARK-36621][PYTHON][DOCS] Add Apache license headers to Pandas API on Spark documents ### What changes were proposed in this pull request? Apache license headers to Pandas API on Spark documents. ### Why are the changes needed? Pandas API on Spark document sources do not have license headers, while the other docs have. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? `make html` Closes #33871 from yoda-mon/add-license-header. Authored-by: Leona Yoda <yodal@oss.nttdata.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit `77fdf5f0e4`) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-09-02 12:35:51 +09:00
Gengliang Wang	5463caac0d	Revert "[SPARK-34415][ML] Randomization in hyperparameter optimization" ### What changes were proposed in this pull request? Revert `397b843890` and `5a48eb8d00` ### Why are the changes needed? As discussed in https://github.com/apache/spark/pull/33800#issuecomment-904140869, there is correctness issue in the current implementation. Let's revert the code changes from branch 3.2 and fix it on master branch later ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Ci tests Closes #33819 from gengliangwang/revert-SPARK-34415. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit `de932f51ce`) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-08-24 13:39:29 -07:00
Xinrong Meng	56c211bd6a	[SPARK-36470][PYTHON] Implement `CategoricalIndex.map` and `DatetimeIndex.map` Implement `CategoricalIndex.map` and `DatetimeIndex.map` `MultiIndex.map` cannot be implemented in the same way as the `map` of other indexes. It should be taken care of separately if necessary. Mapping values using input correspondence is a common operation that is supported in pandas. We shall support that as well. Yes. `CategoricalIndex.map` and `DatetimeIndex.map` can be used now. - CategoricalIndex.map ```py >>> idx = ps.CategoricalIndex(['a', 'b', 'c']) >>> idx CategoricalIndex(['a', 'b', 'c'], categories=['a', 'b', 'c'], ordered=False, dtype='category') >>> idx.map(lambda x: x.upper()) CategoricalIndex(['A', 'B', 'C'], categories=['A', 'B', 'C'], ordered=False, dtype='category') >>> pser = pd.Series([1, 2, 3], index=pd.CategoricalIndex(['a', 'b', 'c'], ordered=True)) >>> idx.map(pser) CategoricalIndex([1, 2, 3], categories=[1, 2, 3], ordered=True, dtype='category') >>> idx.map({'a': 'first', 'b': 'second', 'c': 'third'}) CategoricalIndex(['first', 'second', 'third'], categories=['first', 'second', 'third'], ordered=False, dtype='category') ``` - DatetimeIndex.map ```py >>> pidx = pd.date_range(start="2020-08-08", end="2020-08-10") >>> psidx = ps.from_pandas(pidx) >>> mapper_dict = { ... datetime.datetime(2020, 8, 8): datetime.datetime(2021, 8, 8), ... datetime.datetime(2020, 8, 9): datetime.datetime(2021, 8, 9), ... } >>> psidx.map(mapper_dict) DatetimeIndex(['2021-08-08', '2021-08-09', 'NaT'], dtype='datetime64[ns]', freq=None) >>> mapper_pser = pd.Series([1, 2, 3], index=pidx) >>> psidx.map(mapper_pser) Int64Index([1, 2, 3], dtype='int64') >>> psidx DatetimeIndex(['2020-08-08', '2020-08-09', '2020-08-10'], dtype='datetime64[ns]', freq=None) >>> psidx.map(lambda x: x.strftime("%B %d, %Y, %r")) Index(['August 08, 2020, 12:00:00 AM', 'August 09, 2020, 12:00:00 AM', 'August 10, 2020, 12:00:00 AM'], dtype='object') ``` Unit tests. 
Closes #33756 from xinrong-databricks/other_indexes_map. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit `0b6af464dc`) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-08-23 10:11:21 +09:00
Xinrong Meng	cb14a32005	[SPARK-36469][PYTHON] Implement Index.map ### What changes were proposed in this pull request? Implement `Index.map`. The PR is based on https://github.com/databricks/koalas/pull/2136. Thanks awdavidson for the prototype. `map` of CategoricalIndex and DatetimeIndex will be implemented in separate PRs. ### Why are the changes needed? Mapping values using input correspondence (a dict, Series, or function) is supported in pandas as [Index.map](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.map.html). We shall also support that. ### Does this PR introduce _any_ user-facing change? Yes. `Index.map` is available now. ```py >>> psidx = ps.Index([1, 2, 3]) >>> psidx.map({1: "one", 2: "two", 3: "three"}) Index(['one', 'two', 'three'], dtype='object') >>> psidx.map(lambda id: "{id} + 1".format(id=id)) Index(['1 + 1', '2 + 1', '3 + 1'], dtype='object') >>> pser = pd.Series(["one", "two", "three"], index=[1, 2, 3]) >>> psidx.map(pser) Index(['one', 'two', 'three'], dtype='object') ``` ### How was this patch tested? Unit tests. Closes #33694 from xinrong-databricks/index_map. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com> (cherry picked from commit `4dcd746025`) Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-08-16 11:06:23 -07:00
Takuya UESHIN	f278f771e6	[SPARK-36267][PYTHON] Clean up CategoricalAccessor and CategoricalIndex ### What changes were proposed in this pull request? Clean up `CategoricalAccessor` and `CategoricalIndex`. - Clean up the classes - Add deprecation warnings - Clean up the docs ### Why are the changes needed? To finalize the series of PRs for `CategoricalAccessor` and `CategoricalIndex`, we should clean up the classes. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #33528 from ueshin/issues/SPARK-36267/cleanup. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit `c40d9d46f1`) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-07-27 12:17:26 +09:00
Xinrong Meng	1641812e97	[SPARK-36260][PYTHON] Add set_categories to CategoricalAccessor and CategoricalIndex ### What changes were proposed in this pull request? Add set_categories to CategoricalAccessor and CategoricalIndex. ### Why are the changes needed? set_categories is supported in pandas CategoricalAccessor and CategoricalIndex. We ought to follow pandas. ### Does this PR introduce _any_ user-facing change? Yes, users will be able to use `set_categories`. ### How was this patch tested? Unit tests. Closes #33506 from xinrong-databricks/set_categories. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com> (cherry picked from commit `55971b70fe`) Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-07-26 17:12:45 -07:00
Takuya UESHIN	ab5224c45b	[SPARK-36264][PYTHON] Add reorder_categories to CategoricalAccessor and CategoricalIndex ### What changes were proposed in this pull request? Add `reorder_categories` to `CategoricalAccessor` and `CategoricalIndex`. ### Why are the changes needed? We should implement `reorder_categories` in `CategoricalAccessor` and `CategoricalIndex`. ### Does this PR introduce _any_ user-facing change? Yes, users will be able to use `reorder_categories`. ### How was this patch tested? Added some tests. Closes #33499 from ueshin/issues/SPARK-36264/reorder_categories. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com> (cherry picked from commit `e12bc4d31d`) Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-07-23 17:19:32 -07:00
Takuya UESHIN	4abc1d389e	[SPARK-36261][PYTHON] Add remove_unused_categories to CategoricalAccessor and CategoricalIndex ### What changes were proposed in this pull request? Add `remove_unused_categories` to `CategoricalAccessor` and `CategoricalIndex`. ### Why are the changes needed? We should implement `remove_unused_categories` in `CategoricalAccessor` and `CategoricalIndex`. ### Does this PR introduce _any_ user-facing change? Yes, users will be able to use `remove_unused_categories`. ### How was this patch tested? Added some tests. Closes #33485 from ueshin/issues/SPARK-36261/remove_unused_categories. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit `2fe12a7520`) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-07-23 14:05:09 +09:00
Xinrong Meng	37e5a10477	[SPARK-36248][PYTHON] Add rename_categories to CategoricalAccessor and CategoricalIndex ### What changes were proposed in this pull request? Add rename_categories to CategoricalAccessor and CategoricalIndex. ### Why are the changes needed? rename_categories is supported in pandas CategoricalAccessor and CategoricalIndex. We ought to follow pandas. ### Does this PR introduce _any_ user-facing change? Yes. `rename_categories` is supported in pandas API on Spark now. ```py # CategoricalIndex >>> psser = ps.CategoricalIndex(["a", "a", "b"]) >>> psser.rename_categories([0, 1]) CategoricalIndex([0, 0, 1], categories=[0, 1], ordered=False, dtype='category') >>> psser.rename_categories({'a': 'A', 'c': 'C'}) CategoricalIndex(['A', 'A', 'b'], categories=['A', 'b'], ordered=False, dtype='category') >>> psser.rename_categories(lambda x: x.upper()) CategoricalIndex(['A', 'A', 'B'], categories=['A', 'B'], ordered=False, dtype='category') # CategoricalAccessor >>> s = ps.Series(["a", "a", "b"], dtype="category") >>> s.cat.rename_categories([0, 1]) 0 0 1 0 2 1 dtype: category Categories (2, int64): [0, 1] >>> s.cat.rename_categories({'a': 'A', 'c': 'C'}) 0 A 1 A 2 b dtype: category Categories (2, object): ['A', 'b'] >>> s.cat.rename_categories(lambda x: x.upper()) 0 A 1 A 2 B dtype: category Categories (2, object): ['A', 'B'] ``` ### How was this patch tested? Unit tests. Closes #33471 from xinrong-databricks/category_rename_categories. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit `8b3d84bb7e`) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-07-23 12:26:35 +09:00
itholic	94479a23c1	[SPARK-36239][PYTHON][DOCS] Remove some APIs from documentation ### What changes were proposed in this pull request? This PR proposes removing some APIs from the pandas-on-Spark documentation. Because they can easily be worked around via Spark DataFrame or Column functions, they might be removed in the future. ### Why are the changes needed? Because we don't want to expose some functions as a public API. ### Does this PR introduce _any_ user-facing change? APIs such as `(Series\|Index).spark.data_type`, `(Series\|Index).spark.nullable`, `DataFrame.spark.schema`, `DataFrame.spark.print_schema`, `DataFrame.pandas_on_spark.attach_id_column`, `DataFrame.spark.checkpoint`, `DataFrame.spark.localcheckpoint` and `DataFrame.spark.explain` are removed from the documentation. ### How was this patch tested? Manually built the documents. Closes #33458 from itholic/SPARK-36239. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit `86471ad668`) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-07-22 19:46:49 +09:00
Takuya UESHIN	0e94e42cd3	[SPARK-36249][PYTHON] Add remove_categories to CategoricalAccessor and CategoricalIndex ### What changes were proposed in this pull request? Add `remove_categories` to `CategoricalAccessor` and `CategoricalIndex`. ### Why are the changes needed? We should implement `remove_categories` in `CategoricalAccessor` and `CategoricalIndex`. ### Does this PR introduce _any_ user-facing change? Yes, users will be able to use `remove_categories`. ### How was this patch tested? Added some tests. Closes #33474 from ueshin/issues/SPARK-36249/remove_categories. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit `a3c7ae18e2`) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-07-22 17:06:25 +09:00
Takuya UESHIN	f83a9ec2fd	[SPARK-36214][PYTHON] Add add_categories to CategoricalAccessor and CategoricalIndex ### What changes were proposed in this pull request? Add `add_categories` to `CategoricalAccessor` and `CategoricalIndex`. ### Why are the changes needed? We should implement `add_categories` in `CategoricalAccessor` and `CategoricalIndex`. ### Does this PR introduce _any_ user-facing change? Yes, users will be able to use `add_categories`. ### How was this patch tested? Added some tests. Closes #33470 from ueshin/issues/SPARK-36214/add_categories. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com> (cherry picked from commit `dcc0aaa3ef`) Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-07-21 22:34:15 -07:00
Takuya UESHIN	a3a13da26c	[SPARK-36186][PYTHON] Add as_ordered/as_unordered to CategoricalAccessor and CategoricalIndex ### What changes were proposed in this pull request? Add `as_ordered`/`as_unordered` to `CategoricalAccessor` and `CategoricalIndex`. ### Why are the changes needed? We should implement `as_ordered`/`as_unordered` in `CategoricalAccessor` and `CategoricalIndex`. ### Does this PR introduce _any_ user-facing change? Yes, users will be able to use `as_ordered`/`as_unordered`. ### How was this patch tested? Added some tests. Closes #33400 from ueshin/issues/SPARK-36186/as_ordered_unordered. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com> (cherry picked from commit `376fadc89c`) Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-07-20 18:24:09 -07:00
Dominik Gehl	e7a210e5ed	[SPARK-36178][PYTHON] List pyspark.sql.catalog APIs in documentation ### What changes were proposed in this pull request? The pyspark.sql.catalog APIs were missing from the documentation. PR fixes this omission. ### Why are the changes needed? Documentation consistency ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Documentation change only. Closes #33392 from dominikgehl/feature/SPARK-36178. Authored-by: Dominik Gehl <dog@open.ch> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit `fe4db74da4`) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-07-19 19:49:22 +09:00
itholic	03e6de2abe	[SPARK-35605][PYTHON] Move to_pandas_on_spark to the Spark DataFrame ### What changes were proposed in this pull request? This PR proposes moving the `to_pandas_on_spark` function from `pyspark.pandas.frame` to `pyspark.sql.dataframe`, and adds the related tests to the PySpark DataFrame tests. ### Why are the changes needed? Because Koalas is now ported into PySpark, we don't need to auto-patch Spark anymore. Also, it doesn't make sense for `to_pandas_on_spark` to belong to the pandas-on-Spark DataFrame. ### Does this PR introduce _any_ user-facing change? No, it's kinda internal refactoring stuff. ### How was this patch tested? Added the related tests and manually checked that they pass. Closes #33054 from itholic/SPARK-35605. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-28 11:47:09 +09:00
itholic	712ed87faa	[SPARK-35696][PYTHON][DOCS] Refine the code examples in pandas-on-Spark documentation ### What changes were proposed in this pull request? This PR proposes to refine the code examples for pandas-on-Spark since some of them still follow the naming for Koalas. For example, ```python kdf = ks.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}) ``` should be refined to ```python psdf = ps.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}) ``` Also fixed several remaining Koalas references in the FAQ. ### Why are the changes needed? Because we don't want to use the name "Koalas" in Apache Spark anymore. ### Does this PR introduce _any_ user-facing change? Yes, the examples in the documentation will be changed to use the refined names. ### How was this patch tested? Manually built the docs and checked them one by one. Closes #33017 from itholic/SPARK-35696. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-24 14:48:13 +09:00
HyukjinKwon	41af409b7b	[SPARK-35303][PYTHON] Enable pinned thread mode by default ### What changes were proposed in this pull request? PySpark added pinned thread mode at https://github.com/apache/spark/pull/24898 to sync Python thread to JVM thread. Previously, one JVM thread could be reused which ends up with messed inheritance hierarchy such as thread local especially when multiple jobs run in parallel. To completely fix this, we should enable this mode by default. ### Why are the changes needed? To correctly support parallel job submission and management. ### Does this PR introduce _any_ user-facing change? Yes, now Python thread is mapped to JVM thread one to one. ### How was this patch tested? Existing tests should cover it. Closes #32429 from HyukjinKwon/SPARK-35303. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-18 12:02:29 +09:00
Hyukjin Kwon	95f36e76c6	[SPARK-35750][PYTHON][DOCS] Rename "pandas APIs on Spark" to "pandas API on Spark" ### What changes were proposed in this pull request? This PR proposes to rename "pandas APIs on Spark" to "pandas API on Spark" which is more natural (since API stands for Application Program Interface). ### Why are the changes needed? To make it sound more natural. ### Does this PR introduce _any_ user-facing change? It fixes a typo in the unreleased changes. ### How was this patch tested? N/A Closes #32903 from HyukjinKwon/SPARK-34885. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-15 10:01:04 +09:00
itholic	ebe529e8e1	[SPARK-35591][PYTHON][DOCS] Rename "Koalas" to "pandas API on Spark" in the documents ### What changes were proposed in this pull request? This PR proposes to change the name "Koalas" to "Pandas APIs on Spark" in the documents. ### Why are the changes needed? Since we don't use the name "Koalas" anymore, we should use "Pandas APIs on Spark" instead. ### Does this PR introduce _any_ user-facing change? Yes, the name "Koalas" is renamed to "Pandas APIs on Spark" in the documents. ### How was this patch tested? Manually built the docs and checked one by one. Closes #32835 from itholic/SPARK-35591. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-11 20:42:38 +09:00
Hyukjin Kwon	921abc51cf	[SPARK-35636][PYTHON][DOCS][FOLLOW-UP] Restructure reference API files according to the layout ### What changes were proposed in this pull request? This PR proposes to restructure API files according to the layout, see https://github.com/apache/spark/pull/32799. Now the pandas APIs on Spark are under a separate directory which is same level as other modules such as Spark SQL. ```bash tree reference ``` Before: ``` reference ├── index.rst ├── ps_extensions.rst ├── ps_frame.rst ├── ps_general_functions.rst ├── ps_groupby.rst ├── ps_indexing.rst ├── ps_io.rst ├── ps_ml.rst ├── ps_series.rst ├── ps_window.rst ├── pyspark.ml.rst ├── pyspark.mllib.rst ├── pyspark.pandas.rst ├── pyspark.resource.rst ├── pyspark.rst ├── pyspark.sql.rst ├── pyspark.ss.rst └── pyspark.streaming.rst ``` After: ``` reference ├── index.rst ├── pyspark.ml.rst ├── pyspark.mllib.rst ├── pyspark.pandas │ ├── extensions.rst │ ├── frame.rst │ ├── general_functions.rst │ ├── groupby.rst │ ├── index.rst │ ├── indexing.rst │ ├── io.rst │ ├── ml.rst │ ├── series.rst │ └── window.rst ├── pyspark.resource.rst ├── pyspark.rst ├── pyspark.sql.rst ├── pyspark.ss.rst └── pyspark.streaming.rst ``` ### Why are the changes needed? To make the directory structure easier to follow. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually built and tested the docs. Closes #32812 from HyukjinKwon/SPARK-35646-followup. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-08 19:01:56 +09:00
Hyukjin Kwon	7ce7aa4758	[SPARK-35646][PYTHON][DOCS] Relocate pandas-on-Spark API references in documentation ### What changes were proposed in this pull request? This PR proposes to change from: ![Screen Shot 2021-06-07 at 1 40 47 PM](https://user-images.githubusercontent.com/6477701/120960027-fc302400-c795-11eb-96fb-73ac1d8277fe.png) to: ![Screen Shot 2021-06-07 at 1 41 19 PM](https://user-images.githubusercontent.com/6477701/120960074-0fdb8a80-c796-11eb-87ec-69a30692fdfe.png) ### Why are the changes needed? pandas APIs on Spark (pandas on Spark) is a package in PySpark in the end. So it has to be documented in the same level with other packages (e.g., Spark SQL). ### Does this PR introduce _any_ user-facing change? Yes, it changes the structure of the docs. To end users, no as it's only in development branch. ### How was this patch tested? Manually tested as above. Closes #32799 from HyukjinKwon/SPARK-35646. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-07 16:37:58 +09:00
Hyukjin Kwon	3d158f9c91	[SPARK-35587][PYTHON][DOCS] Initial porting of Koalas documentation ### What changes were proposed in this pull request? This PR proposes to port Koalas documentation to PySpark documentation as its initial step. It ports almost as is except these differences: - Renamed import from `databricks.koalas` to `pyspark.pandas`. - Renamed `to_koalas` -> `to_pandas_on_spark` - Renamed `(Series\|DataFrame).koalas` -> `(Series\|DataFrame).pandas_on_spark` - Added a `ps_` prefix in the RST file names of Koalas documentation Other than that, - Excluded `python/docs/build/html` in linter - Fixed GA dependency installation ### Why are the changes needed? To document pandas APIs on Spark. ### Does this PR introduce _any_ user-facing change? Yes, it adds new documentation. ### How was this patch tested? Manually built the docs and checked the output. Closes #32726 from HyukjinKwon/SPARK-35587. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-04 11:11:09 +09:00
Kousuke Saruta	9283bebbbd	[SPARK-35418][SQL] Add sentences function to functions.{scala,py} ### What changes were proposed in this pull request? This PR adds `sentences`, a string function, which is present as of `2.0.0` but missing in `functions.{scala,py}`. ### Why are the changes needed? This function can be only used from SQL for now. It's good if we can use this function from Scala/Python code as well as SQL. ### Does this PR introduce _any_ user-facing change? Yes. Users can use this function from Scala and Python. ### How was this patch tested? New test. Closes #32566 from sarutak/sentences-function. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>	2021-05-19 20:07:28 +09:00
Richard Penney	7d0743b493	[SPARK-33678][SQL] Product aggregation function ### Why is this change being proposed? This patch adds support for a new "product" aggregation function in `sql.functions` which multiplies-together all values in an aggregation group. This is likely to be useful in statistical applications which involve combining probabilities, or financial applications that involve combining cumulative interest rates, but is also a versatile mathematical operation of similar status to `sum` or `stddev`. Other users [have noted](https://stackoverflow.com/questions/52991640/cumulative-product-in-spark) the absence of such a function in current releases of Spark. This function is both much more concise than an expression of the form `exp(sum(log(...)))`, and avoids awkward edge-cases associated with some values being zero or negative, as well as being less computationally costly. ### Does this PR introduce _any_ user-facing change? No - only adds new function. ### How was this patch tested? Built-in tests have been added for the new `catalyst.expressions.aggregate.Product` class and its invocation via the (scala) `sql.functions.product` function. The latter, and the PySpark wrapper have also been manually tested in spark-shell and pyspark sessions. The SparkR wrapper is currently untested, and may need separate validation (I'm not an "R" user myself). 
An illustration of the new functionality, within PySpark is as follows: ``` import pyspark.sql.functions as pf, pyspark.sql.window as pw df = sqlContext.range(1, 17).toDF("x") win = pw.Window.partitionBy(pf.lit(1)).orderBy(pf.col("x")) df.withColumn("factorial", pf.product("x").over(win)).show(20, False) +---+---------------+ \|x \|factorial \| +---+---------------+ \|1 \|1.0 \| \|2 \|2.0 \| \|3 \|6.0 \| \|4 \|24.0 \| \|5 \|120.0 \| \|6 \|720.0 \| \|7 \|5040.0 \| \|8 \|40320.0 \| \|9 \|362880.0 \| \|10 \|3628800.0 \| \|11 \|3.99168E7 \| \|12 \|4.790016E8 \| \|13 \|6.2270208E9 \| \|14 \|8.71782912E10 \| \|15 \|1.307674368E12 \| \|16 \|2.0922789888E13\| +---+---------------+ ``` Closes #30745 from rwpenney/feature/agg-product. Lead-authored-by: Richard Penney <rwp@rwpenney.uk> Co-authored-by: Richard Penney <rwpenney@users.noreply.github.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-03-02 16:51:07 +09:00
Phillip Henry	397b843890	[SPARK-34415][ML] Randomization in hyperparameter optimization ### What changes were proposed in this pull request? Code in the PR generates random parameters for hyperparameter tuning. A discussion with Sean Owen can be found on the dev mailing list here: http://apache-spark-developers-list.1001551.n3.nabble.com/Hyperparameter-Optimization-via-Randomization-td30629.html All code is entirely my own work and I license the work to the project under the project’s open source license. ### Why are the changes needed? Randomization can be a more effective technique than a grid search since min/max points can fall between the grid and never be found. Randomisation is not so restricted although the probability of finding minima/maxima is dependent on the number of attempts. Alice Zheng has an accessible description on how this technique works at https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html Although there are Python libraries with more sophisticated techniques, not every Spark developer is using Python. ### Does this PR introduce _any_ user-facing change? A new class (`ParamRandomBuilder.scala`) and its tests have been created but there is no change to existing code. This class offers an alternative to `ParamGridBuilder` and can be dropped into the code wherever `ParamGridBuilder` appears. Indeed, it extends `ParamGridBuilder` and is completely compatible with its interface. It merely adds one method that provides a range over which a hyperparameter will be randomly defined. ### How was this patch tested? Tests `ParamRandomBuilderSuite.scala` and `RandomRangesSuite.scala` were added. `ParamRandomBuilderSuite` is the analogue of the already existing `ParamGridBuilderSuite` which tests the user-facing interface. `RandomRangesSuite` uses ScalaCheck to test the random ranges over which hyperparameters are distributed. Closes #31535 from PhillHenry/ParamRandomBuilder. 
Authored-by: Phillip Henry <PhillHenry@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-02-27 08:34:39 -06:00
Eric Lemmon	e3b6e4ad43	[SPARK-33434][PYTHON][DOCS] Added RuntimeConfig to PySpark docs ### What changes were proposed in this pull request? Documentation for `SparkSession.conf.isModifiable` is missing from the Python API site, so we added a Configuration section to the Spark SQL page to expose docs for the `RuntimeConfig` class (the class containing `isModifiable`). Then a `:class:` reference to `RuntimeConfig` was added to the `SparkSession.conf` docstring to create a link there as well. ### Why are the changes needed? No docs were generated for `pyspark.sql.conf.RuntimeConfig`. ### Does this PR introduce _any_ user-facing change? Yes--a new Configuration section to the Spark SQL page and a `Returns` section of the `SparkSession.conf` docstring, so this will now show a link to the `pyspark.sql.conf.RuntimeConfig` page. This is a change compared to both the released Spark version and the unreleased master branch. ### How was this patch tested? First built the Python docs: ```bash cd $SPARK_HOME/docs SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve ``` Then verified all pages and links: 1. Configuration link displayed on the API Reference page, and it clicks through to Spark SQL page: http://localhost:4000/api/python/reference/index.html ![image](https://user-images.githubusercontent.com/1160861/107601918-a2f02380-6bed-11eb-9b8f-974a0681a2a9.png) 2. Configuration section displayed on the Spark SQL page, and the RuntimeConfig link clicks through to the RuntimeConfig page: http://localhost:4000/api/python/reference/pyspark.sql.html#configuration ![image](https://user-images.githubusercontent.com/1160861/107602058-0d08c880-6bee-11eb-8cbb-ad8c47588085.png)** 3. RuntimeConfig page displayed: http://localhost:4000/api/python/reference/api/pyspark.sql.conf.RuntimeConfig.html ![image](https://user-images.githubusercontent.com/1160861/107602278-94eed280-6bee-11eb-95fc-445ea62ac1a4.png) 4. 
SparkSession.conf page displays the RuntimeConfig link, and it navigates to the RuntimeConfig page: http://localhost:4000/api/python/reference/api/pyspark.sql.SparkSession.conf.html ![image](https://user-images.githubusercontent.com/1160861/107602435-1f373680-6bef-11eb-985a-b72432464940.png) Closes #31483 from Eric-Lemmon/SPARK-33434-document-isModifiable. Authored-by: Eric Lemmon <eric@lemmon.cc> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-02-13 09:32:55 -06:00
HyukjinKwon	30468a9015	[SPARK-34306][SQL][PYTHON][R] Use Snake naming rule across the function APIs ### What changes were proposed in this pull request? This PR completes snake_case rule at functions APIs across the languages, see also SPARK-10621. In more details, this PR: - Adds `count_distinct` in Scala Python, and R, and document that `count_distinct` is encouraged. This was not deprecated because `countDistinct` is pretty commonly used. We could deprecate in the future releases. - (Scala-specific) adds `typedlit` but doesn't deprecate `typedLit` which is arguably commonly used. Likewise, we could deprecate in the future releases. - Deprecates and renames: - `sumDistinct` -> `sum_distinct` - `bitwiseNOT` -> `bitwise_not` - `shiftLeft` -> `shiftleft` (matched with SQL name in `FunctionRegistry`) - `shiftRight` -> `shiftright` (matched with SQL name in `FunctionRegistry`) - `shiftRightUnsigned` -> `shiftrightunsigned` (matched with SQL name in `FunctionRegistry`) - (Scala-specific) `callUDF` -> `call_udf` ### Why are the changes needed? To keep the consistent naming in APIs. ### Does this PR introduce _any_ user-facing change? Yes, it deprecates some APIs and add new renamed APIs as described above. ### How was this patch tested? Unittests were added. Closes #31408 from HyukjinKwon/SPARK-34306. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-02-02 09:29:40 +09:00
Huaxin Gao	f3548837c6	[SPARK-34080][ML][PYTHON] Add UnivariateFeatureSelector ### What changes were proposed in this pull request? Add UnivariateFeatureSelector ### Why are the changes needed? Have one UnivariateFeatureSelector, so we don't need to have three Feature Selectors. ### Does this PR introduce _any_ user-facing change? Yes ``` selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], labelCol=["target"], featureType="categorical", labelType="continuous", selectorType="numTopFeatures", numTopFeatures=100) ``` Or numTopFeatures ``` selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], labelCol=["target"], scoreFunction="f_classif", selectorType="numTopFeatures", numTopFeatures=100) ``` ### How was this patch tested? Add Unit test Closes #31160 from huaxingao/UnivariateSelector. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Weichen Xu <weichen.xu@databricks.com>	2021-01-16 11:09:23 +08:00
HyukjinKwon	aa388cf3d0	[SPARK-34041][PYTHON][DOCS] Miscellaneous cleanup for new PySpark documentation ### What changes were proposed in this pull request? This PR proposes to: - Add a link of quick start in PySpark docs into "Programming Guides" in Spark main docs - `ML` / `MLlib` -> `MLlib (DataFrame-based)` / `MLlib (RDD-based)` in API reference page - Mention other user guides as well because the guide such as [ML](http://spark.apache.org/docs/latest/ml-guide.html) and [SQL](http://spark.apache.org/docs/latest/sql-programming-guide.html). - Mention other migration guides as well because PySpark can get affected by it. ### Why are the changes needed? For better documentation. ### Does this PR introduce _any_ user-facing change? It fixes user-facing docs. However, it's not released out yet. ### How was this patch tested? Manually tested by running: ```bash cd docs SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve --watch ``` Closes #31082 from HyukjinKwon/SPARK-34041. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-08 09:28:31 +09:00
Weichen Xu	596fbc1d29	[SPARK-33556][ML] Add array_to_vector function for dataframe column ### What changes were proposed in this pull request? Add array_to_vector function for dataframe column ### Why are the changes needed? Utility function for array to vector conversion. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? scala unit test & doctest. Closes #30498 from WeichenXu123/array_to_vec. Lead-authored-by: Weichen Xu <weichen.xu@databricks.com> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-01 09:52:19 +09:00
zero323	d082ad0abf	[SPARK-33563][PYTHON][R][SQL] Expose inverse hyperbolic trig functions in PySpark and SparkR ### What changes were proposed in this pull request? This PR adds the following functions (introduced in Scala API with SPARK-33061): - `acosh` - `asinh` - `atanh` to Python and R. ### Why are the changes needed? Feature parity. ### Does this PR introduce _any_ user-facing change? New functions. ### How was this patch tested? New unit tests. Closes #30501 from zero323/SPARK-33563. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-27 11:00:09 +09:00
zero323	01321bc0fe	[SPARK-33252][PYTHON][DOCS] Migration to NumPy documentation style in MLlib (pyspark.mllib.*) ### What changes were proposed in this pull request? This PR proposes migration of `pyspark.mllib` to NumPy documentation style. ### Why are the changes needed? To improve documentation style. Before: ![old](https://user-images.githubusercontent.com/1554276/100097941-90234980-2e5d-11eb-8b4d-c25d98d85191.png) After: ![new](https://user-images.githubusercontent.com/1554276/100097966-987b8480-2e5d-11eb-9e02-07b18c327624.png) ### Does this PR introduce _any_ user-facing change? Yes, this changes both rendered HTML docs and console representation (SPARK-33243). ### How was this patch tested? `dev/lint-python` and manual inspection. Closes #30413 from zero323/SPARK-33252. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-25 10:24:41 +09:00
zero323	52073ef8ac	[SPARK-33254][PYTHON][DOCS] Migration to NumPy documentation style in Core (pyspark., pyspark.resource., etc.) ### What changes were proposed in this pull request? This PR proposes migration of Core to NumPy documentation style. ### Why are the changes needed? To improve documentation style. ### Does this PR introduce _any_ user-facing change? Yes, this changes both rendered HTML docs and console representation (SPARK-33243). ### How was this patch tested? dev/lint-python and manual inspection. Closes #30320 from zero323/SPARK-33254. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-16 10:21:50 +09:00
HyukjinKwon	9818f079aa	[SPARK-33243][PYTHON][BUILD] Add numpydoc into documentation dependency ### What changes were proposed in this pull request? This PR proposes to initiate the migration to NumPy documentation style (from reST style) in PySpark docstrings. This PR also adds one migration example of `SparkContext`. - Before: ... ![Screen Shot 2020-10-26 at 7 02 05 PM](https://user-images.githubusercontent.com/6477701/97161090-a8ea0200-17c0-11eb-8204-0e70d18fc571.png) ... ![Screen Shot 2020-10-26 at 7 02 09 PM](https://user-images.githubusercontent.com/6477701/97161100-aab3c580-17c0-11eb-92ad-f5ad4441ce16.png) ... - After: ... ![Screen Shot 2020-10-26 at 7 24 08 PM](https://user-images.githubusercontent.com/6477701/97161219-d636b000-17c0-11eb-80ab-d17a570ecb4b.png) ... See also https://numpydoc.readthedocs.io/en/latest/format.html ### Why are the changes needed? There are many reasons for switching to NumPy documentation style. 1. Arguably reST style doesn't fit well when the docstring grows large because it provides (arguably) less structures and syntax. 2. NumPy documentation style provides a better human readable docstring format. For example, notebook users often just do `help(...)` by `pydoc`. 3. NumPy documentation style is pretty commonly used in data science libraries, for example, pandas, numpy, Dask, Koalas, matplotlib, ... Using NumPy documentation style can give users a consistent documentation style. ### Does this PR introduce _any_ user-facing change? The dependency itself doesn't change anything user-facing. The documentation change in `SparkContext` does, as shown above. ### How was this patch tested? Manually tested via running `cd python` and `make clean html`. Closes #30149 from HyukjinKwon/SPARK-33243. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-27 14:03:57 +09:00
HyukjinKwon	7cdc921bc0	[SPARK-32188][PYTHON][DOCS][FOLLOW-UP] Document Column APIs in API reference ### What changes were proposed in this pull request? This PR proposes to document the APIs in `Column` as well in API reference of PySpark documentation. ### Why are the changes needed? To document common APIs in PySpark. ### Does this PR introduce _any_ user-facing change? Yes, `Column.*` will be shown in API reference page. ### How was this patch tested? Manually tested via `cd python` and `make clean html`. Closes #30150 from HyukjinKwon/SPARK-32188. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-27 09:52:09 +09:00
Karen Feng	39510b0e9b	[SPARK-32793][SQL] Add raise_error function, adds error message parameter to assert_true ## What changes were proposed in this pull request? Adds a SQL function `raise_error` which underlies the refactored `assert_true` function. `assert_true` now also (optionally) accepts a custom error message field. `raise_error` is exposed in SQL, Python, Scala, and R. `assert_true` was previously only exposed in SQL; it is now also exposed in Python, Scala, and R. ### Why are the changes needed? Improves usability of `assert_true` by clarifying error messaging, and adds the useful helper function `raise_error`. ### Does this PR introduce _any_ user-facing change? Yes: - Adds `raise_error` function to the SQL, Python, Scala, and R APIs. - Adds `assert_true` function to the SQL, Python and R APIs. ### How was this patch tested? Adds unit tests in SQL, Python, Scala, and R for `assert_true` and `raise_error`. Closes #29947 from karenfeng/spark-32793. Lead-authored-by: Karen Feng <karen.feng@databricks.com> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-08 12:05:39 +09:00
HyukjinKwon	5ce321dc80	[SPARK-33017][PYTHON][DOCS][FOLLOW-UP] Add getCheckpointDir into API documentation ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/29918. We should add it into the documentation as well. ### Why are the changes needed? To show users new APIs. ### Does this PR introduce _any_ user-facing change? Yes, `SparkContext.getCheckpointDir` will be documented. ### How was this patch tested? Manually built the PySpark documentation: ```bash cd python/docs make clean html cd build/html open index.html ``` Closes #29960 from HyukjinKwon/SPARK-33017. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-10-07 13:00:59 +09:00
HyukjinKwon	6868b40517	[SPARK-33020][PYTHON] Add nth_value as a PySpark function ### What changes were proposed in this pull request? `nth_value` was added at SPARK-27951. This PR adds the corresponding PySpark API. ### Why are the changes needed? To support the consistent APIs ### Does this PR introduce _any_ user-facing change? Yes, it introduces a new PySpark function API. ### How was this patch tested? Unittest was added. Closes #29899 from HyukjinKwon/SPARK-33020. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-09-28 22:14:28 -07:00
HyukjinKwon	c154629171	[SPARK-32183][DOCS][PYTHON] User Guide - PySpark Usage Guide for Pandas with Apache Arrow ### What changes were proposed in this pull request? This PR proposes to move Arrow usage guide from Spark documentation site to PySpark documentation site (at "User Guide"). Here is the demo for reviewing quicker: https://hyukjin-spark.readthedocs.io/en/stable/user_guide/arrow_pandas.html ### Why are the changes needed? To have a single place for PySpark users, and better documentation. ### Does this PR introduce _any_ user-facing change? Yes, it will move https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html to our PySpark documentation. ### How was this patch tested? ```bash cd docs SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve --watch ``` and ```bash cd python/docs make clean html ``` Closes #29548 from HyukjinKwon/SPARK-32183. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-08-28 15:09:06 +09:00
Sean Owen	891c5e661a	[MINOR][DOCS] Add KMeansSummary and InheritableThread to documentation ### What changes were proposed in this pull request? The class `KMeansSummary` in pyspark is not included in `clustering.py`'s `__all__` declaration. It isn't included in the docs as a result. `InheritableThread` and `KMeansSummary` should be added into the corresponding RST files for documentation. ### Why are the changes needed? It seems like an oversight to not include this as all similar "summary" classes are. `InheritableThread` should also be documented. ### Does this PR introduce _any_ user-facing change? I don't believe there are functional changes. It should make this public class appear in docs. ### How was this patch tested? Existing tests / N/A. Closes #29470 from srowen/KMeansSummary. Lead-authored-by: Sean Owen <srowen@gmail.com> Co-authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-08-19 14:30:07 +09:00
Dongjoon Hyun	b421bf0196	[SPARK-32517][CORE] Add StorageLevel.DISK_ONLY_3 ### What changes were proposed in this pull request? This PR aims to add `StorageLevel.DISK_ONLY_3` as a built-in `StorageLevel`. ### Why are the changes needed? In a YARN cluster, HDFS usually provides storage with replication factor 3. So, we can save the result to HDFS to get `StorageLevel.DISK_ONLY_3` technically. However, disaggregate clusters or clusters without storage services are rising. Previously, in that situation, the users were able to use similar `MEMORY_AND_DISK_2` or a user-created `StorageLevel`. This PR aims to support those use cases officially for better UX. ### Does this PR introduce _any_ user-facing change? Yes. This provides a new built-in option. ### How was this patch tested? Pass the GitHub Action or Jenkins with the revised test cases. Closes #29331 from dongjoon-hyun/SPARK-32517. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-08-10 07:33:06 -07:00
Huaxin Gao	40e6a5bbb0	[SPARK-32449][ML][PYSPARK] Add summary to MultilayerPerceptronClassificationModel ### What changes were proposed in this pull request? Add training summary to MultilayerPerceptronClassificationModel... ### Why are the changes needed? so that user can get the training process status, such as loss value of each iteration and total iteration number. ### Does this PR introduce _any_ user-facing change? Yes MultilayerPerceptronClassificationModel.summary MultilayerPerceptronClassificationModel.evaluate ### How was this patch tested? new tests Closes #29250 from huaxingao/mlp_summary. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-07-29 09:58:25 -05:00
HyukjinKwon	6ab29b37cf	[SPARK-32179][SPARK-32188][PYTHON][DOCS] Replace and redesign the documentation base ### What changes were proposed in this pull request? This PR proposes to redesign the PySpark documentation. I made a demo site to make it easier to review: https://hyukjin-spark.readthedocs.io/en/stable/reference/index.html. Here is the initial draft for the final PySpark docs shape: https://hyukjin-spark.readthedocs.io/en/latest/index.html. In more detail, this PR proposes: 1. Use [pydata_sphinx_theme](https://github.com/pandas-dev/pydata-sphinx-theme) theme - [pandas](https://pandas.pydata.org/docs/) and [Koalas](https://koalas.readthedocs.io/en/latest/) use this theme. The CSS overwrite is ported from Koalas. The colours in the CSS were actually chosen by designers to use in Spark. 2. Use the Sphinx option to separate `source` and `build` directories as the documentation pages will likely grow. 3. Port current API documentation into the new style. It mimics Koalas and pandas to use the theme most effectively. One disadvantage of this approach is that you should list up APIs or classes; however, I think this isn't a big issue in PySpark since we're being conservative on adding APIs. I also intentionally listed classes only instead of functions in ML and MLlib to make it relatively easier to manage. ### Why are the changes needed? Often I hear the complaints, from the users, that current PySpark documentation is pretty messy to read - https://spark.apache.org/docs/latest/api/python/index.html compared to other projects such as [pandas](https://pandas.pydata.org/docs/) and [Koalas](https://koalas.readthedocs.io/en/latest/). It would be nicer if we can make it more organised instead of just listing all classes, methods and attributes to make it easier to navigate. Also, the documentation has been there from almost the very first version of PySpark. Maybe it's time to update it. ### Does this PR introduce _any_ user-facing change? Yes, PySpark API documentation will be redesigned. ### How was this patch tested? Manually tested, and the demo site was made to show. Closes #29188 from HyukjinKwon/SPARK-32179. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-07-27 17:49:21 +09:00

44 commits