### What changes were proposed in this pull request?
This PR proposes to reorder the items in User Guide in PySpark documentation in order to place general guides first and advance ones later.
### Why are the changes needed?
For users to more easily follow.
### Does this PR introduce _any_ user-facing change?
Yes, it changes the order in the items in documentation .
### How was this patch tested?
Manually verified the documentation after building:
<img width="768" alt="Screen Shot 2021-03-22 at 2 38 41 PM" src="https://user-images.githubusercontent.com/6477701/111945072-5537d680-8b1c-11eb-9f43-02f3ad63a509.png">
FWIW, the current page: https://spark.apache.org/docs/latest/api/python/user_guide/index.htmlCloses#31922 from HyukjinKwon/SPARK-34818.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR upgrade Py4J from 0.10.9.1 to 0.10.9.2 that contains some bug fixes and improvements.
* expose shell parameter in Popen inside launch_gateway. ([bartdag/py4j220efc3](220efc3716))
* fixed Flake8 errors ([bartdag/py4j6c6ee9a](6c6ee9aedc))
### Why are the changes needed?
To leverage fixes from the upstream in Py4J.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Jenkins build and GitHub Actions will test it out.
Closes#31796 from wankunde/py4j.
Authored-by: wankunde <wankunde@163.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### Why is this change being proposed?
This patch adds support for a new "product" aggregation function in `sql.functions` which multiplies-together all values in an aggregation group.
This is likely to be useful in statistical applications which involve combining probabilities, or financial applications that involve combining cumulative interest rates, but is also a versatile mathematical operation of similar status to `sum` or `stddev`. Other users [have noted](https://stackoverflow.com/questions/52991640/cumulative-product-in-spark) the absence of such a function in current releases of Spark.
This function is both much more concise than an expression of the form `exp(sum(log(...)))`, and avoids awkward edge-cases associated with some values being zero or negative, as well as being less computationally costly.
### Does this PR introduce _any_ user-facing change?
No - only adds new function.
### How was this patch tested?
Built-in tests have been added for the new `catalyst.expressions.aggregate.Product` class and its invocation via the (scala) `sql.functions.product` function. The latter, and the PySpark wrapper have also been manually tested in spark-shell and pyspark sessions. The SparkR wrapper is currently untested, and may need separate validation (I'm not an "R" user myself).
An illustration of the new functionality, within PySpark is as follows:
```
import pyspark.sql.functions as pf, pyspark.sql.window as pw
df = sqlContext.range(1, 17).toDF("x")
win = pw.Window.partitionBy(pf.lit(1)).orderBy(pf.col("x"))
df.withColumn("factorial", pf.product("x").over(win)).show(20, False)
+---+---------------+
|x |factorial |
+---+---------------+
|1 |1.0 |
|2 |2.0 |
|3 |6.0 |
|4 |24.0 |
|5 |120.0 |
|6 |720.0 |
|7 |5040.0 |
|8 |40320.0 |
|9 |362880.0 |
|10 |3628800.0 |
|11 |3.99168E7 |
|12 |4.790016E8 |
|13 |6.2270208E9 |
|14 |8.71782912E10 |
|15 |1.307674368E12 |
|16 |2.0922789888E13|
+---+---------------+
```
Closes#30745 from rwpenney/feature/agg-product.
Lead-authored-by: Richard Penney <rwp@rwpenney.uk>
Co-authored-by: Richard Penney <rwpenney@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Code in the PR generates random parameters for hyperparameter tuning. A discussion with Sean Owen can be found on the dev mailing list here:
http://apache-spark-developers-list.1001551.n3.nabble.com/Hyperparameter-Optimization-via-Randomization-td30629.html
All code is entirely my own work and I license the work to the project under the project’s open source license.
### Why are the changes needed?
Randomization can be a more effective techinique than a grid search since min/max points can fall between the grid and never be found. Randomisation is not so restricted although the probability of finding minima/maxima is dependent on the number of attempts.
Alice Zheng has an accessible description on how this technique works at https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html
Although there are Python libraries with more sophisticated techniques, not every Spark developer is using Python.
### Does this PR introduce _any_ user-facing change?
A new class (`ParamRandomBuilder.scala`) and its tests have been created but there is no change to existing code. This class offers an alternative to `ParamGridBuilder` and can be dropped into the code wherever `ParamGridBuilder` appears. Indeed, it extends `ParamGridBuilder` and is completely compatible with its interface. It merely adds one method that provides a range over which a hyperparameter will be randomly defined.
### How was this patch tested?
Tests `ParamRandomBuilderSuite.scala` and `RandomRangesSuite.scala` were added.
`ParamRandomBuilderSuite` is the analogue of the already existing `ParamGridBuilderSuite` which tests the user-facing interface.
`RandomRangesSuite` uses ScalaCheck to test the random ranges over which hyperparameters are distributed.
Closes#31535 from PhillHenry/ParamRandomBuilder.
Authored-by: Phillip Henry <PhillHenry@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR proposes:
- Change http to https for better security
- Change http://apache-spark-developers-list.1001551.n3.nabble.com/ to official mailing list link (https://mail-archives.apache.org/mod_mbox/spark-dev/)
### Why are the changes needed?
For better security, and to use official link.
### Does this PR introduce _any_ user-facing change?
Yes, It exposes more secure and correct links to the PySpark end users in PySpark documentation.
### How was this patch tested?
I manually checked if each link works
Closes#31616 from HyukjinKwon/minor-https.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Improving the documentation and release process by pinning Jekyll version by Gemfile and Bundler.
Some files and their responsibilities within this PR:
- `docs/.bundle/config` is used to specify a directory "docs/.local_ruby_bundle" which will be used as destination to install the ruby packages into instead of the global one which requires root access
- `docs/Gemfile` is specifying the required Jekyll version and other top level gem versions
- `docs/Gemfile.lock` is generated by the "bundle install". This file contains the exact resolved versions of all the gems including the top level gems and all the direct and transitive dependencies of those gems. When this file is generated it contains a platform related section "PLATFORMS" (in my case after the generation it was "universal-darwin-19"). Still this file must be under version control as when the version of a gem does not fit to the one specified in `Gemfile` an error comes (i.e. if the `Gemfile.lock` was generated for Jekyll 4.1.0 and its version is updated in the `Gemfile` to 4.2.0 then it triggers the error: "The bundle currently has jekyll locked at 4.1.0."). This is solution is also suggested officially in [its documentation](https://bundler.io/rationale.html#checking-your-code-into-version-control). To get rid of the specific platform (like "universal-darwin-19") first we have to add "ruby" as platform [which means this should work on every platform where Ruby runs](https://guides.rubygems.org/what-is-a-gem/)) by running "bundle lock --add-platform ruby" then the specific platform can be removed by "bundle lock --remove-platform universal-darwin-19".
After this the correct process to update Jekyll version is the following:
1. update the version in `Gemfile`
2. run "bundle update" which updates the `Gemfile.lock`
3. commit both files
This process for version update is tested for details please check the testing section.
### Why are the changes needed?
Using different Jekyll versions can generate different output documents.
This PR standardize the process.
### Does this PR introduce _any_ user-facing change?
No, assuming the release was done via docker by using `do-release-docker.sh`.
In that case there should be no difference at all as the same Jekyll version is specified in the Gemfile.
### How was this patch tested?
#### Testing document generation
Doc generation step was triggered via the docker release:
```
$ ./do-release-docker.sh -d ~/working -n -s docs
...
========================
= Building documentation...
Command: /opt/spark-rm/release-build.sh docs
Log file: docs.log
Skipping publish step.
```
The docs.log contains the followings:
```
Building Spark docs
Fetching gem metadata from https://rubygems.org/.........
Using bundler 2.2.9
Fetching rb-fsevent 0.10.4
Fetching forwardable-extended 2.6.0
Fetching public_suffix 4.0.6
Fetching colorator 1.1.0
Fetching eventmachine 1.2.7
Fetching http_parser.rb 0.6.0
Fetching ffi 1.14.2
Fetching concurrent-ruby 1.1.8
Installing colorator 1.1.0
Installing forwardable-extended 2.6.0
Installing rb-fsevent 0.10.4
Installing public_suffix 4.0.6
Installing http_parser.rb 0.6.0 with native extensions
Installing eventmachine 1.2.7 with native extensions
Installing concurrent-ruby 1.1.8
Fetching rexml 3.2.4
Fetching liquid 4.0.3
Installing ffi 1.14.2 with native extensions
Installing rexml 3.2.4
Installing liquid 4.0.3
Fetching mercenary 0.4.0
Installing mercenary 0.4.0
Fetching rouge 3.26.0
Installing rouge 3.26.0
Fetching safe_yaml 1.0.5
Installing safe_yaml 1.0.5
Fetching unicode-display_width 1.7.0
Installing unicode-display_width 1.7.0
Fetching webrick 1.7.0
Installing webrick 1.7.0
Fetching pathutil 0.16.2
Fetching kramdown 2.3.0
Fetching terminal-table 2.0.0
Fetching addressable 2.7.0
Fetching i18n 1.8.9
Installing terminal-table 2.0.0
Installing pathutil 0.16.2
Installing i18n 1.8.9
Installing addressable 2.7.0
Installing kramdown 2.3.0
Fetching kramdown-parser-gfm 1.1.0
Installing kramdown-parser-gfm 1.1.0
Fetching rb-inotify 0.10.1
Fetching sassc 2.4.0
Fetching em-websocket 0.5.2
Installing rb-inotify 0.10.1
Installing em-websocket 0.5.2
Installing sassc 2.4.0 with native extensions
Fetching listen 3.4.1
Installing listen 3.4.1
Fetching jekyll-watch 2.2.1
Installing jekyll-watch 2.2.1
Fetching jekyll-sass-converter 2.1.0
Installing jekyll-sass-converter 2.1.0
Fetching jekyll 4.2.0
Installing jekyll 4.2.0
Fetching jekyll-redirect-from 0.16.0
Installing jekyll-redirect-from 0.16.0
Bundle complete! 4 Gemfile dependencies, 30 gems now installed.
Bundled gems are installed into `./.local_ruby_bundle`
```
#### Testing Jekyll (or other gem) update
First locally I reverted Jekyll to 4.1.0:
```
$ rm Gemfile.lock
$ rm -rf .local_ruby_bundle
# edited Gemfile to use version 4.1.0
$ cat Gemfile
source "https://rubygems.org"
gem "jekyll", "4.1.0"
gem "rouge", "3.26.0"
gem "jekyll-redirect-from", "0.16.0"
gem "webrick", "1.7"
$ bundle install
...
```
Testing Jekyll version before the update:
```
$ bundle exec jekyll --version
jekyll 4.1.0
```
Imitating Jekyll update coming from git by reverting my local changes:
```
$ git checkout Gemfile
Updated 1 path from the index
$ cat Gemfile
source "https://rubygems.org"
gem "jekyll", "4.2.0"
gem "rouge", "3.26.0"
gem "jekyll-redirect-from", "0.16.0"
gem "webrick", "1.7"
$ git checkout Gemfile.lock
Updated 1 path from the index
```
Run the install:
```
$ bundle install
...
```
Checking the updated Jekyll version:
```
$ bundle exec jekyll --version
jekyll 4.2.0
```
Closes#31559 from attilapiros/pin-jekyll-version.
Lead-authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Co-authored-by: Attila Zsolt Piros <2017933+attilapiros@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR completes snake_case rule at functions APIs across the languages, see also SPARK-10621.
In more details, this PR:
- Adds `count_distinct` in Scala Python, and R, and document that `count_distinct` is encouraged. This was not deprecated because `countDistinct` is pretty commonly used. We could deprecate in the future releases.
- (Scala-specific) adds `typedlit` but doesn't deprecate `typedLit` which is arguably commonly used. Likewise, we could deprecate in the future releases.
- Deprecates and renames:
- `sumDistinct` -> `sum_distinct`
- `bitwiseNOT` -> `bitwise_not`
- `shiftLeft` -> `shiftleft` (matched with SQL name in `FunctionRegistry`)
- `shiftRight` -> `shiftright` (matched with SQL name in `FunctionRegistry`)
- `shiftRightUnsigned` -> `shiftrightunsigned` (matched with SQL name in `FunctionRegistry`)
- (Scala-specific) `callUDF` -> `call_udf`
### Why are the changes needed?
To keep the consistent naming in APIs.
### Does this PR introduce _any_ user-facing change?
Yes, it deprecates some APIs and add new renamed APIs as described above.
### How was this patch tested?
Unittests were added.
Closes#31408 from HyukjinKwon/SPARK-34306.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR supplements the contents in the "Python Package Management".
If there is no Python installed in the local for all nodes when using `venv-pack`, job would fail as below.
```python
>>> from pyspark.sql.functions import pandas_udf
>>> pandas_udf('double')
... def pandas_plus_one(v: pd.Series) -> pd.Series:
... return v + 1
...
>>> spark.range(10).select(pandas_plus_one("id")).show()
...
Cannot run program "./environment/bin/python": error=2, No such file or directory
...
```
This is because the Python in the [packed environment via `venv-pack` has a symbolic link](https://github.com/jcrist/venv-pack/issues/5) that connects Python to the local one.
To avoid this confusion, it seems better to have an additional explanation for this.
### Why are the changes needed?
To provide more detailed information to users so that they don’t get confused
### Does this PR introduce _any_ user-facing change?
Yes, this PR fixes the part of "Python Package Management" in the "User Guide" documents.
### How was this patch tested?
Manually built the doc.
![Screen Shot 2021-01-21 at 7 10 38 PM](https://user-images.githubusercontent.com/44108233/105336258-5e8bec00-5c1c-11eb-870c-86acfc77c082.png)
Closes#31280 from itholic/SPARK-34190.
Authored-by: itholic <haejoon309@naver.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add UnivariateFeatureSelector
### Why are the changes needed?
Have one UnivariateFeatureSelector, so we don't need to have three Feature Selectors.
### Does this PR introduce _any_ user-facing change?
Yes
```
selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], labelCol=["target"], featureType="categorical", labelType="continuous", selectorType="numTopFeatures", numTopFeatures=100)
```
Or
numTopFeatures
```
selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], labelCol=["target"], scoreFunction="f_classif", selectorType="numTopFeatures", numTopFeatures=100)
```
### How was this patch tested?
Add Unit test
Closes#31160 from huaxingao/UnivariateSelector.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to:
- Add a link of quick start in PySpark docs into "Programming Guides" in Spark main docs
- `ML` / `MLlib` -> `MLlib (DataFrame-based)` / `MLlib (RDD-based)` in API reference page
- Mention other user guides as well because the guide such as [ML](http://spark.apache.org/docs/latest/ml-guide.html) and [SQL](http://spark.apache.org/docs/latest/sql-programming-guide.html).
- Mention other migration guides as well because PySpark can get affected by it.
### Why are the changes needed?
For better documentation.
### Does this PR introduce _any_ user-facing change?
It fixes user-facing docs. However, it's not released out yet.
### How was this patch tested?
Manually tested by running:
```bash
cd docs
SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve --watch
```
Closes#31082 from HyukjinKwon/SPARK-34041.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR upgrade Py4J from 0.10.9 to 0.10.9.1 that contains some bug fixes and improvements.
It contains one bug fix (4152353ac1).
### Why are the changes needed?
To leverage fixes from the upstream in Py4J.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Jenkins build and GitHub Actions will test it out.
Closes#31009 from HyukjinKwon/SPARK-33984.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR proposes to clarify:
- `PYSPARK_DRIVER_PYTHON` should not be set for cluster modes in YARN and Kubernates.
- `spark.yarn.appMasterEnv.PYSPARK_PYTHON` is not required in YARN. This is just another way to set `PYSPARK_PYTHON` that is specific for a Spark application.
### Why are the changes needed?
To clarify what's required and not.
### Does this PR introduce _any_ user-facing change?
Yes, this is a user-facing doc change.
### How was this patch tested?
Manually tested.
Note that this credits to gaborgsomogyi who actually tested and raised a doubt about this offline to me.
I also manually tested all again to double check.
Closes#30938 from HyukjinKwon/SPARK-33824-followup.
Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to restructure and refine the Python dependency management page.
I lately wrote a blog post which will be published soon, and decided contribute some of the contents back to PySpark documentation.
FWIW, it has been reviewed by some tech writers and engineers.
I built the site for making the review easier: https://hyukjin-spark.readthedocs.io/en/stable/user_guide/python_packaging.html
### Why are the changes needed?
For better documentation.
### Does this PR introduce _any_ user-facing change?
It's doc change but only in unreleased bracnhs for now.
### How was this patch tested?
I manually built the docs as:
```bash
cd python/docs
make clean html
open
```
Closes#30822 from HyukjinKwon/SPARK-33824.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR adds few lines about docstring style to document that PySpark follows [NumPy documentation style](https://numpydoc.readthedocs.io/en/latest/format.html). We all completed the migration to NumPy documentation style at SPARK-32085.
Ideally we should have a page like https://pandas.pydata.org/docs/development/contributing_docstring.html but I would like to leave it as a future work.
### Why are the changes needed?
To tell developers that PySpark now follows NumPy documentation style.
### Does this PR introduce _any_ user-facing change?
No, it's a change in unreleased branches yet.
### How was this patch tested?
Manually tested via `make clean html` under `python/docs`:
![Screen Shot 2020-12-06 at 1 34 50 PM](https://user-images.githubusercontent.com/6477701/101271623-d5ce0380-37c7-11eb-93ac-da73caa50c37.png)
Closes#30622 from HyukjinKwon/SPARK-33256.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
TL;DR:
- This PR completes the support of archives in Spark itself instead of Yarn-only
- It makes `--archives` option work in other cluster modes too and adds `spark.archives` configuration.
- After this PR, PySpark users can leverage Conda to ship Python packages together as below:
```python
conda create -y -n pyspark_env -c conda-forge pyarrow==2.0.0 pandas==1.1.4 conda-pack==0.5.0
conda activate pyspark_env
conda pack -f -o pyspark_env.tar.gz
PYSPARK_DRIVER_PYTHON=python PYSPARK_PYTHON=./environment/bin/python pyspark --archives pyspark_env.tar.gz#environment
```
- Issue a warning that undocumented and hidden behavior of partial archive handling in `spark.files` / `SparkContext.addFile` will be deprecated, and users can use `spark.archives` and `SparkContext.addArchive`.
This PR proposes to add Spark's native `--archives` in Spark submit, and `spark.archives` configuration. Currently, both are supported only in Yarn mode:
```bash
./bin/spark-submit --help
```
```
Options:
...
Spark on YARN only:
--queue QUEUE_NAME The YARN queue to submit to (Default: "default").
--archives ARCHIVES Comma separated list of archives to be extracted into the
working directory of each executor.
```
This `archives` feature is useful often when you have to ship a directory and unpack into executors. One example is native libraries to use e.g. JNI. Another example is to ship Python packages together by Conda environment.
Especially for Conda, PySpark currently does not have a nice way to ship a package that works in general, please see also https://hyukjin-spark.readthedocs.io/en/stable/user_guide/python_packaging.html#using-zipped-virtual-environment (PySpark new documentation demo for 3.1.0).
The neatest way is arguably to use Conda environment by shipping zipped Conda environment but this is currently dependent on this archive feature. NOTE that we are able to use `spark.files` by relying on its undocumented behaviour that untars `tar.gz` but I don't think we should document such ways and promote people to more rely on it.
Also, note that this PR does not target to add the feature parity of `spark.files.overwrite`, `spark.files.useFetchCache`, etc. yet. I documented that this is an experimental feature as well.
### Why are the changes needed?
To complete the feature parity, and to provide a better support of shipping Python libraries together with Conda env.
### Does this PR introduce _any_ user-facing change?
Yes, this makes `--archives` works in Spark instead of Yarn-only, and adds a new configuration `spark.archives`.
### How was this patch tested?
I added unittests. Also, manually tested in standalone cluster, local-cluster, and local modes.
Closes#30486 from HyukjinKwon/native-archive.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add array_to_vector function for dataframe column
### Why are the changes needed?
Utility function for array to vector conversion.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
scala unit test & doctest.
Closes#30498 from WeichenXu123/array_to_vec.
Lead-authored-by: Weichen Xu <weichen.xu@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR intends to fix typos in the sub-modules:
* `R`
* `common`
* `dev`
* `mlib`
* `external`
* `project`
* `streaming`
* `resource-managers`
* `python`
Split per srowen https://github.com/apache/spark/pull/30323#issuecomment-728981618
NOTE: The misspellings have been reported at 706a726f87 (commitcomment-44064356)
### Why are the changes needed?
Misspelled words make it harder to read / understand content.
### Does this PR introduce _any_ user-facing change?
There are various fixes to documentation, etc...
### How was this patch tested?
No testing was performed
Closes#30402 from jsoref/spelling-R_common_dev_mlib_external_project_streaming_resource-managers_python.
Authored-by: Josh Soref <jsoref@users.noreply.github.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR adds the following functions (introduced in Scala API with SPARK-33061):
- `acosh`
- `asinh`
- `atanh`
to Python and R.
### Why are the changes needed?
Feature parity.
### Does this PR introduce _any_ user-facing change?
New functions.
### How was this patch tested?
New unit tests.
Closes#30501 from zero323/SPARK-33563.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
I believe it's self-descriptive.
### Why are the changes needed?
To document supported features.
### Does this PR introduce _any_ user-facing change?
Yes, the docs are updated. It's master only.
### How was this patch tested?
Manually built the docs via `cd python/docs` and `make clean html`:
![Screen Shot 2020-11-20 at 10 59 07 AM](https://user-images.githubusercontent.com/6477701/99748225-7ad9b280-2b1f-11eb-86fd-165012b1bb7c.png)
Closes#30436 from HyukjinKwon/minor-doc-fix.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This change adds MapType support for PySpark with Arrow, if using pyarrow >= 2.0.0.
### Why are the changes needed?
MapType was previous unsupported with Arrow.
### Does this PR introduce _any_ user-facing change?
User can now enable MapType for `createDataFrame()`, `toPandas()` with Arrow optimization, and with Pandas UDFs.
### How was this patch tested?
Added new PySpark tests for createDataFrame(), toPandas() and Scalar Pandas UDFs.
Closes#30393 from BryanCutler/arrow-add-MapType-SPARK-24554.
Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes migration of Core to NumPy documentation style.
### Why are the changes needed?
To improve documentation style.
### Does this PR introduce _any_ user-facing change?
Yes, this changes both rendered HTML docs and console representation (SPARK-33243).
### How was this patch tested?
dev/lint-python and manual inspection.
Closes#30320 from zero323/SPARK-33254.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to initiate the migration to NumPy documentation style (from reST style) in PySpark docstrings.
This PR also adds one migration example of `SparkContext`.
- **Before:**
...
![Screen Shot 2020-10-26 at 7 02 05 PM](https://user-images.githubusercontent.com/6477701/97161090-a8ea0200-17c0-11eb-8204-0e70d18fc571.png)
...
![Screen Shot 2020-10-26 at 7 02 09 PM](https://user-images.githubusercontent.com/6477701/97161100-aab3c580-17c0-11eb-92ad-f5ad4441ce16.png)
...
- **After:**
...
![Screen Shot 2020-10-26 at 7 24 08 PM](https://user-images.githubusercontent.com/6477701/97161219-d636b000-17c0-11eb-80ab-d17a570ecb4b.png)
...
See also https://numpydoc.readthedocs.io/en/latest/format.html
### Why are the changes needed?
There are many reasons for switching to NumPy documentation style.
1. Arguably reST style doesn't fit well when the docstring grows large because it provides (arguably) less structures and syntax.
2. NumPy documentation style provides a better human readable docstring format. For example, notebook users often just do `help(...)` by `pydoc`.
3. NumPy documentation style is pretty commonly used in data science libraries, for example, pandas, numpy, Dask, Koalas,
matplotlib, ... Using NumPy documentation style can give users a consistent documentation style.
### Does this PR introduce _any_ user-facing change?
The dependency itself doesn't change anything user-facing.
The documentation change in `SparkContext` does, as shown above.
### How was this patch tested?
Manually tested via running `cd python` and `make clean html`.
Closes#30149 from HyukjinKwon/SPARK-33243.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to document the APIs in `Column` as well in API reference of PySpark documentation.
### Why are the changes needed?
To document common APIs in PySpark.
### Does this PR introduce _any_ user-facing change?
Yes, `Column.*` will be shown in API reference page.
### How was this patch tested?
Manually tested via `cd python` and `make clean html`.
Closes#30150 from HyukjinKwon/SPARK-32188.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add type hints guidelines to developer docs.
### Why are the changes needed?
Since it is a new and still somewhat evolving feature, we should provided clear guidelines for potential contributors.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Closes#30094 from zero323/SPARK-33003.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Adds a SQL function `raise_error` which underlies the refactored `assert_true` function. `assert_true` now also (optionally) accepts a custom error message field.
`raise_error` is exposed in SQL, Python, Scala, and R.
`assert_true` was previously only exposed in SQL; it is now also exposed in Python, Scala, and R.
### Why are the changes needed?
Improves usability of `assert_true` by clarifying error messaging, and adds the useful helper function `raise_error`.
### Does this PR introduce _any_ user-facing change?
Yes:
- Adds `raise_error` function to the SQL, Python, Scala, and R APIs.
- Adds `assert_true` function to the SQL, Python and R APIs.
### How was this patch tested?
Adds unit tests in SQL, Python, Scala, and R for `assert_true` and `raise_error`.
Closes#29947 from karenfeng/spark-32793.
Lead-authored-by: Karen Feng <karen.feng@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/29918. We should add it into the documentation as well.
### Why are the changes needed?
To show users new APIs.
### Does this PR introduce _any_ user-facing change?
Yes, `SparkContext.getCheckpointDir` will be documented.
### How was this patch tested?
Manually built the PySpark documentation:
```bash
cd python/docs
make clean html
cd build/html
open index.html
```
Closes#29960 from HyukjinKwon/SPARK-33017.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
`nth_value` was added at SPARK-27951. This PR adds the corresponding PySpark API.
### Why are the changes needed?
To support the consistent APIs
### Does this PR introduce _any_ user-facing change?
Yes, it introduces a new PySpark function API.
### How was this patch tested?
Unittest was added.
Closes#29899 from HyukjinKwon/SPARK-33020.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR proposes to document PySpark specific packaging guidelines.
### Why are the changes needed?
To have a single place for PySpark users, and better documentation.
### Does this PR introduce _any_ user-facing change?
Yes
### How was this patch tested?
```
cd python/docs
make clean html
```
Closes#29806 from fhoering/add_doc_python_packaging.
Lead-authored-by: Fabian Höring <f.horing@criteo.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR removes Hive 1.2 option (and therefore `HIVE_VERSION` environment variable as well).
### Why are the changes needed?
Hive 1.2 is a fork version. We shouldn't promote users to use.
### Does this PR introduce _any_ user-facing change?
Nope, `HIVE_VERSION` and Hive 1.2 are removed but this is new experimental feature in master only.
### How was this patch tested?
Manually tested:
```bash
SPARK_VERSION=3.0.1 HADOOP_VERSION=3.2 pip install pyspark-3.1.0.dev0.tar.gz -v
SPARK_VERSION=3.0.1 HADOOP_VERSION=2.7 pip install pyspark-3.1.0.dev0.tar.gz -v
SPARK_VERSION=3.0.1 HADOOP_VERSION=invalid pip install pyspark-3.1.0.dev0.tar.gz -v
```
Closes#29858 from HyukjinKwon/SPARK-32981.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to add a way to select Hadoop and Hive versions in pip installation.
Users can select Hive or Hadoop versions as below:
```bash
HADOOP_VERSION=3.2 pip install pyspark
HIVE_VERSION=1.2 pip install pyspark
HIVE_VERSION=1.2 HADOOP_VERSION=2.7 pip install pyspark
```
When the environment variables are set, internally it downloads the corresponding Spark version and then sets the Spark home to it. Also this PR exposes a mirror to set as an environment variable, `PYSPARK_RELEASE_MIRROR`.
**Please NOTE that:**
- We cannot currently leverage pip's native installation option, for example:
```bash
pip install pyspark --install-option="hadoop3.2"
```
This is because of a limitation and bug in pip itself. Once they fix this issue, we can switch from the environment variables to the proper installation options, see SPARK-32837.
It IS possible to workaround but very ugly or hacky with a big change. See [this PR](https://github.com/microsoft/nni/pull/139/files) as an example.
- In pip installation, we pack the relevant jars together. This PR _does not touch existing packaging way_ in order to prevent any behaviour changes.
Once this experimental way is proven to be safe, we can avoid packing the relevant jars together (and keep only the relevant Python scripts). And downloads the Spark distribution as this PR proposes.
- This way is sort of consistent with SparkR:
SparkR provides a method `SparkR::install.spark` to support CRAN installation. This is fine because SparkR is provided purely as a R library. For example, `sparkr` script is not packed together.
PySpark cannot take this approach because PySpark packaging ships relevant executable script together, e.g.) `pyspark` shell.
If PySpark has a method such as `pyspark.install_spark`, users cannot call it in `pyspark` because `pyspark` already assumes relevant Spark is installed, JVM is launched, etc.
- There looks no way to release that contains different Hadoop or Hive to PyPI due to [the version semantics](https://www.python.org/dev/peps/pep-0440/). This is not an option.
The usual way looks either `--install-option` above with hacks or environment variables given my investigation.
### Why are the changes needed?
To provide users the options to select Hadoop and Hive versions.
### Does this PR introduce _any_ user-facing change?
Yes, users will be able to select Hive and Hadoop version as below when they install it from `pip`;
```bash
HADOOP_VERSION=3.2 pip install pyspark
HIVE_VERSION=1.2 pip install pyspark
HIVE_VERSION=1.2 HADOOP_VERSION=2.7 pip install pyspark
```
### How was this patch tested?
Unit tests were added. I also manually tested in Mac and Windows (after building Spark with `python/dist/pyspark-3.1.0.dev0.tar.gz`):
```bash
./build/mvn -DskipTests -Phive-thriftserver clean package
```
Mac:
```bash
SPARK_VERSION=3.0.1 HADOOP_VERSION=3.2 pip install pyspark-3.1.0.dev0.tar.gz
```
Windows:
```bash
set HADOOP_VERSION=3.2
set SPARK_VERSION=3.0.1
pip install pyspark-3.1.0.dev0.tar.gz
```
Closes#29703 from HyukjinKwon/SPARK-32017.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Now that the minimum version of PyArrow is `1.0.0`, we should update the version in the installation guide.
### Why are the changes needed?
The minimum version of PyArrow was upgraded to `1.0.0`.
### Does this PR introduce _any_ user-facing change?
Users see the correct minimum version in the installation guide.
### How was this patch tested?
N/A
Closes#29829 from ueshin/issues/SPARK-32312/doc.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR:
- rephrases some wordings in installation guide to avoid using the terms that can be potentially ambiguous such as "different favors"
- documents extra dependency installation `pip install pyspark[sql]`
- uses the link that corresponds to the released version. e.g.) https://spark.apache.org/docs/latest/building-spark.html vs https://spark.apache.org/docs/3.0.0/building-spark.html
- adds some more details
I built it on Read the Docs to make it easier to review: https://hyukjin-spark.readthedocs.io/en/stable/getting_started/install.html
### Why are the changes needed?
To improve installation guide.
### Does this PR introduce _any_ user-facing change?
Yes, it updates the user-facing installation guide.
### How was this patch tested?
Manually built the doc and tested.
Closes#29779 from HyukjinKwon/SPARK-32180.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
This simply fixes an .rst generation error in https://github.com/apache/spark/pull/29640Closes#29735 from srowen/SPARK-32180.2.
Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to add getting started- installation to new PySpark docs.
### Why are the changes needed?
Better documentation.
### Does this PR introduce _any_ user-facing change?
No. Documentation only.
### How was this patch tested?
Generating documents locally.
Closes#29640 from rohitmishr1484/SPARK-32180-Getting-Started-Installation.
Authored-by: Rohit.Mishra <rohit.mishra@utopusinsights.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Upgrade Apache Arrow to version 1.0.1 for the Java dependency and increase minimum version of PyArrow to 1.0.0.
This release marks a transition to binary stability of the columnar format (which was already informally backward-compatible going back to December 2017) and a transition to Semantic Versioning for the Arrow software libraries. Also note that the Java arrow-memory artifact has been split to separate dependence on netty-buffer and allow users to select an allocator. Spark will continue to use `arrow-memory-netty` to maintain performance benefits.
Version 1.0.0 - 1.0.0 include the following selected fixes/improvements relevant to Spark users:
ARROW-9300 - [Java] Separate Netty Memory to its own module
ARROW-9272 - [C++][Python] Reduce complexity in python to arrow conversion
ARROW-9016 - [Java] Remove direct references to Netty/Unsafe Allocators
ARROW-8664 - [Java] Add skip null check to all Vector types
ARROW-8485 - [Integration][Java] Implement extension types integration
ARROW-8434 - [C++] Ipc RecordBatchFileReader deserializes the Schema multiple times
ARROW-8314 - [Python] Provide a method to select a subset of columns of a Table
ARROW-8230 - [Java] Move Netty memory manager into a separate module
ARROW-8229 - [Java] Move ArrowBuf into the Arrow package
ARROW-7955 - [Java] Support large buffer for file/stream IPC
ARROW-7831 - [Java] unnecessary buffer allocation when calling splitAndTransferTo on variable width vectors
ARROW-6111 - [Java] Support LargeVarChar and LargeBinary types and add integration test with C++
ARROW-6110 - [Java] Support LargeList Type and add integration test with C++
ARROW-5760 - [C++] Optimize Take implementation
ARROW-300 - [Format] Add body buffer compression option to IPC message protocol using LZ4 or ZSTD
ARROW-9098 - RecordBatch::ToStructArray cannot handle record batches with 0 column
ARROW-9066 - [Python] Raise correct error in isnull()
ARROW-9223 - [Python] Fix to_pandas() export for timestamps within structs
ARROW-9195 - [Java] Wrong usage of Unsafe.get from bytearray in ByteFunctionsHelper class
ARROW-7610 - [Java] Finish support for 64 bit int allocations
ARROW-8115 - [Python] Conversion when mixing NaT and datetime objects not working
ARROW-8392 - [Java] Fix overflow related corner cases for vector value comparison
ARROW-8537 - [C++] Performance regression from ARROW-8523
ARROW-8803 - [Java] Row count should be set before loading buffers in VectorLoader
ARROW-8911 - [C++] Slicing a ChunkedArray with zero chunks segfaults
View release notes here:
https://arrow.apache.org/release/1.0.1.htmlhttps://arrow.apache.org/release/1.0.0.html
### Why are the changes needed?
Upgrade brings fixes, improvements and stability guarantees.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests with pyarrow 1.0.0 and 1.0.1
Closes#29686 from BryanCutler/arrow-upgrade-100-SPARK-32312.
Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to document the way of debugging PySpark. It's pretty much self-descriptive.
I made a demo site to review it more effectively: https://hyukjin-spark.readthedocs.io/en/stable/development/debugging.html
### Why are the changes needed?
To let users know how to debug PySpark applications.
### Does this PR introduce _any_ user-facing change?
Yes, it adds a new page in the documentation about debugging PySpark.
### How was this patch tested?
Manually built the doc.
Closes#29639 from HyukjinKwon/SPARK-32186.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to add a page to describe how to test PySpark. Note that it avoids duplication of https://spark.apache.org/developer-tools.html and it more aims to add put the relevant links together.
I made a demo site to review more effectively: https://hyukjin-spark.readthedocs.io/en/stable/development/testing.html
### Why are the changes needed?
To guide PySpark developers easily test.
### Does this PR introduce _any_ user-facing change?
Yes, it will adds a new documentation page.
### How was this patch tested?
Manually tested.
Closes#29634 from HyukjinKwon/SPARK-32783.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR is a minor followup to fix:
1. Slightly reword the wording in the main page.
2. The indentation in the table at the migration guide;
from
![Screen Shot 2020-09-01 at 1 53 40 PM](https://user-images.githubusercontent.com/6477701/91796204-91781800-ec5a-11ea-9f57-d7a9f4207ba0.png)
to
![Screen Shot 2020-09-01 at 1 53 26 PM](https://user-images.githubusercontent.com/6477701/91796202-9046eb00-ec5a-11ea-9db2-815139ddfdb9.png)
### Why are the changes needed?
In order to show the migration guide pretty.
### Does this PR introduce _any_ user-facing change?
Yes, this is a change to user-facing documentation.
### How was this patch tested?
Manually built the documentation.
Closes#29606 from HyukjinKwon/SPARK-32191.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to document PySpark specific contribution guides at "Development" section.
Here is the demo for reviewing quicker: https://hyukjin-spark.readthedocs.io/en/stable/development/contributing.html
### Why are the changes needed?
To have a single place for PySpark users, and better documentation.
### Does this PR introduce _any_ user-facing change?
Yes, it is a new documentation. See the demo linked above.
### How was this patch tested?
```bash
cd docs
SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve --watch
```
and
```bash
cd python/docs
make clean html
```
Closes#29596 from HyukjinKwon/SPARK-32190.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to move Arrow usage guide from Spark documentation site to PySpark documentation site (at "User Guide").
Here is the demo for reviewing quicker: https://hyukjin-spark.readthedocs.io/en/stable/user_guide/arrow_pandas.html
### Why are the changes needed?
To have a single place for PySpark users, and better documentation.
### Does this PR introduce _any_ user-facing change?
Yes, it will move https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html to our PySpark documentation.
### How was this patch tested?
```bash
cd docs
SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve --watch
```
and
```bash
cd python/docs
make clean html
```
Closes#29548 from HyukjinKwon/SPARK-32183.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
The class `KMeansSummary` in pyspark is not included in `clustering.py`'s `__all__` declaration. It isn't included in the docs as a result.
`InheritableThread` and `KMeansSummary` should be into corresponding RST files for documentation.
### Why are the changes needed?
It seems like an oversight to not include this as all similar "summary" classes are.
`InheritableThread` should also be documented.
### Does this PR introduce _any_ user-facing change?
I don't believe there are functional changes. It should make this public class appear in docs.
### How was this patch tested?
Existing tests / N/A.
Closes#29470 from srowen/KMeansSummary.
Lead-authored-by: Sean Owen <srowen@gmail.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to add `StorageLevel.DISK_ONLY_3` as a built-in `StorageLevel`.
### Why are the changes needed?
In a YARN cluster, HDFS uaually provides storages with replication factor 3. So, we can save the result to HDFS to get `StorageLevel.DISK_ONLY_3` technically. However, disaggregate clusters or clusters without storage services are rising. Previously, in that situation, the users were able to use similar `MEMORY_AND_DISK_2` or a user-created `StorageLevel`. This PR aims to support those use cases officially for better UX.
### Does this PR introduce _any_ user-facing change?
Yes. This provides a new built-in option.
### How was this patch tested?
Pass the GitHub Action or Jenkins with the revised test cases.
Closes#29331 from dongjoon-hyun/SPARK-32517.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>