Commit graph

143 commits

Author SHA1 Message Date
Hyukjin Kwon 5a7686a393 [SPARK-35301][PYTHON][DOCS] Document migration guide from Koalas to pandas APIs on Spark
### What changes were proposed in this pull request?

This PR proposes to add a migration guide for legacy Koalas users in pandas API on Spark.

### Why are the changes needed?

For easier migration.

### Does this PR introduce _any_ user-facing change?

Yes, this adds a new page for migration from Koalas.

### How was this patch tested?

Manually built the docs and checked manually.

Closes #33050 from HyukjinKwon/SPARK-35301.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-24 17:58:09 +09:00
itholic 92ddef7cfb [SPARK-35696][PYTHON][DOCS][FOLLOW-UP] Fix underline for title in FAQ to remove warnings
### What changes were proposed in this pull request?

This PR follow-up for SPARK-35696 to fix incorrect underline in the documents to remove warnings.

### Why are the changes needed?

We should build the docs without any incorrect documentation style

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manually build docs and see the warning is removed

Closes #33052 from itholic/SPARK-35696-followup.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-24 15:20:13 +09:00
itholic 712ed87faa [SPARK-35696][PYTHON][DOCS] Refine the code examples in pandas-on-Spark documentation
### What changes were proposed in this pull request?

This PR proposes to refine the code examples for pandas-on-Spark since some of them still follows the naming for Koalas.

For example,

```python
kdf = ks.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
```

should be refined to

```python
psdf = ps.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
```

Also fixed the several remaining Koalas stuffs in FAQ

### Why are the changes needed?

Because we don't want to use the name "Koalas" in the Apache Spark anymore.

### Does this PR introduce _any_ user-facing change?

Yes, the examples in the documentation will be changed with refined names.

### How was this patch tested?

Manually built the docs and check one by one.

Closes #33017 from itholic/SPARK-35696.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-24 14:48:13 +09:00
Hyukjin Kwon be9089731a [SPARK-35588][PYTHON][DOCS] Merge Binder integration and quickstart notebook for pandas API on Spark
### What changes were proposed in this pull request?

This PR proposes to fix:
- the Binder integration of pandas API on Spark, and merge them together with the existing PySpark one.
- update quickstart of pandas API on Spark, and make it working

The notebooks can be easily reviewed here:

https://mybinder.org/v2/gh/HyukjinKwon/spark/SPARK-35588-3?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_ps.ipynb

Original page in Koalas: https://koalas.readthedocs.io/en/latest/getting_started/10min.html

### Why are the changes needed?

- To show the working examples of quickstart to end users.
- To allow users to try out the examples without installation easily.

### Does this PR introduce _any_ user-facing change?

No to end users because the existing quickstart of pandas API on Spark is not released yet.

### How was this patch tested?

I manually tested it by uploading built Spark distribution to Binder. See 3bc15310a0

Closes #33041 from HyukjinKwon/SPARK-35588-2.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-24 10:17:22 +09:00
Hyukjin Kwon 27046582e4 [SPARK-35645][PYTHON][DOCS] Merge contents and remove obsolete pages in Getting Started section
### What changes were proposed in this pull request?

This PR revise the installation to describe `pip install pyspark[pandas_on_spark]` and removes pandas-on-Spark installation and videos/blogposts.

### Why are the changes needed?

pandas-on-Spark installation is merged to PySpark installation pages. For videos/blogposts, now this is named pandas API on Spark. Old Koalas blogposts and videos are obsolete.

### Does this PR introduce _any_ user-facing change?

To end users, no because the docs are not released yet.

### How was this patch tested?

I manually built the docs and checked the output

Closes #33018 from HyukjinKwon/SPARK-35645.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-22 09:36:27 -07:00
HyukjinKwon 41af409b7b [SPARK-35303][PYTHON] Enable pinned thread mode by default
### What changes were proposed in this pull request?

PySpark added pinned thread mode at https://github.com/apache/spark/pull/24898 to sync Python thread to JVM thread. Previously, one JVM thread could be reused which ends up with messed inheritance hierarchy such as thread local especially when multiple jobs run in parallel. To completely fix this, we should enable this mode by default.

### Why are the changes needed?

To correctly support parallel job submission and management.

### Does this PR introduce _any_ user-facing change?

Yes, now Python thread is mapped to JVM thread one to one.

### How was this patch tested?

Existing tests should cover it.

Closes #32429 from HyukjinKwon/SPARK-35303.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-18 12:02:29 +09:00
Hyukjin Kwon 94bdbec380 [SPARK-35644][PYTHON][DOCS] Merge contents and remove obsolete pages in Development section
### What changes were proposed in this pull request?

This PR proposes to merge contents and remove obsolete pages in Development section, especially about pandas API on Spark.

Some were removed, and some were merged to the existing PySpark guides. I will inline some comments in the PRs to make the review easier.

### Why are the changes needed?

To guide developers on the code base of pandas API on Spark.

### Does this PR introduce _any_ user-facing change?

Yes, it updates the user-facing documentation.

### How was this patch tested?

Manually built the docs and checked.

Closes #32926 from HyukjinKwon/SPARK-35644.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-17 13:35:20 +09:00
Hyukjin Kwon 95f36e76c6 [SPARK-35750][PYTHON][DOCS] Rename "pandas APIs on Spark" to "pandas API on Spark"
### What changes were proposed in this pull request?

This PR proposes to rename "pandas APIs on Spark" to "pandas API on Spark" which is more natural (since API stands for Application Program Interface).

### Why are the changes needed?

To make it sound more natural.

### Does this PR introduce _any_ user-facing change?

It fixes a typo in the unreleased changes.

### How was this patch tested?

N/A

Closes #32903 from HyukjinKwon/SPARK-34885.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-15 10:01:04 +09:00
Takuya UESHIN ef7545b788 [SPARK-35759][PYTHON] Remove the upperbound for numpy for pandas-on-Spark
### What changes were proposed in this pull request?

Removes the upperbound for numpy for pandas-on-Spark.

### Why are the changes needed?

We can remove the upper-bound for numpy for pandas-on-Spark because currently it works well on the CI with numpy 1.20.3.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #32908 from ueshin/issues/SPARK-35759/numpy.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-15 09:59:05 +09:00
itholic ebe529e8e1 [SPARK-35591][PYTHON][DOCS] Rename "Koalas" to "pandas API on Spark" in the documents
### What changes were proposed in this pull request?

This PR proposes the change the name "Koalas" to the "Pandas APIs on Spark" in the documents.

### Why are the changes needed?

Since we don't use the name "Koalas" anymore.

We should use "Pandas APIs on Spark" instead.

### Does this PR introduce _any_ user-facing change?

Yes, the name "Koalas" is renamed to "Pandas APIs on Spark" in the documents.

### How was this patch tested?

Manually built the docs and checked one by one.

Closes #32835 from itholic/SPARK-35591.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-11 20:42:38 +09:00
Hyukjin Kwon afff42178c [SPARK-35647][PYTHON][DOCS] Restructure User Guide in PySpark documentation
### What changes were proposed in this pull request?

This PR proposes to restructure User Guide in PySpark documentation for pandas APIs on Spark.

**Before**

![Screen Shot 2021-06-08 at 8 47 41 PM](https://user-images.githubusercontent.com/6477701/121179493-cb85e280-c89a-11eb-8b93-552ebe7cd0a8.png)

**After**

![Screen Shot 2021-06-08 at 8 46 58 PM](https://user-images.githubusercontent.com/6477701/121179419-b3ae5e80-c89a-11eb-82a0-6dabbf1de12d.png)

Note that I mostly just moved the contents around except minor changes:
- Removing some questions in FAQ that don't make sense in Apache Spark
- Rename a subtitle "Working with pandas and PySpark" to "From/to pandas and PySpark DataFrames"

For renaming Koalas to either pandas-on-Spark or pandas APIs on Spark, it will be done at SPARK-35591

### Why are the changes needed?

For better readability.

### Does this PR introduce _any_ user-facing change?

Yes, it restructures the documentation as shown above.

### How was this patch tested?

I manually built the docs and tested.

Closes #32820 from HyukjinKwon/SPARK-35647.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-09 12:13:25 +09:00
Hyukjin Kwon 921abc51cf [SPARK-35636][PYTHON][DOCS][FOLLOW-UP] Restructure reference API files according to the layout
### What changes were proposed in this pull request?

This PR proposes to restructure API files according to the layout, see https://github.com/apache/spark/pull/32799. Now the pandas APIs on Spark are under a separate directory which is same level as other modules such as Spark SQL.

```bash
tree reference
```

**Before:**

```
reference
├── index.rst
├── ps_extensions.rst
├── ps_frame.rst
├── ps_general_functions.rst
├── ps_groupby.rst
├── ps_indexing.rst
├── ps_io.rst
├── ps_ml.rst
├── ps_series.rst
├── ps_window.rst
├── pyspark.ml.rst
├── pyspark.mllib.rst
├── pyspark.pandas.rst
├── pyspark.resource.rst
├── pyspark.rst
├── pyspark.sql.rst
├── pyspark.ss.rst
└── pyspark.streaming.rst
```

**After:**

```
reference
├── index.rst
├── pyspark.ml.rst
├── pyspark.mllib.rst
├── pyspark.pandas
│   ├── extensions.rst
│   ├── frame.rst
│   ├── general_functions.rst
│   ├── groupby.rst
│   ├── index.rst
│   ├── indexing.rst
│   ├── io.rst
│   ├── ml.rst
│   ├── series.rst
│   └── window.rst
├── pyspark.resource.rst
├── pyspark.rst
├── pyspark.sql.rst
├── pyspark.ss.rst
└── pyspark.streaming.rst
```

### Why are the changes needed?

To make the directory structure easier to follow.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually built and tested the docs.

Closes #32812 from HyukjinKwon/SPARK-35646-followup.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-08 19:01:56 +09:00
Hyukjin Kwon 7ce7aa4758 [SPARK-35646][PYTHON][DOCS] Relocate pandas-on-Spark API references in documentation
### What changes were proposed in this pull request?

This PR proposes to change from:

![Screen Shot 2021-06-07 at 1 40 47 PM](https://user-images.githubusercontent.com/6477701/120960027-fc302400-c795-11eb-96fb-73ac1d8277fe.png)

to:

![Screen Shot 2021-06-07 at 1 41 19 PM](https://user-images.githubusercontent.com/6477701/120960074-0fdb8a80-c796-11eb-87ec-69a30692fdfe.png)

### Why are the changes needed?

pandas APIs on Spark (pandas on Spark) is a package in PySpark in the end. So it has to be documented in the same level with other packages (e.g., Spark SQL).

### Does this PR introduce _any_ user-facing change?

Yes, it changes the structure of the docs. To end users, no as it's only in development branch.

### How was this patch tested?

Manually tested as above.

Closes #32799 from HyukjinKwon/SPARK-35646.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-07 16:37:58 +09:00
Hyukjin Kwon 3d158f9c91 [SPARK-35587][PYTHON][DOCS] Initial porting of Koalas documentation
### What changes were proposed in this pull request?

This PR proposes to port Koalas documentation to PySpark documentation as its initial step.
It ports almost as is except these differences:

- Renamed import from `databricks.koalas` to `pyspark.pandas`.
- Renamed `to_koalas` -> `to_pandas_on_spark`
- Renamed `(Series|DataFrame).koalas` -> `(Series|DataFrame).pandas_on_spark`
- Added a `ps_` prefix in the RST file names of Koalas documentation

Other then that,

- Excluded `python/docs/build/html` in linter
- Fixed GA dependency installataion

### Why are the changes needed?

To document pandas APIs on Spark.

### Does this PR introduce _any_ user-facing change?

Yes, it adds new documentations.

### How was this patch tested?

Manually built the docs and checked the output.

Closes #32726 from HyukjinKwon/SPARK-35587.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-04 11:11:09 +09:00
Kousuke Saruta 9283bebbbd [SPARK-35418][SQL] Add sentences function to functions.{scala,py}
### What changes were proposed in this pull request?

This PR adds `sentences`, a string function, which is present as of `2.0.0` but missing in `functions.{scala,py}`.

### Why are the changes needed?

This function can be only used from SQL for now.
It's good if we can use this function from Scala/Python code as well as SQL.

### Does this PR introduce _any_ user-facing change?

Yes. Users can use this function from Scala and Python.

### How was this patch tested?

New test.

Closes #32566 from sarutak/sentences-function.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-05-19 20:07:28 +09:00
Hyukjin Kwon 747fe7282c [SPARK-35419][PYTHON] Enable spark.sql.execution.pyspark.udf.simplifiedTraceback.enabled by default
### What changes were proposed in this pull request?

https://github.com/apache/spark/pull/30309 added a configuration (disabled by default) that simplifies the error messages from Python UDFS, which removed internal stacktrace from Python workers:

```python
from pyspark.sql.functions import udf; spark.range(10).select(udf(lambda x: x/0)("id")).collect()
```

**Before**

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../python/pyspark/sql/dataframe.py", line 427, in show
    print(self._jdf.showString(n, 20, vertical))
  File "/.../python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/.../python/pyspark/sql/utils.py", line 127, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.PythonException:
  An exception was thrown from Python worker in the executor:
Traceback (most recent call last):
  File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 605, in main
    process()
  File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 597, in process
    serializer.dump_stream(out_iter, outfile)
  File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream
    for obj in iterator:
  File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched
    for item in iterator:
  File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda>
    return lambda *a: f(*a)
  File "/.../python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper
    return f(*args, **kwargs)
  File "<stdin>", line 1, in <lambda>
ZeroDivisionError: division by zero
```

**After**

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../python/pyspark/sql/dataframe.py", line 427, in show
    print(self._jdf.showString(n, 20, vertical))
  File "/.../python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/.../python/pyspark/sql/utils.py", line 127, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.PythonException:
  An exception was thrown from Python worker in the executor:
Traceback (most recent call last):
  File "<stdin>", line 1, in <lambda>
ZeroDivisionError: division by zero
```

Note that the traceback (`return f(*args, **kwargs)`) is almost always same - I would say more than 99%. For 1% case, we can guide developers to enable this configuration for further debugging.

In Databricks, it has been enabled for around 6 months, and I have had zero negative feedback on it.

### Why are the changes needed?

To show simplified exception messages to end users.

### Does this PR introduce _any_ user-facing change?

Yes, it will hide the internal Python worker traceback.

### How was this patch tested?

Existing test cases should cover.

Closes #32569 from HyukjinKwon/SPARK-35419.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-18 12:27:09 +09:00
Xinrong Meng 5ecb112410 [SPARK-35300][PYTHON][DOCS] Standardize module names in install.rst
### What changes were proposed in this pull request?

Use full names of modules in `install.rst` when specifying dependencies.

### Why are the changes needed?

Using full names makes it more clear.
In addition, `pandas APIs on Spark` as a new module can start to be recognized by more people.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual verification.

Closes #32427 from xinrong-databricks/nameDoc.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-05-04 11:02:57 +09:00
Xinrong Meng 120c389b00 [SPARK-34887][PYTHON] Port Koalas dependencies into PySpark
### What changes were proposed in this pull request?

Port Koalas dependencies appropriately to PySpark dependencies.

### Why are the changes needed?

pandas-on-Spark has its own required dependency and optional dependencies.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

Closes #32386 from xinrong-databricks/portDeps.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-05-04 09:04:23 +09:00
Yikun Jiang 44b7931936 [SPARK-35176][PYTHON] Standardize input validation error type
### What changes were proposed in this pull request?
This PR corrects some exception type when the function input params are failed to validate due to TypeError.
In order to convenient to review, there are 3 commits in this PR:
- Standardize input validation error type on sql
- Standardize input validation error type on ml
- Standardize input validation error type on pandas

### Why are the changes needed?
As suggestion from Python exception doc [1]: "Raised when an operation or function is applied to an object of inappropriate type.", but there are many Value error are raised in some pyspark code, this patch fix them.

[1] https://docs.python.org/3/library/exceptions.html#TypeError

Note that: this patch only addresses the exsiting some wrong raise type for input validation, the input validation decorator/framework which mentioned in [SPARK-35176](https://issues.apache.org/jira/browse/SPARK-35176), would be submited in a speparated patch.

### Does this PR introduce _any_ user-facing change?
Yes, code can raise the right TypeError instead of ValueError.

### How was this patch tested?
Existing test case and UT

Closes #32368 from Yikun/SPARK-35176.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-05-03 15:34:24 +09:00
Yikun Jiang 0769049ee1 [SPARK-34979][PYTHON][DOC] Add PyArrow installation note for PySpark aarch64 user
### What changes were proposed in this pull request?

This patch adds a note for aarch64 user to install the specific pyarrow>=4.0.0.

### Why are the changes needed?

The pyarrow aarch64 support is [introduced](https://github.com/apache/arrow/pull/9285) in [PyArrow 4.0.0](https://github.com/apache/arrow/releases/tag/apache-arrow-4.0.0), and it has been published 27.Apr.2021.

See more in [SPARK-34979](https://issues.apache.org/jira/browse/SPARK-34979).

### Does this PR introduce _any_ user-facing change?
Yes, this doc can help user install arrow on aarch64.

### How was this patch tested?
doc test passed.

Closes #32363 from Yikun/SPARK-34979.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
2021-04-28 09:56:17 +09:00
HyukjinKwon 2ca76a57be [MINOR][DOCS] Use ASCII characters when possible in PySpark documentation
### What changes were proposed in this pull request?

This PR replaces the non-ASCII characters to ASCII characters when possible in PySpark documentation

### Why are the changes needed?

To avoid unnecessarily using other non-ASCII characters which could lead to the issue such as https://github.com/apache/spark/pull/32047 or https://github.com/apache/spark/pull/22782

### Does this PR introduce _any_ user-facing change?

Virtually no.

### How was this patch tested?

Found via (Mac OS):

```bash
# In Spark root directory
cd python
pcregrep --color='auto' -n "[\x80-\xFF]" `git ls-files .`
```

Closes #32048 from HyukjinKwon/minor-fix.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-04 09:49:36 +03:00
David Li 1237124062 [SPARK-34463][PYSPARK][DOCS] Document caveats of Arrow selfDestruct
### What changes were proposed in this pull request?

As a followup for #29818, document caveats of using the Arrow selfDestruct option in toPandas, which include:
- toPandas() may be slower;
- the resulting dataframe may not support some Pandas operations due to immutable backing arrays.

### Why are the changes needed?

This will hopefully reduce user confusion as with SPARK-34463.

### Does this PR introduce _any_ user-facing change?

Yes - documentation is updated and a config setting description is updated to clearly indicate the config is experimental.

### How was this patch tested?
This is a documentation-only change.

Closes #31738 from lidavidm/spark-34463.

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-30 13:30:27 +09:00
HyukjinKwon c7bf8adc38 [SPARK-34818][PYTHON][DOCS] Reorder the items in User Guide at PySpark documentation
### What changes were proposed in this pull request?

This PR proposes to reorder the items in User Guide in PySpark documentation in order to place general guides first and advance ones later.

### Why are the changes needed?

For users to more easily follow.

### Does this PR introduce _any_ user-facing change?

Yes, it changes the order in the items in documentation .

### How was this patch tested?

Manually verified the documentation after building:

<img width="768" alt="Screen Shot 2021-03-22 at 2 38 41 PM" src="https://user-images.githubusercontent.com/6477701/111945072-5537d680-8b1c-11eb-9f43-02f3ad63a509.png">

FWIW, the current page: https://spark.apache.org/docs/latest/api/python/user_guide/index.html

Closes #31922 from HyukjinKwon/SPARK-34818.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-22 15:53:39 +09:00
wankunde 60e324aa9f [SPARK-34688][PYTHON] Upgrade to Py4J 0.10.9.2
### What changes were proposed in this pull request?
This PR upgrade Py4J from 0.10.9.1 to 0.10.9.2 that contains some bug fixes and improvements.

* expose shell parameter in Popen inside launch_gateway. ([bartdag/py4j220efc3](220efc3716))
* fixed Flake8 errors ([bartdag/py4j6c6ee9a](6c6ee9aedc))

### Why are the changes needed?
To leverage fixes from the upstream in Py4J.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Jenkins build and GitHub Actions will test it out.

Closes #31796 from wankunde/py4j.

Authored-by: wankunde <wankunde@163.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-03-11 09:51:41 -06:00
HyukjinKwon 2526fdea48 [SPARK-34657][PYTHON][DOCS] Replace the tag of release to the hash to hide RC tags in Binder
### What changes were proposed in this pull request?

Currently Binder link at Spark 3.1.1 (https://mybinder.org/v2/gh/apache/spark/v3.1.1-rc3?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb) shows  `v3.1.1-rc3` like:
![Screen Shot 2021-03-08 at 10 10 55 AM](https://user-images.githubusercontent.com/6477701/110262729-ecb70880-7ff7-11eb-92ba-f151d74985a6.png)

After the fix, it will shows the explicit hash:

![Screen Shot 2021-03-08 at 10 17 25 AM](https://user-images.githubusercontent.com/6477701/110262740-f476ad00-7ff7-11eb-8632-5b418ff87024.png)

In addition, this also fixes the examples URL while I am fixing it. For example: https://github.com/apache/spark/tree/v3.1.1-rc3/examples/src/main/python -> https://github.com/apache/spark/tree/1d550c4e902/examples/src/main/python

Note that it is hash in order to make both dev and release easier.

### Why are the changes needed?

To hide RC tags.

### Does this PR introduce _any_ user-facing change?

It will just change the URL shown when Binder is being loaded.

### How was this patch tested?

Manually tested:

```bash
make clean html
```

![Screen Shot 2021-03-08 at 10 17 06 AM](https://user-images.githubusercontent.com/6477701/110262813-2ee04a00-7ff8-11eb-9983-c4484f7832c4.png)

```bash
git_hash=`git rev-parse --short HEAD`
export GIT_HASH=$git_hash
make clean html
```

![Screen Shot 2021-03-08 at 10 17 25 AM](https://user-images.githubusercontent.com/6477701/110262805-2982ff80-7ff8-11eb-8560-e1e2aa7b263a.png)

Closes #31773 from HyukjinKwon/SPARK-34657.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-08 10:48:17 +09:00
Richard Penney 7d0743b493 [SPARK-33678][SQL] Product aggregation function
### Why is this change being proposed?
This patch adds support for a new "product" aggregation function in `sql.functions` which multiplies-together all values in an aggregation group.

This is likely to be useful in statistical applications which involve combining probabilities, or financial applications that involve combining cumulative interest rates, but is also a versatile mathematical operation of similar status to `sum` or `stddev`. Other users [have noted](https://stackoverflow.com/questions/52991640/cumulative-product-in-spark) the absence of such a function in current releases of Spark.

This function is both much more concise than an expression of the form `exp(sum(log(...)))`, and avoids awkward edge-cases associated with some values being zero or negative, as well as being less computationally costly.

### Does this PR introduce _any_ user-facing change?
No - only adds new function.

### How was this patch tested?
Built-in tests have been added for the new `catalyst.expressions.aggregate.Product` class and its invocation via the (scala) `sql.functions.product` function. The latter, and the PySpark wrapper have also been manually tested in spark-shell and pyspark sessions. The SparkR wrapper is currently untested, and may need separate validation (I'm not an "R" user myself).

An illustration of the new functionality, within PySpark is as follows:
```
import pyspark.sql.functions as pf, pyspark.sql.window as pw

df = sqlContext.range(1, 17).toDF("x")
win = pw.Window.partitionBy(pf.lit(1)).orderBy(pf.col("x"))

df.withColumn("factorial", pf.product("x").over(win)).show(20, False)
+---+---------------+
|x  |factorial      |
+---+---------------+
|1  |1.0            |
|2  |2.0            |
|3  |6.0            |
|4  |24.0           |
|5  |120.0          |
|6  |720.0          |
|7  |5040.0         |
|8  |40320.0        |
|9  |362880.0       |
|10 |3628800.0      |
|11 |3.99168E7      |
|12 |4.790016E8     |
|13 |6.2270208E9    |
|14 |8.71782912E10  |
|15 |1.307674368E12 |
|16 |2.0922789888E13|
+---+---------------+
```

Closes #30745 from rwpenney/feature/agg-product.

Lead-authored-by: Richard Penney <rwp@rwpenney.uk>
Co-authored-by: Richard Penney <rwpenney@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-02 16:51:07 +09:00
Phillip Henry 397b843890 [SPARK-34415][ML] Randomization in hyperparameter optimization
### What changes were proposed in this pull request?

Code in the PR generates random parameters for hyperparameter tuning. A discussion with Sean Owen can be found on the dev mailing list here:

http://apache-spark-developers-list.1001551.n3.nabble.com/Hyperparameter-Optimization-via-Randomization-td30629.html

All code is entirely my own work and I license the work to the project under the project’s open source license.

### Why are the changes needed?

Randomization can be a more effective techinique than a grid search since min/max points can fall between the grid and never be found. Randomisation is not so restricted although the probability of finding minima/maxima is dependent on the number of attempts.

Alice Zheng has an accessible description on how this technique works at https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html

Although there are Python libraries with more sophisticated techniques, not every Spark developer is using Python.

### Does this PR introduce _any_ user-facing change?

A new class (`ParamRandomBuilder.scala`) and its tests have been created but there is no change to existing code. This class offers an alternative to `ParamGridBuilder` and can be dropped into the code wherever `ParamGridBuilder` appears. Indeed, it extends `ParamGridBuilder` and is completely compatible with  its interface. It merely adds one method that provides a range over which a hyperparameter will be randomly defined.

### How was this patch tested?

Tests `ParamRandomBuilderSuite.scala` and `RandomRangesSuite.scala` were added.

`ParamRandomBuilderSuite` is the analogue of the already existing `ParamGridBuilderSuite` which tests the user-facing interface.

`RandomRangesSuite` uses ScalaCheck to test the random ranges over which hyperparameters are distributed.

Closes #31535 from PhillHenry/ParamRandomBuilder.

Authored-by: Phillip Henry <PhillHenry@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-02-27 08:34:39 -06:00
HyukjinKwon b5470ae294 [MINOR][DOCS] Replace http to https when possible in PySpark documentation
### What changes were proposed in this pull request?

This PR proposes:
- Change http to https for better security
- Change http://apache-spark-developers-list.1001551.n3.nabble.com/ to official mailing list link (https://mail-archives.apache.org/mod_mbox/spark-dev/)

### Why are the changes needed?

For better security, and to use official link.

### Does this PR introduce _any_ user-facing change?

Yes, It exposes more secure and correct links to the PySpark end users in PySpark documentation.

### How was this patch tested?

I manually checked if each link works

Closes #31616 from HyukjinKwon/minor-https.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-23 11:18:47 +09:00
“attilapiros” bdcad33d8b [SPARK-34433][DOCS] Lock Jekyll version by Gemfile and Bundler
### What changes were proposed in this pull request?

Improving the documentation and release process by pinning Jekyll version by Gemfile and Bundler.

Some files and their responsibilities within this PR:
- `docs/.bundle/config` is used to specify a directory "docs/.local_ruby_bundle" which will be used as destination to install the ruby packages into instead of the global one which requires root access
- `docs/Gemfile` is specifying the required Jekyll version and other top level gem versions
- `docs/Gemfile.lock` is generated by the "bundle install". This file contains the exact resolved versions of all the gems including the top level gems and all the direct and transitive dependencies of those gems. When this file is generated it contains a platform related section "PLATFORMS" (in my case after the generation it was "universal-darwin-19"). Still this file must be under version control as when the version of a gem does not fit to the one specified in `Gemfile` an error comes (i.e. if the `Gemfile.lock` was generated for Jekyll 4.1.0 and its version is updated in the `Gemfile` to 4.2.0 then it triggers the error: "The bundle currently has jekyll locked at 4.1.0."). This is solution is also suggested officially in [its documentation](https://bundler.io/rationale.html#checking-your-code-into-version-control). To get rid of the specific platform (like "universal-darwin-19") first we have to add "ruby" as platform [which means this should work on every platform where Ruby runs](https://guides.rubygems.org/what-is-a-gem/)) by running "bundle lock --add-platform ruby" then the specific platform can be removed by "bundle lock --remove-platform universal-darwin-19".

After this the correct process to update Jekyll version is the following:
1. update the version in `Gemfile`
2. run "bundle update" which updates the `Gemfile.lock`
3. commit both files

This process for version update is tested for details please check the testing section.

### Why are the changes needed?

Using different Jekyll versions can generate different output documents.
This PR standardize the process.

### Does this PR introduce _any_ user-facing change?

No, assuming the release was done via docker by using `do-release-docker.sh`.
In that case  there should be no difference at all as the same Jekyll version is specified in the Gemfile.

### How was this patch tested?

#### Testing document generation

Doc generation step was triggered via  the docker release:

```
$ ./do-release-docker.sh -d ~/working -n -s docs
...
========================
= Building documentation...
Command: /opt/spark-rm/release-build.sh docs
Log file: docs.log
Skipping publish step.
```

The docs.log contains the followings:
```
Building Spark docs
Fetching gem metadata from https://rubygems.org/.........
Using bundler 2.2.9
Fetching rb-fsevent 0.10.4
Fetching forwardable-extended 2.6.0
Fetching public_suffix 4.0.6
Fetching colorator 1.1.0
Fetching eventmachine 1.2.7
Fetching http_parser.rb 0.6.0
Fetching ffi 1.14.2
Fetching concurrent-ruby 1.1.8
Installing colorator 1.1.0
Installing forwardable-extended 2.6.0
Installing rb-fsevent 0.10.4
Installing public_suffix 4.0.6
Installing http_parser.rb 0.6.0 with native extensions
Installing eventmachine 1.2.7 with native extensions
Installing concurrent-ruby 1.1.8
Fetching rexml 3.2.4
Fetching liquid 4.0.3
Installing ffi 1.14.2 with native extensions
Installing rexml 3.2.4
Installing liquid 4.0.3
Fetching mercenary 0.4.0
Installing mercenary 0.4.0
Fetching rouge 3.26.0
Installing rouge 3.26.0
Fetching safe_yaml 1.0.5
Installing safe_yaml 1.0.5
Fetching unicode-display_width 1.7.0
Installing unicode-display_width 1.7.0
Fetching webrick 1.7.0
Installing webrick 1.7.0
Fetching pathutil 0.16.2
Fetching kramdown 2.3.0
Fetching terminal-table 2.0.0
Fetching addressable 2.7.0
Fetching i18n 1.8.9
Installing terminal-table 2.0.0
Installing pathutil 0.16.2
Installing i18n 1.8.9
Installing addressable 2.7.0
Installing kramdown 2.3.0
Fetching kramdown-parser-gfm 1.1.0
Installing kramdown-parser-gfm 1.1.0
Fetching rb-inotify 0.10.1
Fetching sassc 2.4.0
Fetching em-websocket 0.5.2
Installing rb-inotify 0.10.1
Installing em-websocket 0.5.2
Installing sassc 2.4.0 with native extensions
Fetching listen 3.4.1
Installing listen 3.4.1
Fetching jekyll-watch 2.2.1
Installing jekyll-watch 2.2.1
Fetching jekyll-sass-converter 2.1.0
Installing jekyll-sass-converter 2.1.0
Fetching jekyll 4.2.0
Installing jekyll 4.2.0
Fetching jekyll-redirect-from 0.16.0
Installing jekyll-redirect-from 0.16.0
Bundle complete! 4 Gemfile dependencies, 30 gems now installed.
Bundled gems are installed into `./.local_ruby_bundle`
```

#### Testing Jekyll (or other gem) update

First locally I reverted Jekyll to 4.1.0:
```
$ rm Gemfile.lock
$ rm -rf .local_ruby_bundle

# edited Gemfile to use version 4.1.0
$ cat Gemfile
source "https://rubygems.org"

gem "jekyll", "4.1.0"
gem "rouge", "3.26.0"
gem "jekyll-redirect-from", "0.16.0"
gem "webrick", "1.7"
$ bundle install
...
```

Testing Jekyll version before the update:

```
$ bundle exec jekyll --version
jekyll 4.1.0
```

Imitating Jekyll update coming from git by reverting my local changes:

```
$ git checkout Gemfile
Updated 1 path from the index
$ cat Gemfile
source "https://rubygems.org"

gem "jekyll", "4.2.0"
gem "rouge", "3.26.0"
gem "jekyll-redirect-from", "0.16.0"
gem "webrick", "1.7"

$ git checkout Gemfile.lock
Updated 1 path from the index
```

Run the install:

```
$ bundle install
...
```

Checking the updated Jekyll version:
```
$ bundle exec jekyll --version
jekyll 4.2.0
```

Closes #31559 from attilapiros/pin-jekyll-version.

Lead-authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Co-authored-by: Attila Zsolt Piros <2017933+attilapiros@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-18 12:17:57 +09:00
Eric Lemmon e3b6e4ad43 [SPARK-33434][PYTHON][DOCS] Added RuntimeConfig to PySpark docs
### What changes were proposed in this pull request?
Documentation for `SparkSession.conf.isModifiable` is missing from the Python API site, so we added a Configuration section to the Spark SQL page to expose docs for the `RuntimeConfig` class (the class containing `isModifiable`). Then a `:class:` reference to `RuntimeConfig` was added to the `SparkSession.conf` docstring to create a link there as well.

### Why are the changes needed?
No docs were generated for `pyspark.sql.conf.RuntimeConfig`.

### Does this PR introduce _any_ user-facing change?
Yes--a new Configuration section to the Spark SQL page and a `Returns` section of the `SparkSession.conf` docstring, so this will now show a link to the `pyspark.sql.conf.RuntimeConfig` page. This is a change compared to both the released Spark version and the unreleased master branch.

### How was this patch tested?
First built the Python docs:
```bash
cd $SPARK_HOME/docs
SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve
```
Then verified all pages and links:
1. Configuration link displayed on the API Reference page, and it clicks through to Spark SQL page:
http://localhost:4000/api/python/reference/index.html
![image](https://user-images.githubusercontent.com/1160861/107601918-a2f02380-6bed-11eb-9b8f-974a0681a2a9.png)

2. Configuration section displayed on the Spark SQL page, and the RuntimeConfig link clicks through to the RuntimeConfig page:
http://localhost:4000/api/python/reference/pyspark.sql.html#configuration
![image](https://user-images.githubusercontent.com/1160861/107602058-0d08c880-6bee-11eb-8cbb-ad8c47588085.png)**

3. RuntimeConfig page displayed:
http://localhost:4000/api/python/reference/api/pyspark.sql.conf.RuntimeConfig.html
![image](https://user-images.githubusercontent.com/1160861/107602278-94eed280-6bee-11eb-95fc-445ea62ac1a4.png)

4. SparkSession.conf page displays the RuntimeConfig link, and it navigates to the RuntimeConfig page:
http://localhost:4000/api/python/reference/api/pyspark.sql.SparkSession.conf.html
![image](https://user-images.githubusercontent.com/1160861/107602435-1f373680-6bef-11eb-985a-b72432464940.png)

Closes #31483 from Eric-Lemmon/SPARK-33434-document-isModifiable.

Authored-by: Eric Lemmon <eric@lemmon.cc>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-02-13 09:32:55 -06:00
HyukjinKwon 30468a9015 [SPARK-34306][SQL][PYTHON][R] Use Snake naming rule across the function APIs
### What changes were proposed in this pull request?

This PR completes snake_case rule at functions APIs across the languages, see also SPARK-10621.

In more details, this PR:
- Adds `count_distinct` in Scala Python, and R, and document that `count_distinct` is encouraged. This was not deprecated because `countDistinct` is pretty commonly used. We could deprecate in the future releases.
- (Scala-specific) adds `typedlit` but doesn't deprecate `typedLit` which is arguably commonly used. Likewise, we could deprecate in the future releases.
- Deprecates and renames:
  - `sumDistinct` -> `sum_distinct`
  - `bitwiseNOT` -> `bitwise_not`
  - `shiftLeft` -> `shiftleft` (matched with SQL name in `FunctionRegistry`)
  - `shiftRight` -> `shiftright` (matched with SQL name in `FunctionRegistry`)
  - `shiftRightUnsigned` -> `shiftrightunsigned` (matched with SQL name in `FunctionRegistry`)
  - (Scala-specific) `callUDF` -> `call_udf`

### Why are the changes needed?

To keep the consistent naming in APIs.

### Does this PR introduce _any_ user-facing change?

Yes, it deprecates some APIs and add new renamed APIs as described above.

### How was this patch tested?

Unittests were added.

Closes #31408 from HyukjinKwon/SPARK-34306.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-02 09:29:40 +09:00
itholic 28131a7794 [SPARK-34190][DOCS] Supplement the description for Python Package Management
### What changes were proposed in this pull request?

This PR supplements the contents in the "Python Package Management".

If there is no Python installed in the local for all nodes when using `venv-pack`, job would fail as below.

```python
>>> from pyspark.sql.functions import pandas_udf
>>> pandas_udf('double')
... def pandas_plus_one(v: pd.Series) -> pd.Series:
...     return v + 1
...
>>> spark.range(10).select(pandas_plus_one("id")).show()
...
Cannot run program "./environment/bin/python": error=2, No such file or directory
...
```

This is because the Python in the [packed environment via `venv-pack` has a symbolic link](https://github.com/jcrist/venv-pack/issues/5) that connects Python to the local one.

To avoid this confusion, it seems better to have an additional explanation for this.

### Why are the changes needed?

To provide more detailed information to users so that they don’t get confused

### Does this PR introduce _any_ user-facing change?

Yes, this PR fixes the part of "Python Package Management"  in the "User Guide" documents.

### How was this patch tested?

Manually built the doc.

![Screen Shot 2021-01-21 at 7 10 38 PM](https://user-images.githubusercontent.com/44108233/105336258-5e8bec00-5c1c-11eb-870c-86acfc77c082.png)

Closes #31280 from itholic/SPARK-34190.

Authored-by: itholic <haejoon309@naver.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-21 22:15:42 +09:00
Huaxin Gao f3548837c6 [SPARK-34080][ML][PYTHON] Add UnivariateFeatureSelector
### What changes were proposed in this pull request?
Add UnivariateFeatureSelector

### Why are the changes needed?
Have one UnivariateFeatureSelector, so we don't need to have three Feature Selectors.

### Does this PR introduce _any_ user-facing change?
Yes
```
selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], labelCol=["target"], featureType="categorical", labelType="continuous", selectorType="numTopFeatures",  numTopFeatures=100)
```

Or

numTopFeatures
```
selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], labelCol=["target"], scoreFunction="f_classif", selectorType="numTopFeatures",  numTopFeatures=100)
```

### How was this patch tested?
Add Unit test

Closes #31160 from huaxingao/UnivariateSelector.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
2021-01-16 11:09:23 +08:00
HyukjinKwon aa388cf3d0 [SPARK-34041][PYTHON][DOCS] Miscellaneous cleanup for new PySpark documentation
### What changes were proposed in this pull request?

This PR proposes to:
- Add a link of quick start in PySpark docs into "Programming Guides" in Spark main docs
- `ML` / `MLlib` -> `MLlib (DataFrame-based)` / `MLlib (RDD-based)` in API reference page
- Mention other user guides as well because the guide such as [ML](http://spark.apache.org/docs/latest/ml-guide.html) and [SQL](http://spark.apache.org/docs/latest/sql-programming-guide.html).
- Mention other migration guides as well because PySpark can get affected by it.

### Why are the changes needed?

For better documentation.

### Does this PR introduce _any_ user-facing change?

It fixes user-facing docs. However, it's not released out yet.

### How was this patch tested?

Manually tested by running:

```bash
cd docs
SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve --watch
```

Closes #31082 from HyukjinKwon/SPARK-34041.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-08 09:28:31 +09:00
HyukjinKwon 329850c667 [SPARK-32017][PYTHON][FOLLOW-UP] Rename HADOOP_VERSION to PYSPARK_HADOOP_VERSION in pip installation option
### What changes were proposed in this pull request?

This PR is a followup of https://github.com/apache/spark/pull/29703.
It renames `HADOOP_VERSION` environment variable to `PYSPARK_HADOOP_VERSION` in case `HADOOP_VERSION` is already being used somewhere. Arguably `HADOOP_VERSION` is a pretty common name. I see here and there:
- https://www.ibm.com/support/knowledgecenter/SSZUMP_7.2.1/install_grid_sym/understanding_advanced_edition.html
- https://cwiki.apache.org/confluence/display/ARROW/HDFS+Filesystem+Support
- http://crs4.github.io/pydoop/_pydoop1/installation.html

### Why are the changes needed?

To avoid the environment variables is unexpectedly conflicted.

### Does this PR introduce _any_ user-facing change?

It renames the environment variable but it's not released yet.

### How was this patch tested?

Existing unittests will test.

Closes #31028 from HyukjinKwon/SPARK-32017-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-05 17:21:32 +09:00
HyukjinKwon 6b86aa0b52 [SPARK-33984][PYTHON] Upgrade to Py4J 0.10.9.1
### What changes were proposed in this pull request?

This PR upgrade Py4J from 0.10.9 to 0.10.9.1 that contains some bug fixes and improvements.
It contains one bug fix (4152353ac1).

### Why are the changes needed?

To leverage fixes from the upstream in Py4J.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Jenkins build and GitHub Actions will test it out.

Closes #31009 from HyukjinKwon/SPARK-33984.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-04 10:23:38 -08:00
Gabor Somogyi 678294ddc2 [SPARK-33824][PYTHON][DOCS][FOLLOW-UP] Clarify about PYSPARK_DRIVER_PYTHON and spark.yarn.appMasterEnv.PYSPARK_PYTHON
### What changes were proposed in this pull request?

This PR proposes to clarify:
- `PYSPARK_DRIVER_PYTHON` should not be set for cluster modes in YARN and Kubernates.
- `spark.yarn.appMasterEnv.PYSPARK_PYTHON` is not required in YARN. This is just another way to set `PYSPARK_PYTHON` that is specific for a Spark application.

### Why are the changes needed?

To clarify what's required and not.

### Does this PR introduce _any_ user-facing change?

Yes, this is a user-facing doc change.

### How was this patch tested?

Manually tested.

Note that this credits to gaborgsomogyi who actually tested and raised a doubt about this offline to me.
I also manually tested all again to double check.

Closes #30938 from HyukjinKwon/SPARK-33824-followup.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-28 09:52:42 +09:00
HyukjinKwon 6315118676 [SPARK-33824][PYTHON][DOCS] Restructure and improve Python package management page
### What changes were proposed in this pull request?

This PR proposes to restructure and refine the Python dependency management page.
I lately wrote a blog post which will be published soon, and decided contribute some of the contents back to PySpark documentation.
FWIW, it has been reviewed by some tech writers and engineers.

I built the site for making the review easier: https://hyukjin-spark.readthedocs.io/en/stable/user_guide/python_packaging.html

### Why are the changes needed?

For better documentation.

### Does this PR introduce _any_ user-facing change?

It's doc change but only in unreleased bracnhs for now.

### How was this patch tested?

I manually built the docs as:

```bash
cd python/docs
make clean html
open
```

Closes #30822 from HyukjinKwon/SPARK-33824.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-18 10:03:07 +09:00
HyukjinKwon 5250841537
[SPARK-33256][PYTHON][DOCS] Clarify PySpark follows NumPy documentation style
### What changes were proposed in this pull request?

This PR adds few lines about docstring style to document that PySpark follows [NumPy documentation style](https://numpydoc.readthedocs.io/en/latest/format.html). We all completed the migration to NumPy documentation style at SPARK-32085.

Ideally we should have a page like https://pandas.pydata.org/docs/development/contributing_docstring.html but I would like to leave it as a future work.

### Why are the changes needed?

To tell developers that PySpark now follows NumPy documentation style.

### Does this PR introduce _any_ user-facing change?

No, it's a change in unreleased branches yet.

### How was this patch tested?

Manually tested via `make clean html` under `python/docs`:

![Screen Shot 2020-12-06 at 1 34 50 PM](https://user-images.githubusercontent.com/6477701/101271623-d5ce0380-37c7-11eb-93ac-da73caa50c37.png)

Closes #30622 from HyukjinKwon/SPARK-33256.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-06 01:22:24 -08:00
HyukjinKwon 1a042cc414 [SPARK-33530][CORE] Support --archives and spark.archives option natively
### What changes were proposed in this pull request?

TL;DR:
- This PR completes the support of archives in Spark itself instead of Yarn-only
  - It makes `--archives` option work in other cluster modes too and adds `spark.archives` configuration.
-  After this PR, PySpark users can leverage Conda to ship Python packages together as below:
    ```python
    conda create -y -n pyspark_env -c conda-forge pyarrow==2.0.0 pandas==1.1.4 conda-pack==0.5.0
    conda activate pyspark_env
    conda pack -f -o pyspark_env.tar.gz
    PYSPARK_DRIVER_PYTHON=python PYSPARK_PYTHON=./environment/bin/python pyspark --archives pyspark_env.tar.gz#environment
   ```
- Issue a warning that undocumented and hidden behavior of partial archive handling in `spark.files` / `SparkContext.addFile` will be deprecated, and users can use `spark.archives` and `SparkContext.addArchive`.

This PR proposes to add Spark's native `--archives` in Spark submit, and `spark.archives` configuration. Currently, both are supported only in Yarn mode:

```bash
./bin/spark-submit --help
```

```
Options:
...
 Spark on YARN only:
  --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
  --archives ARCHIVES         Comma separated list of archives to be extracted into the
                              working directory of each executor.
```

This `archives` feature is useful often when you have to ship a directory and unpack into executors. One example is native libraries to use e.g. JNI. Another example is to ship Python packages together by Conda environment.

Especially for Conda, PySpark currently does not have a nice way to ship a package that works in general, please see also https://hyukjin-spark.readthedocs.io/en/stable/user_guide/python_packaging.html#using-zipped-virtual-environment (PySpark new documentation demo for 3.1.0).

The neatest way is arguably to use Conda environment by shipping zipped Conda environment but this is currently dependent on this archive feature. NOTE that we are able to use `spark.files` by relying on its undocumented behaviour that untars `tar.gz` but I don't think we should document such ways and promote people to more rely on it.

Also, note that this PR does not target to add the feature parity of `spark.files.overwrite`, `spark.files.useFetchCache`, etc. yet. I documented that this is an experimental feature as well.

### Why are the changes needed?

To complete the feature parity, and to provide a better support of shipping Python libraries together with Conda env.

### Does this PR introduce _any_ user-facing change?

Yes, this makes `--archives` works in Spark instead of Yarn-only, and adds a new configuration `spark.archives`.

### How was this patch tested?

I added unittests. Also, manually tested in standalone cluster, local-cluster, and local modes.

Closes #30486 from HyukjinKwon/native-archive.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-01 13:43:02 +09:00
Weichen Xu 596fbc1d29 [SPARK-33556][ML] Add array_to_vector function for dataframe column
### What changes were proposed in this pull request?

Add array_to_vector function for dataframe column

### Why are the changes needed?
Utility function for array to vector conversion.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
scala unit test & doctest.

Closes #30498 from WeichenXu123/array_to_vec.

Lead-authored-by: Weichen Xu <weichen.xu@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-01 09:52:19 +09:00
Josh Soref 13fd272cd3 Spelling r common dev mlib external project streaming resource managers python
### What changes were proposed in this pull request?

This PR intends to fix typos in the sub-modules:
* `R`
* `common`
* `dev`
* `mlib`
* `external`
* `project`
* `streaming`
* `resource-managers`
* `python`

Split per srowen https://github.com/apache/spark/pull/30323#issuecomment-728981618

NOTE: The misspellings have been reported at 706a726f87 (commitcomment-44064356)

### Why are the changes needed?

Misspelled words make it harder to read / understand content.

### Does this PR introduce _any_ user-facing change?

There are various fixes to documentation, etc...

### How was this patch tested?

No testing was performed

Closes #30402 from jsoref/spelling-R_common_dev_mlib_external_project_streaming_resource-managers_python.

Authored-by: Josh Soref <jsoref@users.noreply.github.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-11-27 10:22:45 -06:00
zero323 d082ad0abf [SPARK-33563][PYTHON][R][SQL] Expose inverse hyperbolic trig functions in PySpark and SparkR
### What changes were proposed in this pull request?

This PR adds the following functions (introduced in Scala API with SPARK-33061):

- `acosh`
- `asinh`
- `atanh`

to Python and R.

### Why are the changes needed?

Feature parity.

### Does this PR introduce _any_ user-facing change?

New functions.

### How was this patch tested?

New unit tests.

Closes #30501 from zero323/SPARK-33563.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-27 11:00:09 +09:00
zero323 01321bc0fe [SPARK-33252][PYTHON][DOCS] Migration to NumPy documentation style in MLlib (pyspark.mllib.*)
### What changes were proposed in this pull request?

This PR proposes migration of `pyspark.mllib` to NumPy documentation style.

### Why are the changes needed?

To improve documentation style.

Before:

![old](https://user-images.githubusercontent.com/1554276/100097941-90234980-2e5d-11eb-8b4d-c25d98d85191.png)

After:

![new](https://user-images.githubusercontent.com/1554276/100097966-987b8480-2e5d-11eb-9e02-07b18c327624.png)

### Does this PR introduce _any_ user-facing change?

Yes, this changes both rendered HTML docs and console representation (SPARK-33243).

### How was this patch tested?

`dev/lint-python` and manual inspection.

Closes #30413 from zero323/SPARK-33252.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-25 10:24:41 +09:00
HyukjinKwon 02d410a18c [MINOR][DOCS] Document 'without' value for HADOOP_VERSION in pip installation
### What changes were proposed in this pull request?

I believe it's self-descriptive.

### Why are the changes needed?

To document supported features.

### Does this PR introduce _any_ user-facing change?

Yes, the docs are updated. It's master only.

### How was this patch tested?

Manually built the docs via `cd python/docs` and `make clean html`:

![Screen Shot 2020-11-20 at 10 59 07 AM](https://user-images.githubusercontent.com/6477701/99748225-7ad9b280-2b1f-11eb-86fd-165012b1bb7c.png)

Closes #30436 from HyukjinKwon/minor-doc-fix.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-20 13:14:20 +09:00
Bryan Cutler 8e2a0bdce7 [SPARK-24554][PYTHON][SQL] Add MapType support for PySpark with Arrow
### What changes were proposed in this pull request?

This change adds MapType support for PySpark with Arrow, if using pyarrow >= 2.0.0.

### Why are the changes needed?

MapType was previous unsupported with Arrow.

### Does this PR introduce _any_ user-facing change?

User can now enable MapType for `createDataFrame()`, `toPandas()` with Arrow optimization, and with Pandas UDFs.

### How was this patch tested?

Added new PySpark tests for createDataFrame(), toPandas() and Scalar Pandas UDFs.

Closes #30393 from BryanCutler/arrow-add-MapType-SPARK-24554.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-18 21:18:19 +09:00
zero323 52073ef8ac [SPARK-33254][PYTHON][DOCS] Migration to NumPy documentation style in Core (pyspark.*, pyspark.resource.*, etc.)
### What changes were proposed in this pull request?

This PR proposes migration of Core to NumPy documentation style.

### Why are the changes needed?

To improve documentation style.

### Does this PR introduce _any_ user-facing change?

Yes, this changes both rendered HTML docs and console representation (SPARK-33243).

### How was this patch tested?

dev/lint-python and manual inspection.

Closes #30320 from zero323/SPARK-33254.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-16 10:21:50 +09:00
HyukjinKwon 9818f079aa [SPARK-33243][PYTHON][BUILD] Add numpydoc into documentation dependency
### What changes were proposed in this pull request?

This PR proposes to initiate the migration to NumPy documentation style (from reST style) in PySpark docstrings.
This PR also adds one migration example of `SparkContext`.

- **Before:**
    ...
    ![Screen Shot 2020-10-26 at 7 02 05 PM](https://user-images.githubusercontent.com/6477701/97161090-a8ea0200-17c0-11eb-8204-0e70d18fc571.png)
    ...
    ![Screen Shot 2020-10-26 at 7 02 09 PM](https://user-images.githubusercontent.com/6477701/97161100-aab3c580-17c0-11eb-92ad-f5ad4441ce16.png)
    ...

- **After:**

    ...
    ![Screen Shot 2020-10-26 at 7 24 08 PM](https://user-images.githubusercontent.com/6477701/97161219-d636b000-17c0-11eb-80ab-d17a570ecb4b.png)
    ...

See also https://numpydoc.readthedocs.io/en/latest/format.html

### Why are the changes needed?

There are many reasons for switching to NumPy documentation style.

1. Arguably reST style doesn't fit well when the docstring grows large because it provides (arguably) less structures and syntax.

2. NumPy documentation style provides a better human readable docstring format. For example, notebook users often just do `help(...)` by `pydoc`.

3. NumPy documentation style is pretty commonly used in data science libraries, for example, pandas, numpy, Dask, Koalas,
matplotlib, ... Using NumPy documentation style can give users a consistent documentation style.

### Does this PR introduce _any_ user-facing change?

The dependency itself doesn't change anything user-facing.
The documentation change in `SparkContext` does, as shown above.

### How was this patch tested?

Manually tested via running `cd python` and `make clean html`.

Closes #30149 from HyukjinKwon/SPARK-33243.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-27 14:03:57 +09:00
HyukjinKwon 7cdc921bc0 [SPARK-32188][PYTHON][DOCS][FOLLOW-UP] Document Column APIs in API reference
### What changes were proposed in this pull request?

This PR proposes to document the APIs in `Column` as well in API reference of PySpark documentation.

### Why are the changes needed?

To document common APIs in PySpark.

### Does this PR introduce _any_ user-facing change?

Yes, `Column.*` will be shown in API reference page.

### How was this patch tested?

Manually tested via `cd python` and `make clean html`.

Closes #30150 from HyukjinKwon/SPARK-32188.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-27 09:52:09 +09:00
zero323 d7f15b025b [SPARK-33003][PYTHON][DOCS] Add type hints guidelines to the documentation
### What changes were proposed in this pull request?

Add type hints guidelines to developer docs.

### Why are the changes needed?

Since it is a new and still somewhat evolving feature, we should provided clear guidelines for potential contributors.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Closes #30094 from zero323/SPARK-33003.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-24 10:00:04 +09:00