### What changes were proposed in this pull request?
This PR explicitly documents the requirement for Iterator of Series to Iterator of Series and Iterator of Multiple Series to Iterator of Series pandas UDFs (previously Scalar Iterator pandas UDF).
The actual restriction of this UDF is that the _entire input and output_ must have the same length, rather than each individual series. For example, the following is allowed:
```python
from typing import Iterator, Tuple

import pandas as pd
from pyspark.sql.functions import pandas_udf


@pandas_udf("long")
def func(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Concatenate all input batches into a single output series.
    return iter([pd.concat(iterator)])


# Assumes an active SparkSession named `spark`.
spark.range(100).select(func("id")).show()
```
This characteristic allows you to prefetch data from the iterator to speed up processing, compared to the regular Scalar to Scalar (previously Scalar pandas UDF).
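As a rough illustration of the prefetching idea, the sketch below reads ahead from any iterator on a background thread. The `prefetch` helper and its `buffer_size` parameter are hypothetical names introduced here, not part of PySpark, and whether this actually helps depends on the workload.

```python
import queue
import threading
from typing import Iterator, TypeVar

T = TypeVar("T")
_DONE = object()  # sentinel marking the end of the input


def prefetch(iterator: Iterator[T], buffer_size: int = 2) -> Iterator[T]:
    """Yield items from `iterator`, reading ahead on a background thread.

    Hypothetical helper for illustration only. Errors raised by the
    source iterator are not re-raised in the consumer; the output simply
    ends early. If the consumer stops early, the producer thread stays
    blocked until the process exits.
    """
    buf: queue.Queue = queue.Queue(maxsize=buffer_size)

    def producer() -> None:
        try:
            for item in iterator:
                buf.put(item)  # blocks while the buffer is full
        finally:
            buf.put(_DONE)  # always signal completion to the consumer

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()
        if item is _DONE:
            return
        yield item
```

Inside an iterator-style UDF, one might loop over `prefetch(iterator)` instead of `iterator` so that the next batch is fetched while the current one is being processed. The wrapper yields the same items in the same order, so the total output length is unchanged.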
### Why are the changes needed?
To document the correct restriction and characteristics of a feature.
### Does this PR introduce any user-facing change?
Yes, in the documentation, but only in unreleased branches.
### How was this patch tested?
GitHub Actions should test the documentation build.
Closes #28160 from HyukjinKwon/SPARK-30722-followup.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR fixes typos and awkward phrases in the `/docs` directory. To find them, I ran the IntelliJ typo checker.
### Why are the changes needed?
For better documents.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
N/A
Closes #27819 from maropu/TypoFix-20200306.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### What changes were proposed in this pull request?
Add doc for recommended pandas and pyarrow versions.
### Why are the changes needed?
The recommended versions are those that have been thoroughly tested by Spark CI. Other versions may be used at the discretion of the user.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
NA
Closes #27587 from BryanCutler/python-doc-rec-pandas-pyarrow-SPARK-30834-3.0.
Lead-authored-by: Bryan Cutler <cutlerb@gmail.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR documents the pandas UDF redesign with type hints introduced in SPARK-28264.
It is mostly self-describing; however, there are a few things to note for reviewers.
1. This PR replaces the existing documentation of pandas UDFs with the newer redesign to promote Python type hints. I added a note that Spark 3.0 still keeps compatibility, though.
2. This PR proposes to name non-pandas UDFs as "Pandas Function API"
3. SCALAR_ITER becomes two separate sections to reduce confusion:
- `Iterator[pd.Series]` -> `Iterator[pd.Series]`
- `Iterator[Tuple[pd.Series, ...]]` -> `Iterator[pd.Series]`
4. I removed some examples that looked like overkill to me.
5. I also removed some information in the doc that seemed duplicated or excessive.
### Why are the changes needed?
To document the new redesign of pandas UDFs.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests should cover this.
Closes #27466 from HyukjinKwon/SPARK-30722.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Fixed typos in the `docs` directory and in other directories:
1. Find typos in `docs` and apply the fixes to files in all directories
2. Fix `the the` -> `the`
### Why are the changes needed?
Better readability of documents
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
No test needed
Closes #26976 from kiszk/typo_20191221.
Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
This PR adds some extra documentation for the new Cogrouped map Pandas udfs. Specifically:
- Updated the usage guide for the new `COGROUPED_MAP` Pandas udfs added in https://github.com/apache/spark/pull/24981
- Updated the docstring for pandas_udf to include the COGROUPED_MAP type as suggested by HyukjinKwon in https://github.com/apache/spark/pull/25939
Closes #26110 from d80tb7/SPARK-29126-cogroup-udf-usage-guide.
Authored-by: Chris Martin <chris@cmartinit.co.uk>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add documentation to the SQL programming guide on using PyArrow >= 0.15.0 with current versions of Spark.
### Why are the changes needed?
Arrow 0.15.0 introduced a change in format which requires an environment variable to maintain compatibility.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Ran pandas_udfs tests using PyArrow 0.15.0 with environment variable set.
Closes #26045 from BryanCutler/arrow-document-legacy-IPC-fix-SPARK-29367.
Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Add docs for `SCALAR_ITER` Pandas UDF.
cc: WeichenXu123 HyukjinKwon
## How was this patch tested?
Tested example code manually.
Closes #24897 from mengxr/SPARK-28056.
Authored-by: Xiangrui Meng <meng@databricks.com>
Signed-off-by: Xiangrui Meng <meng@databricks.com>
## What changes were proposed in this pull request?
`spark.sql.execution.arrow.enabled` was added when we added the PySpark Arrow optimization.
Later, in the current master, the SparkR Arrow optimization was added, and it is controlled by the same configuration, `spark.sql.execution.arrow.enabled`.
There appear to be two issues with this:
1. `spark.sql.execution.arrow.enabled` in PySpark was added in 2.3.0, whereas the SparkR optimization was added in 3.0.0. Their stability differs, so it is problematic if we want to change the default value for only one of the two optimizations first.
2. Suppose users want to share a JVM between PySpark and SparkR. They are currently forced to use the optimization for both or neither if the configuration is set globally.
This PR proposes two separate configuration groups for the Arrow optimization in PySpark and SparkR:
- Deprecate `spark.sql.execution.arrow.enabled`
- Add `spark.sql.execution.arrow.pyspark.enabled` (fallback to `spark.sql.execution.arrow.enabled`)
- Add `spark.sql.execution.arrow.sparkr.enabled`
- Deprecate `spark.sql.execution.arrow.fallback.enabled`
- Add `spark.sql.execution.arrow.pyspark.fallback.enabled` (fallback to `spark.sql.execution.arrow.fallback.enabled`)
Note that `spark.sql.execution.arrow.maxRecordsPerBatch` is used on the JVM side for both.
Note that `spark.sql.execution.arrow.fallback.enabled` was added due to a behaviour change. We don't need it in SparkR, since the SparkR side has automatic fallback.
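The fallback behaviour described above can be sketched in plain Python as follows. This is only an illustration of the lookup order, not Spark's actual implementation: the `arrow_enabled` helper is a hypothetical name, and the default of `False` when no key is set is an assumption.

```python
from typing import Mapping, Optional

# Configuration keys named in this PR.
PYSPARK_KEY = "spark.sql.execution.arrow.pyspark.enabled"
SPARKR_KEY = "spark.sql.execution.arrow.sparkr.enabled"
LEGACY_KEY = "spark.sql.execution.arrow.enabled"  # deprecated


def arrow_enabled(
    conf: Mapping[str, str], key: str, fallback: Optional[str] = None
) -> bool:
    """Resolve a boolean setting, consulting `fallback` if `key` is unset.

    Hypothetical helper: it mimics the lookup order only.
    """
    if key in conf:
        return conf[key].lower() == "true"
    if fallback is not None and fallback in conf:
        return conf[fallback].lower() == "true"
    return False  # assumed default when nothing is set


# Only the deprecated key is set: PySpark picks it up, SparkR does not.
conf = {LEGACY_KEY: "true"}
pyspark_on = arrow_enabled(conf, PYSPARK_KEY, fallback=LEGACY_KEY)
sparkr_on = arrow_enabled(conf, SPARKR_KEY)
```

With this lookup order, an explicit value for the PySpark-specific key always wins over the deprecated key, which is what lets the two optimizations be toggled independently.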
## How was this patch tested?
Manually tested, and some unit tests were added.
Closes #24700 from HyukjinKwon/separate-sparkr-arrow.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
It looks like updating the documentation from 0.8.0 to 0.12.1 was missed.
## How was this patch tested?
N/A
Closes #24504 from HyukjinKwon/SPARK-27276-followup.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
## What changes were proposed in this pull request?
Add AL2 license to metadata of all .md files.
This seemed to be the tidiest way as it will get ignored by .md renderers and other tools. Attempts to write them as markdown comments revealed that there is no such standard thing.
## How was this patch tested?
Doc build
Closes #24243 from srowen/SPARK-26918.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Fix Typos.
## How was this patch tested?
NA
Closes #23145 from kjmrknsn/docUpdate.
Authored-by: Keiji Yoshida <kjmrknsn@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
This PR documents the binary type in "Apache Arrow in Spark".
## How was this patch tested?
Manually built the documentation and checked.
Closes #22871 from HyukjinKwon/SPARK-25179.
Authored-by: hyukjinkwon <gurwls223@apache.org>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
1. Split the main page of sql-programming-guide into 7 parts:
- Getting Started
- Data Sources
- Performance Tuning
- Distributed SQL Engine
- PySpark Usage Guide for Pandas with Apache Arrow
- Migration Guide
- Reference
2. Add a left menu for sql-programming-guide, keeping the first-level index for each part in the menu.
![image](https://user-images.githubusercontent.com/4833765/47016859-6332e180-d183-11e8-92e8-ce62518a83c4.png)
## How was this patch tested?
Local test with jekyll build/serve.
Closes #22746 from xuanyuanking/SPARK-24499.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>