[SPARK-29367][DOC] Add compatibility note for Arrow 0.15.0 to SQL guide
### What changes were proposed in this pull request? Add documentation to SQL programming guide to use PyArrow >= 0.15.0 with current versions of Spark. ### Why are the changes needed? Arrow 0.15.0 introduced a change in format which requires an environment variable to maintain compatibility. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Ran pandas_udfs tests using PyArrow 0.15.0 with environment variable set. Closes #26045 from BryanCutler/arrow-document-legacy-IPC-fix-SPARK-29367. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>
This commit is contained in:
parent
2b3c3793c9
commit
6390f02f9f
|
@ -219,3 +219,20 @@ Note that a standard UDF (non-Pandas) will load timestamp data as Python datetim
|
|||
different than a Pandas timestamp. It is recommended to use Pandas time series functionality when
|
||||
working with timestamps in `pandas_udf`s to get the best performance, see
|
||||
[here](https://pandas.pydata.org/pandas-docs/stable/timeseries.html) for details.
|
||||
|
||||
### Compatibiliy Setting for PyArrow >= 0.15.0 and Spark 2.3.x, 2.4.x
|
||||
|
||||
Since Arrow 0.15.0, a change in the binary IPC format requires an environment variable to be
|
||||
compatible with previous versions of Arrow <= 0.14.1. This is only necessary to do for PySpark
|
||||
users with versions 2.3.x and 2.4.x that have manually upgraded PyArrow to 0.15.0. The following
|
||||
can be added to `conf/spark-env.sh` to use the legacy Arrow IPC format:
|
||||
|
||||
```
|
||||
ARROW_PRE_0_15_IPC_FORMAT=1
|
||||
```
|
||||
|
||||
This will instruct PyArrow >= 0.15.0 to use the legacy IPC format with the older Arrow Java that
|
||||
is in Spark 2.3.x and 2.4.x. Not setting this environment variable will lead to a similar error as
|
||||
described in [SPARK-29367](https://issues.apache.org/jira/browse/SPARK-29367) when running
|
||||
`pandas_udf`s or `toPandas()` with Arrow enabled. More information about the Arrow IPC change can
|
||||
be read on the Arrow 0.15.0 release [blog](http://arrow.apache.org/blog/2019/10/06/0.15.0-release/#columnar-streaming-protocol-change-since-0140).
|
||||
|
|
Loading…
Reference in a new issue