[SPARK-27995][PYTHON] Note the difference between str of Python 2 and 3 at Arrow optimized

## What changes were proposed in this pull request? When Arrow optimization is enabled in Python 2.7, ```python import pandas pdf = pandas.DataFrame(["test1", "test2"]) df = spark.createDataFrame(pdf) df.show() ``` I got the following output: ``` +----------------+ | 0| +----------------+ |[74 65 73 74 31]| |[74 65 73 74 32]| +----------------+ ``` This looks because Python's `str` and `byte` are same. it does look right: ```python >>> str == bytes True >>> isinstance("a", bytes) True ``` To cut it short: 1. Python 2 treats `str` as `bytes`. 2. PySpark added some special codes and hacks to recognizes `str` as string types. 3. PyArrow / Pandas followed Python 2 difference To fix, we have two options: 1. Fix it to match the behaviour to PySpark's 2. Note the differences but Python 2 is deprecated anyway. I think it's better to just note it and for go option 2. ## How was this patch tested? Manually tested. Doc was checked too: ![Screen Shot 2019-06-11 at 6 40 07 PM](https://user-images.githubusercontent.com/6477701/59261402-59ad3b00-8c78-11e9-94a6-3236a2c338d4.png) Closes #24838 from HyukjinKwon/SPARK-27995. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-06-11 18:43:59 +09:00 · 2019-06-11 18:43:59 +09:00 · 1217996f15
parent 6284ac7088
commit 1217996f15
1 changed files with 5 additions and 0 deletions
--- a/python/pyspark/sql/session.py
+++ b/python/pyspark/sql/session.py
@ -653,6 +653,11 @@ class SparkSession(object):

        .. note:: Usage with spark.sql.execution.arrow.pyspark.enabled=True is experimental.

+        .. note:: When Arrow optimization is enabled, strings inside Pandas DataFrame in Python
+            2 are converted into bytes as they are bytes in Python 2 whereas regular strings are
+            left as strings. When using strings in Python 2, use unicode `u""` as Python standard
+            practice.
+
        >>> l = [('Alice', 1)]
        >>> spark.createDataFrame(l).collect()
        [Row(_1=u'Alice', _2=1)]