spark-instrumented-optimizer/python/pyspark/sql
Takuya UESHIN 63c5bf13ce [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return type StringType() to handle str type properly in Python 2.
## What changes were proposed in this pull request?

In Python 2, when `pandas_udf` tries to return string type value created in the udf with `".."`, the execution fails. E.g.,

```python
from pyspark.sql.functions import pandas_udf, col
import pandas as pd

df = spark.range(10)
str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), "string")
df.select(str_f(col('id'))).show()
```

raises the following exception:

```
...

java.lang.AssertionError: assertion failed: Invalid schema from pandas_udf: expected StringType, got BinaryType
	at scala.Predef$.assert(Predef.scala:170)
	at org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.<init>(ArrowEvalPythonExec.scala:93)

...
```

Seems like pyarrow ignores `type` parameter for `pa.Array.from_pandas()` and consider it as binary type when the type is string type and the string values are `str` instead of `unicode` in Python 2.

This pr adds a workaround for the case.

## How was this patch tested?

Added a test and existing tests.

Author: Takuya UESHIN <ueshin@databricks.com>

Closes #20507 from ueshin/issues/SPARK-23334.
2018-02-06 18:30:50 +09:00
..
__init__.py [SPARK-22369][PYTHON][DOCS] Exposes catalog API documentation in PySpark 2017-11-02 15:22:52 +01:00
catalog.py [SPARK-23122][PYTHON][SQL] Deprecate register* for UDFs in SQLContext and Catalog in PySpark 2018-01-18 14:51:05 +09:00
column.py [SPARK-19165][PYTHON][SQL] PySpark APIs using columns as arguments should validate input types for column 2017-08-24 20:29:03 +09:00
conf.py [SPARK-15464][ML][MLLIB][SQL][TESTS] Replace SQLContext and SparkContext with SparkSession using builder pattern in python test code 2016-05-23 18:14:48 -07:00
context.py [SPARK-23122][PYTHON][SQL] Deprecate register* for UDFs in SQLContext and Catalog in PySpark 2018-01-18 14:51:05 +09:00
dataframe.py [SPARK-23290][SQL][PYTHON] Use datetime.date for date type when converting Spark DataFrame to Pandas DataFrame. 2018-02-06 14:52:25 +08:00
functions.py [SPARK-23261][PYSPARK] Rename Pandas UDFs 2018-01-30 21:55:55 +09:00
group.py [SPARK-23261][PYSPARK] Rename Pandas UDFs 2018-01-30 21:55:55 +09:00
readwriter.py [SPARK-22818][SQL] csv escape of quote escape 2017-12-29 07:30:06 +08:00
session.py [SPARK-23228][PYSPARK] Add Python Created jsparkSession to JVM's defaultSession 2018-01-31 20:04:51 +09:00
streaming.py [SPARK-23143][SS][PYTHON] Added python API for setting continuous trigger 2018-01-18 12:25:52 -08:00
tests.py [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return type StringType() to handle str type properly in Python 2. 2018-02-06 18:30:50 +09:00
types.py [SPARK-23290][SQL][PYTHON] Use datetime.date for date type when converting Spark DataFrame to Pandas DataFrame. 2018-02-06 14:52:25 +08:00
udf.py [SPARK-23261][PYSPARK] Rename Pandas UDFs 2018-01-30 21:55:55 +09:00
utils.py [SPARK-22874][PYSPARK][SQL][FOLLOW-UP] Modify error messages to show actual versions. 2017-12-25 20:29:10 +09:00
window.py [SPARK-18690][PYTHON][SQL] Backward compatibility of unbounded frames 2016-12-02 17:39:28 -08:00