spark-instrumented-optimizer

History

Takuya UESHIN 63c5bf13ce [SPARK-23334][SQL][PYTHON] Fix pandas_udf with return type StringType() to handle str type properly in Python 2. ## What changes were proposed in this pull request? In Python 2, when `pandas_udf` tries to return string type value created in the udf with `".."`, the execution fails. E.g., ```python from pyspark.sql.functions import pandas_udf, col import pandas as pd df = spark.range(10) str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), "string") df.select(str_f(col('id'))).show() ``` raises the following exception: ``` ... java.lang.AssertionError: assertion failed: Invalid schema from pandas_udf: expected StringType, got BinaryType at scala.Predef$.assert(Predef.scala:170) at org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.<init>(ArrowEvalPythonExec.scala:93) ... ``` Seems like pyarrow ignores `type` parameter for `pa.Array.from_pandas()` and consider it as binary type when the type is string type and the string values are `str` instead of `unicode` in Python 2. This pr adds a workaround for the case. ## How was this patch tested? Added a test and existing tests. Author: Takuya UESHIN <ueshin@databricks.com> Closes #20507 from ueshin/issues/SPARK-23334.		2018-02-06 18:30:50 +09:00
..
__init__.py	[SPARK-22369][PYTHON][DOCS] Exposes catalog API documentation in PySpark	2017-11-02 15:22:52 +01:00
catalog.py	[SPARK-23122][PYTHON][SQL] Deprecate register* for UDFs in SQLContext and Catalog in PySpark	2018-01-18 14:51:05 +09:00
column.py	[SPARK-19165][PYTHON][SQL] PySpark APIs using columns as arguments should validate input types for column	2017-08-24 20:29:03 +09:00
conf.py	[SPARK-15464][ML][MLLIB][SQL][TESTS] Replace SQLContext and SparkContext with SparkSession using builder pattern in python test code	2016-05-23 18:14:48 -07:00
context.py	[SPARK-23122][PYTHON][SQL] Deprecate register* for UDFs in SQLContext and Catalog in PySpark	2018-01-18 14:51:05 +09:00
dataframe.py	[SPARK-23290][SQL][PYTHON] Use datetime.date for date type when converting Spark DataFrame to Pandas DataFrame.	2018-02-06 14:52:25 +08:00
functions.py	[SPARK-23261][PYSPARK] Rename Pandas UDFs	2018-01-30 21:55:55 +09:00
group.py	[SPARK-23261][PYSPARK] Rename Pandas UDFs	2018-01-30 21:55:55 +09:00
readwriter.py	[SPARK-22818][SQL] csv escape of quote escape	2017-12-29 07:30:06 +08:00
session.py	[SPARK-23228][PYSPARK] Add Python Created jsparkSession to JVM's defaultSession	2018-01-31 20:04:51 +09:00
streaming.py	[SPARK-23143][SS][PYTHON] Added python API for setting continuous trigger	2018-01-18 12:25:52 -08:00
tests.py	[SPARK-23334][SQL][PYTHON] Fix pandas_udf with return type StringType() to handle str type properly in Python 2.	2018-02-06 18:30:50 +09:00
types.py	[SPARK-23290][SQL][PYTHON] Use datetime.date for date type when converting Spark DataFrame to Pandas DataFrame.	2018-02-06 14:52:25 +08:00
udf.py	[SPARK-23261][PYSPARK] Rename Pandas UDFs	2018-01-30 21:55:55 +09:00
utils.py	[SPARK-22874][PYSPARK][SQL][FOLLOW-UP] Modify error messages to show actual versions.	2017-12-25 20:29:10 +09:00
window.py	[SPARK-18690][PYTHON][SQL] Backward compatibility of unbounded frames	2016-12-02 17:39:28 -08:00