spark-instrumented-optimizer/python/pyspark/sql
HyukjinKwon 4fafdcd63b [SPARK-26412][PYTHON][FOLLOW-UP] Improve error messages in Scalar iterator pandas UDF
### What changes were proposed in this pull request?

This PR proposes to improve the error messages raised by Scalar iterator pandas UDFs.

### Why are the changes needed?

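To show the correct error messages. Previously, a Scalar iterator pandas UDF that returned too many rows raised an assertion naming `MAP_ITER` instead of `SCALAR_ITER`, and the row-count mismatch message did not name the UDF type.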

### Does this PR introduce any user-facing change?

Yes, but only in unreleased branches. The two UDFs below return fewer and more rows than their input, respectively.

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('long', PandasUDFType.SCALAR_ITER)
def pandas_plus_one(iterator):
    for _ in iterator:
        # One output row per batch: fewer rows than the input.
        yield pd.Series(1)

spark.range(10).repartition(1).select(pandas_plus_one("id")).show()
```
```python
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('long', PandasUDFType.SCALAR_ITER)
def pandas_plus_one(iterator):
    for _ in iterator:
        # Twenty output rows per batch: more rows than the input.
        yield pd.Series(list(range(20)))

spark.range(10).repartition(1).select(pandas_plus_one("id")).show()
```

**Before:**

```
RuntimeError: The number of output rows of pandas iterator UDF should
be the same with input rows. The input rows number is 10 but the output
rows number is 1.
```
```
AssertionError: Pandas MAP_ITER UDF outputted more rows than input rows.
```

**After:**

```
RuntimeError: The length of output in Scalar iterator pandas UDF should be
the same with the input's; however, the length of output was 1 and the length
of input was 10.
```
```
AssertionError: Pandas SCALAR_ITER UDF outputted more rows than input rows.
```
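
The two failure modes can be reproduced without a Spark session. The following is a minimal sketch, not the code in PySpark's worker: `check_scalar_iter_lengths`, `too_few`, and `too_many` are hypothetical names introduced for illustration, and plain lists stand in for `pd.Series` batches. It assumes only that the check counts input and output rows and raises the messages shown under **After**.

```python
def check_scalar_iter_lengths(udf, batches):
    """Hypothetical helper: run `udf` over `batches` and validate row counts."""
    num_input_rows = 0

    def counted_input():
        # Count input rows lazily, as the UDF pulls each batch.
        nonlocal num_input_rows
        for batch in batches:
            num_input_rows += len(batch)
            yield batch

    num_output_rows = 0
    for out in udf(counted_input()):
        num_output_rows += len(out)
        # Fail fast once the output outgrows the input consumed so far.
        if num_output_rows > num_input_rows:
            raise AssertionError(
                "Pandas SCALAR_ITER UDF outputted more rows than input rows.")
        yield out

    # After the UDF finishes, the totals must match exactly.
    if num_output_rows != num_input_rows:
        raise RuntimeError(
            "The length of output in Scalar iterator pandas UDF should be "
            "the same with the input's; however, the length of output was %d "
            "and the length of input was %d."
            % (num_output_rows, num_input_rows))


def too_few(iterator):
    # Mirrors the first example: one output row per input batch.
    for _ in iterator:
        yield [1]


def too_many(iterator):
    # Mirrors the second example: twenty output rows per input batch.
    for _ in iterator:
        yield list(range(20))
```

Consuming `check_scalar_iter_lengths(too_few, [list(range(10))])` with `list(...)` raises the `RuntimeError`, while `too_many` triggers the `AssertionError` on its first output batch. A UDF that yields one output row per input row passes through unchanged.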

### How was this patch tested?

Unit tests were updated accordingly.

Closes #28135 from HyukjinKwon/SPARK-26412-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-09 13:14:41 +09:00
..
avro [SPARK-27506][SQL][FOLLOWUP] Use option avroSchema to specify an evolved schema in from_avro 2019-12-30 18:14:21 +09:00
pandas [SPARK-31287][PYTHON][SQL] Ignore type hints in groupby.(cogroup.)applyInPandas and mapInPandas 2020-03-29 13:59:18 +09:00
tests [SPARK-26412][PYTHON][FOLLOW-UP] Improve error messages in Scalar iterator pandas UDF 2020-04-09 13:14:41 +09:00
__init__.py [SPARK-31088][SQL] Add back HiveContext and createExternalTable 2020-03-26 23:51:15 -07:00
catalog.py [SPARK-31088][SQL] Add back HiveContext and createExternalTable 2020-03-26 23:51:15 -07:00
column.py [SPARK-30859][PYSPARK][DOCS][MINOR] Fixed docstring syntax issues preventing proper compilation of documentation 2020-02-18 16:46:45 +09:00
conf.py [SPARK-23698][PYTHON] Resolve undefined names in Python 3 2018-08-22 10:06:59 -07:00
context.py [SPARK-31088][SQL] Add back HiveContext and createExternalTable 2020-03-26 23:51:15 -07:00
dataframe.py [SPARK-31087] [SQL] Add Back Multiple Removed APIs 2020-03-28 22:05:16 -07:00
functions.py [SPARK-31306][DOCS] update rand() function documentation to indicate exclusive upper bound 2020-03-31 15:16:17 +09:00
group.py [SPARK-30434][PYTHON][SQL] Move pandas related functionalities into 'pandas' sub-package 2020-01-09 10:22:50 +09:00
readwriter.py [SPARK-31286][SQL][DOC] Specify formats of time zone ID for JSON/CSV option and from/to_utc_timestamp 2020-03-30 12:20:11 +08:00
session.py [SPARK-30856][SQL][PYSPARK] Fix SQLContext.getOrCreate() when SparkContext is restarted 2020-02-20 12:21:24 +09:00
streaming.py [SPARK-31286][SQL][DOC] Specify formats of time zone ID for JSON/CSV option and from/to_utc_timestamp 2020-03-30 12:20:11 +08:00
types.py [SPARK-30941][PYSPARK] Add a note to asDict to document its behavior when there are duplicate fields 2020-03-09 11:06:45 -07:00
udf.py [SPARK-30722][PYTHON][DOCS] Update documentation for Pandas UDF with Python type hints 2020-02-12 10:49:46 +09:00
utils.py [SPARK-30434][PYTHON][SQL] Move pandas related functionalities into 'pandas' sub-package 2020-01-09 10:22:50 +09:00
window.py [SPARK-30188][SQL] Resolve the failed unit tests when enable AQE 2020-01-13 22:55:19 +08:00