spark-instrumented-optimizer/python/docs/source
Hyukjin Kwon 747fe7282c [SPARK-35419][PYTHON] Enable spark.sql.execution.pyspark.udf.simplifiedTraceback.enabled by default
### What changes were proposed in this pull request?

https://github.com/apache/spark/pull/30309 added a configuration (disabled by default) that simplifies the error messages from Python UDFs by removing the internal stack trace of the Python workers. For example:

```python
from pyspark.sql.functions import udf; spark.range(10).select(udf(lambda x: x/0)("id")).collect()
```

**Before**

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../python/pyspark/sql/dataframe.py", line 427, in show
    print(self._jdf.showString(n, 20, vertical))
  File "/.../python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/.../python/pyspark/sql/utils.py", line 127, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.PythonException:
  An exception was thrown from Python worker in the executor:
Traceback (most recent call last):
  File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 605, in main
    process()
  File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 597, in process
    serializer.dump_stream(out_iter, outfile)
  File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream
    for obj in iterator:
  File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched
    for item in iterator:
  File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda>
    return lambda *a: f(*a)
  File "/.../python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper
    return f(*args, **kwargs)
  File "<stdin>", line 1, in <lambda>
ZeroDivisionError: division by zero
```

**After**

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../python/pyspark/sql/dataframe.py", line 427, in show
    print(self._jdf.showString(n, 20, vertical))
  File "/.../python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/.../python/pyspark/sql/utils.py", line 127, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.PythonException:
  An exception was thrown from Python worker in the executor:
Traceback (most recent call last):
  File "<stdin>", line 1, in <lambda>
ZeroDivisionError: division by zero
```
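
To illustrate the idea behind the shortened output, here is a minimal, self-contained sketch that drops "internal" frames from a traceback before formatting it. It is not PySpark's actual implementation: the helper `simplified_traceback`, the fake file name `<hypothetical_worker>`, and the functions `run` and `user_udf` are all hypothetical names introduced only for this example.

```python
import traceback


def simplified_traceback(exc, internal_marker):
    """Format `exc`, skipping frames whose file name contains `internal_marker`."""
    frames = traceback.extract_tb(exc.__traceback__)
    kept = [f for f in frames if internal_marker not in f.filename]
    lines = ["Traceback (most recent call last):\n"]
    lines += traceback.format_list(kept)
    lines += traceback.format_exception_only(type(exc), exc)
    return "".join(lines)


# Stand-in for worker-side plumbing: compiled under a fake file name so that
# its frame can be recognized and filtered out (an assumption for this demo).
worker_src = "def run(f, x):\n    return f(x)\n"
namespace = {}
exec(compile(worker_src, "<hypothetical_worker>", "exec"), namespace)
run = namespace["run"]


def user_udf(x):
    # Plays the role of the user's UDF body.
    return x / 0


try:
    run(user_udf, 1)
except ZeroDivisionError as e:
    text = simplified_traceback(e, "<hypothetical_worker>")
    print(text)
```

The printed traceback keeps the caller and `user_udf` frames and the final `ZeroDivisionError` line, while the frame from the stand-in worker code is omitted.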

Note that the removed part of the traceback (ending in `return f(*args, **kwargs)`) is almost always the same; I would say in more than 99% of cases. For the remaining 1%, we can guide developers to turn this configuration off to get the full traceback for further debugging.
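
For the rare case where the internal worker frames are needed, the full traceback can be brought back by setting the configuration to `false`. The sketch below assumes an existing `SparkSession` named `spark` and that the option can be changed at runtime through `spark.conf.set`; the helper name `set_simplified_traceback` is hypothetical.

```python
SIMPLIFIED_TRACEBACK_KEY = (
    "spark.sql.execution.pyspark.udf.simplifiedTraceback.enabled"
)


def set_simplified_traceback(spark, enabled):
    """Toggle simplified Python UDF tracebacks on the given session.

    `spark` is assumed to expose `conf.set(key, value)` as SparkSession does.
    """
    spark.conf.set(SIMPLIFIED_TRACEBACK_KEY, "true" if enabled else "false")


# Usage, assuming `spark` is an active SparkSession:
# set_simplified_traceback(spark, False)  # show internal worker frames again
```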

In Databricks, it has been enabled for around 6 months, and I have had zero negative feedback on it.

### Why are the changes needed?

To show simplified exception messages to end users.

### Does this PR introduce _any_ user-facing change?

Yes, it will hide the internal Python worker traceback.

### How was this patch tested?

Existing test cases should cover this change.

Closes #32569 from HyukjinKwon/SPARK-35419.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-18 12:27:09 +09:00

| Name | Last commit | Date |
| --- | --- | --- |
| _static | Spelling r common dev mlib external project streaming resource managers python | 2020-11-27 10:22:45 -06:00 |
| _templates/autosummary | Spelling r common dev mlib external project streaming resource managers python | 2020-11-27 10:22:45 -06:00 |
| development | [MINOR][DOCS] Replace http to https when possible in PySpark documentation | 2021-02-23 11:18:47 +09:00 |
| getting_started | [SPARK-35300][PYTHON][DOCS] Standardize module names in install.rst | 2021-05-04 11:02:57 +09:00 |
| migration_guide | [SPARK-35419][PYTHON] Enable spark.sql.execution.pyspark.udf.simplifiedTraceback.enabled by default | 2021-05-18 12:27:09 +09:00 |
| reference | [SPARK-33678][SQL] Product aggregation function | 2021-03-02 16:51:07 +09:00 |
| user_guide | [MINOR][DOCS] Use ASCII characters when possible in PySpark documentation | 2021-04-04 09:49:36 +03:00 |
| conf.py | [SPARK-34657][PYTHON][DOCS] Replace the tag of release to the hash to hide RC tags in Binder | 2021-03-08 10:48:17 +09:00 |
| index.rst | [MINOR][DOCS] Use ASCII characters when possible in PySpark documentation | 2021-04-04 09:49:36 +03:00 |