Commit graph

20 commits

Author SHA1 Message Date
HyukjinKwon e2c7bfce40 [SPARK-33407][PYTHON] Simplify the exception message from Python UDFs (disabled by default)
### What changes were proposed in this pull request?

This PR proposes to simplify the exception messages from Python UDFS.

Currently, the exception message from Python UDFs is as below:

```python
from pyspark.sql.functions import udf; spark.range(10).select(udf(lambda x: x/0)("id")).collect()
```

```python
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../python/pyspark/sql/dataframe.py", line 427, in show
    print(self._jdf.showString(n, 20, vertical))
  File "/.../python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/.../python/pyspark/sql/utils.py", line 127, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.PythonException:
  An exception was thrown from Python worker in the executor:
Traceback (most recent call last):
  File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 605, in main
    process()
  File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 597, in process
    serializer.dump_stream(out_iter, outfile)
  File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream
    for obj in iterator:
  File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched
    for item in iterator:
  File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda>
    return lambda *a: f(*a)
  File "/.../python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper
    return f(*args, **kwargs)
  File "<stdin>", line 1, in <lambda>
ZeroDivisionError: division by zero
```

Actually, almost all cases, users only care about `ZeroDivisionError: division by zero`. We don't really have to show the internal stuff in 99% cases.

This PR adds a configuration `spark.sql.execution.pyspark.udf.simplifiedException.enabled` (disabled by default) that hides the internal tracebacks related to Python worker, (de)serialization, etc.

```python
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../python/pyspark/sql/dataframe.py", line 427, in show
    print(self._jdf.showString(n, 20, vertical))
  File "/.../python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/.../python/pyspark/sql/utils.py", line 127, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.PythonException:
  An exception was thrown from Python worker in the executor:
Traceback (most recent call last):
  File "<stdin>", line 1, in <lambda>
ZeroDivisionError: division by zero
```

The trackback will be shown from the point when any non-PySpark file is seen in the traceback.

### Why are the changes needed?

Without this configuration. such internal tracebacks are exposed to users directly especially for shall or notebook users in PySpark. 99% cases people don't care about the internal Python worker, (de)serialization and related tracebacks. It just makes the exception more difficult to read. For example, one statement of `x/0` above shows a very long traceback and most of them are unnecessary.

This configuration enables the ability to show simplified tracebacks which users will likely be most interested in.

### Does this PR introduce _any_ user-facing change?

By default, no. It adds one configuration that simplifies the exception message. See the example above.

### How was this patch tested?

Manually tested:

```bash
$ pyspark --conf spark.sql.execution.pyspark.udf.simplifiedException.enabled=true
```
```python
from pyspark.sql.functions import udf; spark.sparkContext.setLogLevel("FATAL"); spark.range(10).select(udf(lambda x: x/0)("id")).collect()
```

and unittests were also added.

Closes #30309 from HyukjinKwon/SPARK-33407.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-17 14:15:31 +09:00
zero323 52073ef8ac [SPARK-33254][PYTHON][DOCS] Migration to NumPy documentation style in Core (pyspark.*, pyspark.resource.*, etc.)
### What changes were proposed in this pull request?

This PR proposes migration of Core to NumPy documentation style.

### Why are the changes needed?

To improve documentation style.

### Does this PR introduce _any_ user-facing change?

Yes, this changes both rendered HTML docs and console representation (SPARK-33243).

### How was this patch tested?

dev/lint-python and manual inspection.

Closes #30320 from zero323/SPARK-33254.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-16 10:21:50 +09:00
zero323 72da6f86cf [SPARK-33002][PYTHON] Remove non-API annotations
### What changes were proposed in this pull request?

This PR:

- removes annotations for modules which are not part of the public API.
- removes `__init__.pyi` files, if no annotations, beyond exports, are present.

### Why are the changes needed?

Primarily to reduce maintenance overhead and as requested in the comments to https://github.com/apache/spark/pull/29591

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests and additional MyPy checks:

```
mypy --no-incremental --config python/mypy.ini python/pyspark
MYPYPATH=python/ mypy --no-incremental --config python/mypy.ini examples/src/main/python/ml examples/src/main/python/sql examples/src/main/python/sql/streaming
```

Closes #29879 from zero323/SPARK-33002.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-07 19:53:59 +09:00
HyukjinKwon 89d9b7cc64 [SPARK-32010][PYTHON][CORE] Add InheritableThread for local properties and fixing a thread leak issue in pinned thread mode
### What changes were proposed in this pull request?

This PR proposes:

1. To introduce `InheritableThread` class, that works identically with `threading.Thread` but it can inherit the inheritable attributes of a JVM thread such as `InheritableThreadLocal`.

    This was a problem from the pinned thread mode, see also https://github.com/apache/spark/pull/24898. Now it works as below:

    ```python
    import pyspark

    spark.sparkContext.setLocalProperty("a", "hi")
    def print_prop():
        print(spark.sparkContext.getLocalProperty("a"))

    pyspark.InheritableThread(target=print_prop).start()
    ```

    ```
    hi
    ```

2. Also, it adds the resource leak fix into `InheritableThread`. Py4J leaks the thread and does not close the connection from Python to JVM. In `InheritableThread`, it manually closes the connections when PVM garbage collection happens. So, JVM threads finish safely. I manually verified by profiling but there's also another easy way to verify:

    ```bash
    PYSPARK_PIN_THREAD=true ./bin/pyspark
    ```

    ```python
    >>> from threading import Thread
    >>> Thread(target=lambda: spark.range(1000).collect()).start()
    >>> Thread(target=lambda: spark.range(1000).collect()).start()
    >>> Thread(target=lambda: spark.range(1000).collect()).start()
    >>> spark._jvm._gateway_client.deque
    deque([<py4j.clientserver.ClientServerConnection object at 0x119f7aba8>, <py4j.clientserver.ClientServerConnection object at 0x119fc9b70>, <py4j.clientserver.ClientServerConnection object at 0x119fc9e10>, <py4j.clientserver.ClientServerConnection object at 0x11a015358>, <py4j.clientserver.ClientServerConnection object at 0x119fc00f0>])
    >>> Thread(target=lambda: spark.range(1000).collect()).start()
    >>> spark._jvm._gateway_client.deque
    deque([<py4j.clientserver.ClientServerConnection object at 0x119f7aba8>, <py4j.clientserver.ClientServerConnection object at 0x119fc9b70>, <py4j.clientserver.ClientServerConnection object at 0x119fc9e10>, <py4j.clientserver.ClientServerConnection object at 0x11a015358>, <py4j.clientserver.ClientServerConnection object at 0x119fc08d0>, <py4j.clientserver.ClientServerConnection object at 0x119fc00f0>])
    ```

    This issue is fixed now.

3. Because now we have a fix for the issue here, it also proposes to deprecate `collectWithJobGroup` which was a temporary workaround added to avoid this leak issue.

### Why are the changes needed?

To support pinned thread mode properly without a resource leak, and a proper inheritable local properties.

### Does this PR introduce _any_ user-facing change?

Yes, it adds an API `InheritableThread` class for pinned thread mode.

### How was this patch tested?

Manually tested as described above, and unit test was added as well.

Closes #28968 from HyukjinKwon/SPARK-32010.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-30 10:15:25 +09:00
HyukjinKwon 4ad9bfd53b [SPARK-32138] Drop Python 2.7, 3.4 and 3.5
### What changes were proposed in this pull request?

This PR aims to drop Python 2.7, 3.4 and 3.5.

Roughly speaking, it removes all the widely known Python 2 compatibility workarounds such as `sys.version` comparison, `__future__`. Also, it removes the Python 2 dedicated codes such as `ArrayConstructor` in Spark.

### Why are the changes needed?

 1. Unsupport EOL Python versions
 2. Reduce maintenance overhead and remove a bit of legacy codes and hacks for Python 2.
 3. PyPy2 has a critical bug that causes a flaky test, SPARK-28358 given my testing and investigation.
 4. Users can use Python type hints with Pandas UDFs without thinking about Python version
 5. Users can leverage one latest cloudpickle, https://github.com/apache/spark/pull/28950. With Python 3.8+ it can also leverage C pickle.

### Does this PR introduce _any_ user-facing change?

Yes, users cannot use Python 2.7, 3.4 and 3.5 in the upcoming Spark version.

### How was this patch tested?

Manually tested and also tested in Jenkins.

Closes #28957 from HyukjinKwon/SPARK-32138.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-14 11:22:44 +09:00
HyukjinKwon feeca63198 [SPARK-32011][PYTHON][CORE] Remove warnings about pin-thread modes and guide to use collectWithJobGroup
### What changes were proposed in this pull request?

This PR proposes to remove the warning about multi-thread in local properties, and change the guide to use `collectWithJobGroup` for multi-threads for now because:
- It is too noisy to users who don't use multiple threads - the number of this single thread case is arguably more prevailing.
- There was a critical issue found about pin-thread mode SPARK-32010, which will be fixed in Spark 3.1.
- To smoothly migrate, `RDD.collectWithJobGroup` was added, which will be deprecated in Spark 3.1 with SPARK-32010 fixed.

I will target to deprecate `RDD.collectWithJobGroup`, and make this pin-thread mode stable in Spark 3.1. In the future releases, I plan to make this mode as a default mode, and remove `RDD.collectWithJobGroup` away.

### Why are the changes needed?

To avoid guiding users a feature with a critical issue, and provide a proper workaround for now.

### Does this PR introduce _any_ user-facing change?

Yes, warning message and documentation.

### How was this patch tested?

Manually tested:

Before:

```
>>> spark.sparkContext.setLocalProperty("a", "b")
/.../spark/python/pyspark/util.py:141: UserWarning: Currently, 'setLocalProperty' (set to local
properties) with multiple threads does not properly work.
Internally threads on PVM and JVM are not synced, and JVM thread can be reused for multiple
threads on PVM, which fails to isolate local properties for each thread on PVM.
To work around this, you can set PYSPARK_PIN_THREAD to true (see SPARK-22340). However,
note that it cannot inherit the local properties from the parent thread although it isolates each
thread on PVM and JVM with its own local properties.
To work around this, you should manually copy and set the local properties from the parent thread
 to the child thread when you create another thread.
```

After:
```
>>> spark.sparkContext.setLocalProperty("a", "b")
```

Closes #28845 from HyukjinKwon/SPARK-32011.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-06-17 12:10:12 +09:00
Thomas Graves 95aec091e4 [SPARK-29641][PYTHON][CORE] Stage Level Sched: Add python api's and tests
### What changes were proposed in this pull request?

As part of the Stage level scheduling features, add the Python api's to set resource profiles.
This also adds the functionality to properly apply the pyspark memory configuration when specified in the ResourceProfile. The pyspark memory configuration is being passed in the task local properties. This was an easy way to get it to the PythonRunner that needs it. I modeled this off how the barrier task scheduling is passing the addresses. As part of this I added in the JavaRDD api's because those are needed by python.

### Why are the changes needed?

python api for this feature

### Does this PR introduce any user-facing change?

Yes adds the java and python apis for user to specify a ResourceProfile to use stage level scheduling.

### How was this patch tested?

unit tests and manually tested on yarn. Tests also run to verify it errors properly on standalone and local mode where its not yet supported.

Closes #28085 from tgravescs/SPARK-29641-pr-base.

Lead-authored-by: Thomas Graves <tgraves@nvidia.com>
Co-authored-by: Thomas Graves <tgraves@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-23 10:20:39 +09:00
HyukjinKwon 74cb1ffd68 [SPARK-22340][PYTHON][FOLLOW-UP] Add a better message and improve documentation for pinned thread mode
### What changes were proposed in this pull request?

This PR proposes to show different warning message when the pinned thread mode is enabled:

When enabled:

> PYSPARK_PIN_THREAD feature is enabled. However, note that it cannot inherit the local properties from the parent thread although it isolates each thread on PVM and JVM with its own local properties.
> To work around this, you should manually copy and set the local properties from the parent thread to the child thread when you create another thread.

When disabled:

> Currently, 'setLocalProperty' (set to local properties) with multiple threads does not properly work.
> Internally threads on PVM and JVM are not synced, and JVM thread can be reused for multiple threads on PVM, which fails to isolate local properties for each thread on PVM.
> To work around this, you can set PYSPARK_PIN_THREAD to true (see SPARK-22340). However, note that it cannot inherit the local properties from the parent thread although it isolates each thread on PVM and JVM with its own local properties.
> To work around this, you should manually copy and set the local properties from the parent thread to the child thread when you create another thread.

### Why are the changes needed?

Currently, it shows the same warning message regardless of PYSPARK_PIN_THREAD being set. In the warning message it says "you can set PYSPARK_PIN_THREAD to true ..." which is confusing.

### Does this PR introduce any user-facing change?

Documentation and warning message as shown above.

### How was this patch tested?

Manually tested.

```bash
$ PYSPARK_PIN_THREAD=true ./bin/pyspark
```

```python
sc.setJobGroup("a", "b")
```

```
.../pyspark/util.py:141: UserWarning: PYSPARK_PIN_THREAD feature is enabled. However, note that it cannot inherit the local properties from the parent thread although it isolates each thread on PVM and JVM with its own local properties.
To work around this, you should manually copy and set the local properties from the parent thread to the child thread when you create another thread.
  warnings.warn(msg, UserWarning)
```

```bash
$ ./bin/pyspark
```

```python
sc.setJobGroup("a", "b")
```

```
.../pyspark/util.py:141: UserWarning: Currently, 'setJobGroup' (set to local properties) with multiple threads does not properly work.
Internally threads on PVM and JVM are not synced, and JVM thread can be reused for multiple threads on PVM, which fails to isolate local properties for each thread on PVM.
To work around this, you can set PYSPARK_PIN_THREAD to true (see SPARK-22340). However, note that it cannot inherit the local properties from the parent thread although it isolates each thread on PVM and JVM with its own local properties.
To work around this, you should manually copy and set the local properties from the parent thread to the child thread when you create another thread.
  warnings.warn(msg, UserWarning)
```

Closes #26588 from HyukjinKwon/SPARK-22340.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-11-21 10:54:01 +09:00
Liang-Chi Hsieh 2bc3fff13b [SPARK-29341][PYTHON] Upgrade cloudpickle to 1.0.0
### What changes were proposed in this pull request?

This patch upgrades cloudpickle to 1.0.0 version.

Main changes:

1. cleanup unused functions: 936f16fac8
2. Fix relative imports inside function body: 31ecdd6f57
3. Write kw only arguments to pickle: 6cb4718528

### Why are the changes needed?

We should include new bug fix like 6cb4718528, because users might use such python function in PySpark.

```python
>>> def f(a, *, b=1):
...   return a + b
...
>>> rdd = sc.parallelize([1, 2, 3])
>>> rdd.map(f).collect()
[Stage 0:>                                                        (0 + 12) / 12]19/10/03 00:42:24 ERROR Executor: Exception in task 3.0 in stage 0.0 (TID 3)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/spark/python/lib/pyspark.zip/pyspark/worker.py", line 598, in main
    process()
  File "/spark/python/lib/pyspark.zip/pyspark/worker.py", line 590, in process
    serializer.dump_stream(out_iter, outfile)
  File "/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 513, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/spark/python/lib/pyspark.zip/pyspark/util.py", line 99, in wrapper
    return f(*args, **kwargs)
TypeError: f() missing 1 required keyword-only argument: 'b'
```

After:

```python
>>> def f(a, *, b=1):
...   return a + b
...
>>> rdd = sc.parallelize([1, 2, 3])
>>> rdd.map(f).collect()
[2, 3, 4]
```

### Does this PR introduce any user-facing change?

Yes. This fixes two bugs when pickling Python functions.

### How was this patch tested?

Existing tests.

Closes #26009 from viirya/upgrade-cloudpickle.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-03 19:20:51 +09:00
Gabor Somogyi 3729efb4d0 [SPARK-26856][PYSPARK] Python support for from_avro and to_avro APIs
## What changes were proposed in this pull request?

Avro is built-in but external data source module since Spark 2.4 but  `from_avro` and `to_avro` APIs not yet supported in pyspark.

In this PR I've made them available from pyspark.

## How was this patch tested?

Please see the python API examples what I've added.

cd docs/
SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll build
Manual webpage check.

Closes #23797 from gaborgsomogyi/SPARK-26856.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-03-11 10:15:07 +09:00
Sean Owen 08c76b5d39 [SPARK-25238][PYTHON] lint-python: Fix W605 warnings for pycodestyle 2.4
(This change is a subset of the changes needed for the JIRA; see https://github.com/apache/spark/pull/22231)

## What changes were proposed in this pull request?

Use raw strings and simpler regex syntax consistently in Python, which also avoids warnings from pycodestyle about accidentally relying Python's non-escaping of non-reserved chars in normal strings. Also, fix a few long lines.

## How was this patch tested?

Existing tests, and some manual double-checking of the behavior of regexes in Python 2/3 to be sure.

Closes #22400 from srowen/SPARK-25238.2.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
2018-09-13 11:19:43 +08:00
edorigatti 3e5b4ae63a [SPARK-23754][PYTHON][FOLLOWUP] Move UDF stop iteration wrapping from driver to executor
## What changes were proposed in this pull request?
SPARK-23754 was fixed in #21383 by changing the UDF code to wrap the user function, but this required a hack to save its argspec. This PR reverts this change and fixes the `StopIteration` bug in the worker

## How does this work?

The root of the problem is that when an user-supplied function raises a `StopIteration`, pyspark might stop processing data, if this function is used in a for-loop. The solution is to catch `StopIteration`s exceptions and re-raise them as `RuntimeError`s, so that the execution fails and the error is reported to the user. This is done using the `fail_on_stopiteration` wrapper, in different ways depending on where the function is used:
 - In RDDs, the user function is wrapped in the driver, because this function is also called in the driver itself.
 - In SQL UDFs, the function is wrapped in the worker, since all processing happens there. Moreover, the worker needs the signature of the user function, which is lost when wrapping it, but passing this signature to the worker requires a not so nice hack.

## How was this patch tested?

Same tests, plus tests for pandas UDFs

Author: edorigatti <emilio.dorigatti@gmail.com>

Closes #21467 from e-dorigatti/fix_udf_hack.
2018-06-11 10:15:42 +08:00
e-dorigatti 0ebb0c0d4d [SPARK-23754][PYTHON] Re-raising StopIteration in client code
## What changes were proposed in this pull request?

Make sure that `StopIteration`s raised in users' code do not silently interrupt processing by spark, but are raised as exceptions to the users. The users' functions are wrapped in `safe_iter` (in `shuffle.py`), which re-raises `StopIteration`s as `RuntimeError`s

## How was this patch tested?

Unit tests, making sure that the exceptions are indeed raised. I am not sure how to check whether a `Py4JJavaError` contains my exception, so I simply looked for the exception message in the java exception's `toString`. Can you propose a better way?

## License

This is my original work, licensed in the same way as spark

Author: e-dorigatti <emilio.dorigatti@gmail.com>
Author: edorigatti <emilio.dorigatti@gmail.com>

Closes #21383 from e-dorigatti/fix_spark_23754.
2018-05-30 18:11:33 +08:00
Liang-Chi Hsieh b54bbe57b3 [SPARK-24131][PYSPARK][FOLLOWUP] Add majorMinorVersion API to PySpark for determining Spark versions
## What changes were proposed in this pull request?

More close to Scala API behavior when can't parse input by throwing exception. Add tests.

## How was this patch tested?

Added tests.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #21211 from viirya/SPARK-24131-followup.
2018-05-08 21:22:54 +08:00
Liang-Chi Hsieh e15850be6e [SPARK-24131][PYSPARK] Add majorMinorVersion API to PySpark for determining Spark versions
## What changes were proposed in this pull request?

We need to determine Spark major and minor versions in PySpark. We can add a `majorMinorVersion` API to PySpark which is similar to the Scala API in `VersionUtils.majorMinorVersion`.

## How was this patch tested?

Added tests.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #21203 from viirya/SPARK-24131.
2018-05-02 10:55:01 +08:00
Bryan Cutler a9350d7095 [SPARK-23700][PYTHON] Cleanup imports in pyspark.sql
## What changes were proposed in this pull request?

This cleans up unused imports, mainly from pyspark.sql module.  Added a note in function.py that imports `UserDefinedFunction` only to maintain backwards compatibility for using `from pyspark.sql.function import UserDefinedFunction`.

## How was this patch tested?

Existing tests and built docs.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #20892 from BryanCutler/pyspark-cleanup-imports-SPARK-23700.
2018-03-26 12:42:32 +09:00
Benjamin Peterson 7013eea11c [SPARK-23522][PYTHON] always use sys.exit over builtin exit
The exit() builtin is only for interactive use. applications should use sys.exit().

## What changes were proposed in this pull request?

All usage of the builtin `exit()` function is replaced by `sys.exit()`.

## How was this patch tested?

I ran `python/run-tests`.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Benjamin Peterson <benjamin@python.org>

Closes #20682 from benjaminp/sys-exit.
2018-03-08 20:38:34 +09:00
Li Jin 2cb23a8f51 [SPARK-23011][SQL][PYTHON] Support alternative function form with group aggregate pandas UDF
## What changes were proposed in this pull request?

This PR proposes to support an alternative function from with group aggregate pandas UDF.

The current form:
```
def foo(pdf):
    return ...
```
Takes a single arg that is a pandas DataFrame.

With this PR, an alternative form is supported:
```
def foo(key, pdf):
    return ...
```
The alternative form takes two argument - a tuple that presents the grouping key, and a pandas DataFrame represents the data.

## How was this patch tested?

GroupbyApplyTests

Author: Li Jin <ice.xelloss@gmail.com>

Closes #20295 from icexelloss/SPARK-23011-groupby-apply-key.
2018-03-08 20:29:07 +09:00
hyukjinkwon fab563b9bd [SPARK-23517][PYTHON] Make pyspark.util._exception_message produce the trace from Java side by Py4JJavaError
## What changes were proposed in this pull request?

This PR proposes for `pyspark.util._exception_message` to produce the trace from Java side by `Py4JJavaError`.

Currently, in Python 2, it uses `message` attribute which `Py4JJavaError` didn't happen to have:

```python
>>> from pyspark.util import _exception_message
>>> try:
...     sc._jvm.java.lang.String(None)
... except Exception as e:
...     pass
...
>>> e.message
''
```

Seems we should use `str` instead for now:

 aa6c53b590/py4j-python/src/py4j/protocol.py (L412)

but this doesn't address the problem with non-ascii string from Java side -
 `https://github.com/bartdag/py4j/issues/306`

So, we could directly call `__str__()`:

```python
>>> e.__str__()
u'An error occurred while calling None.java.lang.String.\n: java.lang.NullPointerException\n\tat java.lang.String.<init>(String.java:588)\n\tat sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)\n\tat sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)\n\tat sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)\n\tat java.lang.reflect.Constructor.newInstance(Constructor.java:422)\n\tat py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)\n\tat py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\tat py4j.Gateway.invoke(Gateway.java:238)\n\tat py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)\n\tat py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)\n\tat py4j.GatewayConnection.run(GatewayConnection.java:214)\n\tat java.lang.Thread.run(Thread.java:745)\n'
```

which doesn't type coerce unicodes to `str` in Python 2.

This can be actually a problem:

```python
from pyspark.sql.functions import udf
spark.conf.set("spark.sql.execution.arrow.enabled", True)
spark.range(1).select(udf(lambda x: [[]])()).toPandas()
```

**Before**

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/dataframe.py", line 2009, in toPandas
    raise RuntimeError("%s\n%s" % (_exception_message(e), msg))
RuntimeError:
Note: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true. Please set it to false to disable this.
```

**After**

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/dataframe.py", line 2009, in toPandas
    raise RuntimeError("%s\n%s" % (_exception_message(e), msg))
RuntimeError: An error occurred while calling o47.collectAsArrowToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 0.0 failed 1 times, most recent failure: Lost task 7.0 in stage 0.0 (TID 7, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/.../spark/python/pyspark/worker.py", line 245, in main
    process()
  File "/.../spark/python/pyspark/worker.py", line 240, in process
...
Note: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true. Please set it to false to disable this.
```

## How was this patch tested?

Manually tested and unit tests were added.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #20680 from HyukjinKwon/SPARK-23517.
2018-03-01 00:44:13 +09:00
David Gingrich 6297697f97 [SPARK-19505][PYTHON] AttributeError on Exception.message in Python3
## What changes were proposed in this pull request?

Added `util._message_exception` helper to use `str(e)` when `e.message` is unavailable (Python3).  Grepped for all occurrences of `.message` in `pyspark/` and these were the only occurrences.

## How was this patch tested?

- Doctests for helper function

## Legal

This is my original work and I license the work to the project under the project’s open source license.

Author: David Gingrich <david@textio.com>

Closes #16845 from dgingrich/topic-spark-19505-py3-exceptions.
2017-04-11 12:18:31 -07:00