# -*- coding: utf-8 -*-
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import re
import sys
import traceback
import os
import warnings
import inspect
from py4j.protocol import Py4JJavaError

__all__ = []


def _exception_message(excp):
    """Return the message from an exception as either a str or unicode object. Supports both
    Python 2 and Python 3.

    >>> msg = "Exception message"
    >>> excp = Exception(msg)
    >>> msg == _exception_message(excp)
    True

    >>> msg = u"unicöde"
    >>> excp = Exception(msg)
    >>> msg == _exception_message(excp)
    True
    """
    if isinstance(excp, Py4JJavaError):
        # 'Py4JJavaError' doesn't contain the stack trace available on the Java side in 'message'
        # attribute in Python 2. We should call 'str' function on this exception in general but
        # 'Py4JJavaError' has an issue about addressing non-ascii strings. So, here we work
        # around by the direct call, '__str__()'. Please see SPARK-23517.
        return excp.__str__()
    if hasattr(excp, "message"):
        return excp.message
    return str(excp)


def _get_argspec(f):
    """
    Get argspec of a function. Supports both Python 2 and Python 3.
    """
    if sys.version_info[0] < 3:
        argspec = inspect.getargspec(f)
    else:
        # `getargspec` is deprecated since python3.0 (incompatible with function annotations).
        # See SPARK-23569.
        argspec = inspect.getfullargspec(f)
    return argspec

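As a quick illustration of what `_get_argspec` returns on Python 3, the standalone sketch below calls `inspect.getfullargspec` directly on a hypothetical sample function (`add` is not part of this module):

```python
import inspect


def add(a, b=1, *args, **kwargs):
    return a + b


# getfullargspec reports positional arguments, varargs, keyword args
# and defaults; PySpark uses this to inspect user-supplied functions.
spec = inspect.getfullargspec(add)
print(spec.args)      # ['a', 'b']
print(spec.varargs)   # args
print(spec.defaults)  # (1,)
```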
def print_exec(stream):
    ei = sys.exc_info()
    traceback.print_exception(ei[0], ei[1], ei[2], None, stream)


class VersionUtils(object):
    """
    Provides utility method to determine Spark versions with given input string.
    """

    @staticmethod
    def majorMinorVersion(sparkVersion):
        """
        Given a Spark version string, return the (major version number, minor version number).
        E.g., for 2.0.1-SNAPSHOT, return (2, 0).

        >>> sparkVersion = "2.4.0"
        >>> VersionUtils.majorMinorVersion(sparkVersion)
        (2, 4)
        >>> sparkVersion = "2.3.0-SNAPSHOT"
        >>> VersionUtils.majorMinorVersion(sparkVersion)
        (2, 3)

        """
        m = re.search(r'^(\d+)\.(\d+)(\..*)?$', sparkVersion)
        if m is not None:
            return (int(m.group(1)), int(m.group(2)))
        else:
            raise ValueError("Spark tried to parse '%s' as a Spark" % sparkVersion +
                             " version string, but it could not find the major and minor" +
                             " version numbers.")

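The extraction that `majorMinorVersion` performs can be checked standalone; the sketch below reuses the same regular expression outside the class (the helper name `major_minor` is hypothetical):

```python
import re


def major_minor(version):
    # Group 1 is the major version, group 2 the minor; the optional
    # third group swallows any patch/suffix such as ".0-SNAPSHOT".
    m = re.search(r'^(\d+)\.(\d+)(\..*)?$', version)
    if m is None:
        raise ValueError("cannot parse Spark version: %s" % version)
    return (int(m.group(1)), int(m.group(2)))


print(major_minor("2.4.0"))           # (2, 4)
print(major_minor("2.3.0-SNAPSHOT"))  # (2, 3)
```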
def fail_on_stopiteration(f):
    """
    Wraps the input function to fail on 'StopIteration' by raising a 'RuntimeError';
    this prevents silent loss of data when 'f' is used in a for loop in Spark code.
    """
    def wrapper(*args, **kwargs):
        try:
            return f(*args, **kwargs)
        except StopIteration as exc:
            raise RuntimeError(
                "Caught StopIteration thrown from user's code; failing the task",
                exc
            )

    return wrapper

def _warn_pin_thread(name):
    if os.environ.get("PYSPARK_PIN_THREAD", "false").lower() == "true":
        msg = (
            "PYSPARK_PIN_THREAD feature is enabled. "
            "However, note that it cannot inherit the local properties from the parent thread "
            "although it isolates each thread on PVM and JVM with its own local properties. "
            "\n"
            "To work around this, you should manually copy and set the local properties from "
            "the parent thread to the child thread when you create another thread.")
    else:
        msg = (
            "Currently, '%s' (set to local properties) with multiple threads does "
            "not properly work. "
            "\n"
            "Internally threads on PVM and JVM are not synced, and JVM thread can be reused "
            "for multiple threads on PVM, which fails to isolate local properties for each "
            "thread on PVM. "
            "\n"
            "To work around this, you can set PYSPARK_PIN_THREAD to true (see SPARK-22340). "
            "However, note that it cannot inherit the local properties from the parent thread "
            "although it isolates each thread on PVM and JVM with its own local properties. "
            "\n"
            "To work around this, you should manually copy and set the local properties from "
            "the parent thread to the child thread when you create another thread." % name)
    warnings.warn(msg, UserWarning)


def _print_missing_jar(lib_name, pkg_name, jar_name, spark_version):
    print("""
________________________________________________________________________________________________

  Spark %(lib_name)s libraries not found in class path. Try one of the following.

  1. Include the %(lib_name)s library and its dependencies within the
     spark-submit command as

     $ bin/spark-submit --packages org.apache.spark:spark-%(pkg_name)s:%(spark_version)s ...

  2. Download the JAR of the artifact from Maven Central http://search.maven.org/,
     Group Id = org.apache.spark, Artifact Id = spark-%(jar_name)s, Version = %(spark_version)s.
     Then, include the jar in the spark-submit command as

     $ bin/spark-submit --jars <spark-%(jar_name)s.jar> ...

________________________________________________________________________________________________

""" % {
        "lib_name": lib_name,
        "pkg_name": pkg_name,
        "jar_name": jar_name,
        "spark_version": spark_version
    })


def _parse_memory(s):
    """
    Parse a memory string in the format supported by Java (e.g. 1g, 200m) and
    return the value in MiB

    >>> _parse_memory("256m")
    256
    >>> _parse_memory("2g")
    2048
    """
    units = {'g': 1024, 'm': 1, 't': 1 << 20, 'k': 1.0 / 1024}
    if s[-1].lower() not in units:
        raise ValueError("invalid format: " + s)
    return int(float(s[:-1]) * units[s[-1].lower()])

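The unit arithmetic in `_parse_memory` can be verified standalone. The sketch below uses the same conversion table, with every unit expressed relative to one MiB (the helper name `to_mib` is hypothetical):

```python
# 'm' is the base unit (1 MiB); 'g' is 1024 MiB, 't' is 2**20 MiB,
# and 'k' is a fractional 1/1024 MiB, so sub-MiB values truncate to 0.
units = {'g': 1024, 'm': 1, 't': 1 << 20, 'k': 1.0 / 1024}


def to_mib(s):
    suffix = s[-1].lower()
    if suffix not in units:
        raise ValueError("invalid format: " + s)
    return int(float(s[:-1]) * units[suffix])


print(to_mib("256m"))  # 256
print(to_mib("2g"))    # 2048
print(to_mib("1t"))    # 1048576
print(to_mib("512k"))  # 0 (0.5 MiB truncated by int())
```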
if __name__ == "__main__":
    import doctest
    (failure_count, test_count) = doctest.testmod()
    if failure_count:
        sys.exit(-1)