2017-04-11 15:18:31 -04:00
# -*- coding: utf-8 -*-
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
2018-03-08 06:29:07 -05:00
[SPARK-32010][PYTHON][CORE] Add InheritableThread for local properties and fixing a thread leak issue in pinned thread mode
### What changes were proposed in this pull request?
This PR proposes:
1. To introduce an `InheritableThread` class that works identically to `threading.Thread` but can inherit the inheritable attributes of a JVM thread, such as `InheritableThreadLocal`.
This was a problem in pinned thread mode, see also https://github.com/apache/spark/pull/24898. Now it works as below:
```python
import pyspark
spark.sparkContext.setLocalProperty("a", "hi")
def print_prop():
    print(spark.sparkContext.getLocalProperty("a"))
pyspark.InheritableThread(target=print_prop).start()
```
```
hi
```
2. Also, it adds a resource leak fix to `InheritableThread`. Py4J leaks the thread and does not close the connection from Python to the JVM. `InheritableThread` manually closes the connection when the Python VM garbage-collects the thread instance, so JVM threads finish safely. I verified this manually by profiling, but there is also another easy way to verify:
```bash
PYSPARK_PIN_THREAD=true ./bin/pyspark
```
```python
>>> from threading import Thread
>>> Thread(target=lambda: spark.range(1000).collect()).start()
>>> Thread(target=lambda: spark.range(1000).collect()).start()
>>> Thread(target=lambda: spark.range(1000).collect()).start()
>>> spark._jvm._gateway_client.deque
deque([<py4j.clientserver.ClientServerConnection object at 0x119f7aba8>, <py4j.clientserver.ClientServerConnection object at 0x119fc9b70>, <py4j.clientserver.ClientServerConnection object at 0x119fc9e10>, <py4j.clientserver.ClientServerConnection object at 0x11a015358>, <py4j.clientserver.ClientServerConnection object at 0x119fc00f0>])
>>> Thread(target=lambda: spark.range(1000).collect()).start()
>>> spark._jvm._gateway_client.deque
deque([<py4j.clientserver.ClientServerConnection object at 0x119f7aba8>, <py4j.clientserver.ClientServerConnection object at 0x119fc9b70>, <py4j.clientserver.ClientServerConnection object at 0x119fc9e10>, <py4j.clientserver.ClientServerConnection object at 0x11a015358>, <py4j.clientserver.ClientServerConnection object at 0x119fc08d0>, <py4j.clientserver.ClientServerConnection object at 0x119fc00f0>])
```
This issue is fixed now.
3. Because now we have a fix for the issue here, it also proposes to deprecate `collectWithJobGroup` which was a temporary workaround added to avoid this leak issue.
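The property inheritance described in item 1 can be illustrated without a JVM. The following is a minimal pure-Python sketch, not the PySpark implementation: `set_prop`, `get_prop`, and `InheritingThread` are hypothetical names, and a `threading.local` object stands in for the JVM's per-thread local properties. It shows the general idea of snapshotting the parent's properties at construction time and installing them in the child before the target runs.

```python
import threading

# Hypothetical stand-in for the JVM's per-thread local properties.
_local = threading.local()


def set_prop(key, value):
    """Set a property that is visible only to the calling thread."""
    if not hasattr(_local, "props"):
        _local.props = {}
    _local.props[key] = value


def get_prop(key):
    """Read a property of the calling thread, or None if it is unset."""
    return getattr(_local, "props", {}).get(key)


class InheritingThread(threading.Thread):
    """Copies the creating thread's properties into the new thread."""

    def __init__(self, target, **kwargs):
        # The snapshot is taken in the parent thread, at construction time.
        snapshot = dict(getattr(_local, "props", {}))

        def run_with_props(*a, **k):
            # Runs in the child thread: install the snapshot, then call target.
            _local.props = dict(snapshot)
            return target(*a, **k)

        super().__init__(target=run_with_props, **kwargs)


set_prop("a", "hi")
seen = {}

# A plain thread starts with empty thread-local state.
plain = threading.Thread(target=lambda: seen.update(plain=get_prop("a")))
plain.start()
plain.join()

# The inheriting thread sees the parent's property.
inheriting = InheritingThread(target=lambda: seen.update(inherited=get_prop("a")))
inheriting.start()
inheriting.join()

print(seen)
```

In this sketch the plain thread reads `None` while the inheriting thread reads `"hi"`, which mirrors the difference between `threading.Thread` and `InheritableThread` in pinned thread mode.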
### Why are the changes needed?
To support pinned thread mode properly, without a resource leak and with properly inheritable local properties.
### Does this PR introduce _any_ user-facing change?
Yes, it adds an API `InheritableThread` class for pinned thread mode.
### How was this patch tested?
Manually tested as described above, and a unit test was added as well.
Closes #28968 from HyukjinKwon/SPARK-32010.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-29 21:15:25 -04:00
import threading
2018-05-01 22:55:01 -04:00
import re
2018-03-08 06:29:07 -05:00
import sys
[SPARK-29341][PYTHON] Upgrade cloudpickle to 1.0.0
### What changes were proposed in this pull request?
This patch upgrades cloudpickle to 1.0.0 version.
Main changes:
1. cleanup unused functions: https://github.com/cloudpipe/cloudpickle/commit/936f16fac89986453c4bb3a4af9f04da16d30a9a
2. Fix relative imports inside function body: https://github.com/cloudpipe/cloudpickle/commit/31ecdd6f57c6013a1affb21f69e86e638f463710
3. Write kw only arguments to pickle: https://github.com/cloudpipe/cloudpickle/commit/6cb47185284548d5706beccd69f172586d127502
### Why are the changes needed?
We should include new bug fixes such as https://github.com/cloudpipe/cloudpickle/commit/6cb47185284548d5706beccd69f172586d127502, because users might use such Python functions in PySpark.
Before:
```python
>>> def f(a, *, b=1):
...     return a + b
...
>>> rdd = sc.parallelize([1, 2, 3])
>>> rdd.map(f).collect()
[Stage 0:> (0 + 12) / 12]19/10/03 00:42:24 ERROR Executor: Exception in task 3.0 in stage 0.0 (TID 3)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/spark/python/lib/pyspark.zip/pyspark/worker.py", line 598, in main
    process()
  File "/spark/python/lib/pyspark.zip/pyspark/worker.py", line 590, in process
    serializer.dump_stream(out_iter, outfile)
  File "/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 513, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/spark/python/lib/pyspark.zip/pyspark/util.py", line 99, in wrapper
    return f(*args, **kwargs)
TypeError: f() missing 1 required keyword-only argument: 'b'
```
After:
```python
>>> def f(a, *, b=1):
...     return a + b
...
>>> rdd = sc.parallelize([1, 2, 3])
>>> rdd.map(f).collect()
[2, 3, 4]
```
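One plausible way to reproduce the same `TypeError` locally, without Spark or cloudpickle, is sketched below. It assumes the failure comes from rebuilding a function without its keyword-only defaults; this is an illustration of that mechanism, not cloudpickle's actual code. Keyword-only defaults live in `__kwdefaults__`, separately from `__defaults__`, so a function rebuilt from its code object and positional defaults alone loses them. The name `rebuilt` is hypothetical.

```python
import types


def f(a, *, b=1):
    return a + b


# Rebuild the function from its parts, as a serializer might, but without
# carrying over the keyword-only defaults stored in f.__kwdefaults__.
rebuilt = types.FunctionType(
    f.__code__, f.__globals__, f.__name__, f.__defaults__, f.__closure__)

try:
    rebuilt(1)
    error = None
except TypeError as exc:
    # 'b' has lost its default, so it is now a required keyword-only argument.
    error = str(exc)

# Restoring the keyword-only defaults makes the rebuilt function usable again.
rebuilt.__kwdefaults__ = dict(f.__kwdefaults__)
results = [rebuilt(x) for x in [1, 2, 3]]

print(error)
print(results)
```

Once `__kwdefaults__` is restored, the rebuilt function behaves like the original for the inputs used in the example above.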
### Does this PR introduce any user-facing change?
Yes. This fixes two bugs when pickling Python functions.
### How was this patch tested?
Existing tests.
Closes #26009 from viirya/upgrade-cloudpickle.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-10-03 06:20:51 -04:00
import traceback
2017-04-11 15:18:31 -04:00
2020-07-29 21:15:25 -04:00
from py4j.clientserver import ClientServer
2017-04-11 15:18:31 -04:00
__all__ = []
2019-10-03 06:20:51 -04:00
def print_exec(stream):
    ei = sys.exc_info()
    traceback.print_exception(ei[0], ei[1], ei[2], None, stream)
2018-05-08 09:22:54 -04:00
class VersionUtils(object):
2018-05-01 22:55:01 -04:00
    """
2018-05-08 09:22:54 -04:00
    Provides utility method to determine Spark versions with given input string.
    """
    @staticmethod
    def majorMinorVersion(sparkVersion):
        """
        Given a Spark version string, return the (major version number, minor version number).
        E.g., for 2.0.1-SNAPSHOT, return (2, 0).
2018-05-01 22:55:01 -04:00
2018-05-08 09:22:54 -04:00
        >>> sparkVersion = "2.4.0"
        >>> VersionUtils.majorMinorVersion(sparkVersion)
        (2, 4)
        >>> sparkVersion = "2.3.0-SNAPSHOT"
        >>> VersionUtils.majorMinorVersion(sparkVersion)
        (2, 3)
2018-05-01 22:55:01 -04:00
2018-05-08 09:22:54 -04:00
        """
2018-09-12 23:19:43 -04:00
        m = re.search(r'^(\d+)\.(\d+)(\..*)?$', sparkVersion)
2018-05-08 09:22:54 -04:00
        if m is not None:
            return (int(m.group(1)), int(m.group(2)))
        else:
            raise ValueError("Spark tried to parse '%s' as a Spark" % sparkVersion +
                             " version string, but it could not find the major and minor" +
                             " version numbers.")
2018-05-01 22:55:01 -04:00
2018-05-30 06:11:33 -04:00
def fail_on_stopiteration(f):
    """
    Wraps the input function to fail on 'StopIteration' by raising a 'RuntimeError'
[SPARK-23754][PYTHON][FOLLOWUP] Move UDF stop iteration wrapping from driver to executor
## What changes were proposed in this pull request?
SPARK-23754 was fixed in #21383 by changing the UDF code to wrap the user function, but this required a hack to save its argspec. This PR reverts this change and fixes the `StopIteration` bug in the worker
## How does this work?
The root of the problem is that when a user-supplied function raises a `StopIteration`, PySpark might stop processing data if this function is used in a for-loop. The solution is to catch `StopIteration` exceptions and re-raise them as `RuntimeError`s, so that the execution fails and the error is reported to the user. This is done using the `fail_on_stopiteration` wrapper, in different ways depending on where the function is used:
- In RDDs, the user function is wrapped in the driver, because this function is also called in the driver itself.
- In SQL UDFs, the function is wrapped in the worker, since all processing happens there. Moreover, the worker needs the signature of the user function, which is lost when wrapping it, but passing this signature to the worker requires a not so nice hack.
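The silent data loss described above can be reproduced in plain Python. The sketch below is an illustration under stated assumptions: `user_func` is a hypothetical user function, `map` stands in for Spark's internal iteration over records, and the wrapper is a minimal restatement of `fail_on_stopiteration`.

```python
def fail_on_stopiteration(f):
    """Re-raise StopIteration from f as RuntimeError so that it cannot
    be mistaken for the normal end of an iterator."""
    def wrapper(*args, **kwargs):
        try:
            return f(*args, **kwargs)
        except StopIteration as exc:
            raise RuntimeError(
                "Caught StopIteration thrown from user's code; failing the task",
                exc
            )
    return wrapper


def user_func(x):
    # Hypothetical buggy user function: leaks a StopIteration for one record.
    if x == 2:
        raise StopIteration()
    return x * 10


# Unwrapped: the StopIteration escapes through map() and is treated as the
# end of the data, so records 2 and 3 are silently dropped.
truncated = list(map(user_func, [1, 2, 3]))

# Wrapped: the same bug now surfaces as an error instead of missing data.
try:
    list(map(fail_on_stopiteration(user_func), [1, 2, 3]))
    failed = False
except RuntimeError:
    failed = True

print(truncated, failed)
```

Without the wrapper only the first record survives; with it, the task fails loudly, which is the behavior this PR aims for.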
## How was this patch tested?
Same tests, plus tests for pandas UDFs
Author: edorigatti <emilio.dorigatti@gmail.com>
Closes #21467 from e-dorigatti/fix_udf_hack.
2018-06-10 22:15:42 -04:00
    prevents silent loss of data when 'f' is used in a for loop in Spark code
2018-05-30 06:11:33 -04:00
    """
    def wrapper(*args, **kwargs):
        try:
            return f(*args, **kwargs)
        except StopIteration as exc:
            raise RuntimeError(
                "Caught StopIteration thrown from user's code; failing the task",
                exc
            )

    return wrapper
2019-03-10 21:15:07 -04:00
def _print_missing_jar(lib_name, pkg_name, jar_name, spark_version):
    print("""
________________________________________________________________________________________________

  Spark %(lib_name)s libraries not found in class path. Try one of the following.

  1. Include the %(lib_name)s library and its dependencies with in the
     spark-submit command as

     $ bin/spark-submit --packages org.apache.spark:spark-%(pkg_name)s:%(spark_version)s ...

  2. Download the JAR of the artifact from Maven Central http://search.maven.org/,
     Group Id = org.apache.spark, Artifact Id = spark-%(jar_name)s, Version = %(spark_version)s.
     Then, include the jar in the spark-submit command as

     $ bin/spark-submit --jars <spark-%(jar_name)s.jar> ...

________________________________________________________________________________________________

""" % {
        "lib_name": lib_name,
        "pkg_name": pkg_name,
        "jar_name": jar_name,
        "spark_version": spark_version
    })
2020-04-22 21:20:39 -04:00
def _parse_memory(s):
    """
    Parse a memory string in the format supported by Java (e.g. 1g, 200m) and
    return the value in MiB

    >>> _parse_memory("256m")
    256
    >>> _parse_memory("2g")
    2048
    """
    units = {'g': 1024, 'm': 1, 't': 1 << 20, 'k': 1.0 / 1024}
    if s[-1].lower() not in units:
        raise ValueError("invalid format: " + s)
    return int(float(s[:-1]) * units[s[-1].lower()])
2020-07-29 21:15:25 -04:00
class InheritableThread(threading.Thread):
    """
    Thread that is recommended to be used in PySpark instead of :class:`threading.Thread`
    when the pinned thread mode is enabled. The usage of this class is exactly same as
    :class:`threading.Thread` but correctly inherits the inheritable properties specific
    to JVM thread such as ``InheritableThreadLocal``.

    Also, note that pinned thread mode does not close the connection from Python
    to JVM when the thread is finished in the Python side. With this class, Python
    garbage-collects the Python thread instance and also closes the connection
    which finishes JVM thread correctly.

    When the pinned thread mode is off, this works as :class:`threading.Thread`.

    .. note:: Experimental

    .. versionadded:: 3.1.0
    """
    def __init__(self, target, *args, **kwargs):
        from pyspark import SparkContext

        sc = SparkContext._active_spark_context

        if isinstance(sc._gateway, ClientServer):
            # Here's when the pinned-thread mode (PYSPARK_PIN_THREAD) is on.
            properties = sc._jsc.sc().getLocalProperties().clone()
            self._sc = sc

            def copy_local_properties(*a, **k):
                sc._jsc.sc().setLocalProperties(properties)
                return target(*a, **k)

            super(InheritableThread, self).__init__(
                target=copy_local_properties, *args, **kwargs)
        else:
            super(InheritableThread, self).__init__(target=target, *args, **kwargs)

    def __del__(self):
        from pyspark import SparkContext

        if isinstance(SparkContext._gateway, ClientServer):
            thread_connection = self._sc._jvm._gateway_client.thread_connection.connection()
            if thread_connection is not None:
                connections = self._sc._jvm._gateway_client.deque

                # Reuse the lock for Py4J in PySpark
                with SparkContext._lock:
                    for i in range(len(connections)):
                        if connections[i] is thread_connection:
                            connections[i].close()
                            del connections[i]
                            break
                    else:
                        # Just in case the connection was not closed but removed from the queue.
                        thread_connection.close()
2017-04-11 15:18:31 -04:00
if __name__ == "__main__":
    import doctest
    (failure_count, test_count) = doctest.testmod()
    if failure_count:
2018-03-08 06:38:34 -05:00
        sys.exit(-1)