spark-instrumented-optimizer/python/pyspark
Bryan Cutler 77cc0d67d5 [SPARK-12717][PYTHON] Adding thread-safe broadcast pickle registry
## What changes were proposed in this pull request?

When using PySpark broadcast variables in a multi-threaded environment,  `SparkContext._pickled_broadcast_vars` becomes a shared resource.  A race condition can occur when broadcast variables that are pickled from one thread get added to the shared ` _pickled_broadcast_vars` and become part of the python command from another thread.  This PR introduces a thread-safe pickled registry using thread local storage so that when python command is pickled (causing the broadcast variable to be pickled and added to the registry) each thread will have their own view of the pickle registry to retrieve and clear the broadcast variables used.

## How was this patch tested?

Added a unit test that causes this race condition using another thread.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #18695 from BryanCutler/pyspark-bcast-threadsafe-SPARK-12717.
2017-08-02 07:12:23 +09:00
..
ml [SPARK-21388][ML][PYSPARK] GBTs inherit from HasStepSize & LInearSVC from HasThreshold 2017-08-01 21:34:26 +08:00
mllib [SPARK-20862][MLLIB][PYTHON] Avoid passing float to ndarray.reshape in LogisticRegressionModel 2017-05-24 22:55:38 +08:00
sql [SPARK-20090][PYTHON] Add StructType.fieldNames in PySpark 2017-07-28 20:59:32 -07:00
streaming [SPARK-20285][TESTS] Increase the pyspark streaming test timeout to 30 seconds 2017-04-10 14:06:49 -07:00
__init__.py [MINOR] Fix some typo of the document 2017-06-19 20:35:58 +01:00
accumulators.py [SPARK-8652] [PYSPARK] Check return value for all uses of doctest.testmod() 2015-06-26 08:12:22 -07:00
broadcast.py [SPARK-12717][PYTHON] Adding thread-safe broadcast pickle registry 2017-08-02 07:12:23 +09:00
cloudpickle.py [SPARK-19505][PYTHON] AttributeError on Exception.message in Python3 2017-04-11 12:18:31 -07:00
conf.py [SPARK-18447][DOCS] Fix the markdown for Note:/NOTE:/Note that across Python API documentation 2016-11-22 11:40:18 +00:00
context.py [SPARK-12717][PYTHON] Adding thread-safe broadcast pickle registry 2017-08-02 07:12:23 +09:00
daemon.py [SPARK-4897] [PySpark] Python 3 support 2015-04-16 16:20:57 -07:00
files.py
find_spark_home.py [SPARK-1267][SPARK-18129] Allow PySpark to be pip installed 2016-11-16 14:22:15 -08:00
heapq3.py [SPARK-8652] [PYSPARK] Check return value for all uses of doctest.testmod() 2015-06-26 08:12:22 -07:00
java_gateway.py [SPARK-1267][SPARK-18129] Allow PySpark to be pip installed 2016-11-16 14:22:15 -08:00
join.py [SPARK-14202] [PYTHON] Use generator expression instead of list comp in python_full_outer_jo… 2016-03-28 14:51:36 -07:00
profiler.py [SPARK-8652] [PYSPARK] Check return value for all uses of doctest.testmod() 2015-06-26 08:12:22 -07:00
rdd.py [SPARK-21358][EXAMPLES] Argument of repartitionandsortwithinpartitions at pyspark 2017-07-10 18:56:54 -07:00
rddsampler.py [SPARK-4897] [PySpark] Python 3 support 2015-04-16 16:20:57 -07:00
resultiterable.py [SPARK-3074] [PySpark] support groupByKey() with single huge key 2015-04-09 17:07:23 -07:00
serializers.py [SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas 2017-07-10 15:21:03 -07:00
shell.py [SPARK-19570][PYSPARK] Allow to disable hive in pyspark shell 2017-04-12 10:54:50 -07:00
shuffle.py [SPARK-10710] Remove ability to disable spilling in core and SQL 2015-09-19 21:40:21 -07:00
statcounter.py [SPARK-6919] [PYSPARK] Add asDict method to StatCounter 2015-09-29 13:38:15 -07:00
status.py [SPARK-4172] [PySpark] Progress API in Python 2015-02-17 13:36:43 -08:00
storagelevel.py [SPARK-13992][CORE][PYSPARK][FOLLOWUP] Update OFF_HEAP semantics for Java api and Python api 2016-04-12 23:06:55 -07:00
taskcontext.py [SPARK-18576][PYTHON] Add basic TaskContext information to PySpark 2016-12-20 15:51:21 -08:00
tests.py [SPARK-12717][PYTHON] Adding thread-safe broadcast pickle registry 2017-08-02 07:12:23 +09:00
traceback_utils.py
util.py [SPARK-19505][PYTHON] AttributeError on Exception.message in Python3 2017-04-11 12:18:31 -07:00
version.py [MINOR] Bump SparkR and PySpark version to 2.3.0. 2017-06-19 11:13:03 +01:00
worker.py [SPARK-20685] Fix BatchPythonEvaluation bug in case of single UDF w/ repeated arg. 2017-05-10 16:50:57 -07:00