This fixes https://issues.apache.org/jira/browse/SPARK-1731 by adding the Python includes to the PYTHONPATH before depickling the broadcast values
@airhorns
Author: Bouke van der Bijl <boukevanderbijl@gmail.com>
Closes #656 from bouk/python-includes-before-broadcast and squashes the following commits:
7b0dfe4 [Bouke van der Bijl] Add Python includes to path before depickling broadcast values
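A minimal sketch of the idea, assuming hypothetical helper and variable names (the real logic lives in PySpark's worker.py): the shipped include directories are prepended to `sys.path` so that any modules referenced by broadcast values can be imported when they are unpickled.

```python
import os
import sys

def add_includes_to_path(include_paths, spark_files_dir):
    # Prepend each shipped include directory so modules referenced by
    # broadcast values are importable *before* unpickling begins.
    for path in include_paths:
        full_path = os.path.join(spark_files_dir, path)
        if full_path not in sys.path:
            # Insert after position 0 so the script's own dir stays first.
            sys.path.insert(1, full_path)
```

Only after this does the worker read and unpickle the broadcast values from its input stream.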
This surrounds the complete worker code in a try/except block so we catch any error that arises; an example would be the depickling failing for some reason.
@JoshRosen
Author: Bouke van der Bijl <boukevanderbijl@gmail.com>
Closes #644 from bouk/catch-depickling-errors and squashes the following commits:
f0f67cc [Bouke van der Bijl] Lol indentation
0e4d504 [Bouke van der Bijl] Surround the complete python worker with the try block
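A hedged sketch of the shape of this change (the function and stream names here are illustrative, not PySpark's actual API): the whole worker body runs inside one try/except, and on failure the traceback is written back on the output stream so the JVM side can surface it.

```python
import io
import traceback

def run_worker(main, infile, outfile):
    # Wrap the *entire* worker body, so even early failures such as
    # unpickling broadcast values are caught and reported.
    try:
        main(infile, outfile)
    except Exception:
        # Send the traceback back over the stream instead of dying
        # silently, so the JVM side can read and raise the error.
        outfile.write(traceback.format_exc().encode("utf-8"))
        outfile.flush()
        return 1
    return 0
```

The key point is the placement: catching around the whole body, not just the per-record loop, covers setup errors too.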
Fixed minor typo in worker.py
Author: jyotiska <jyotiska123@gmail.com>
Closes #630 from jyotiska/pyspark_code and squashes the following commits:
ee44201 [jyotiska] typo fixed in worker.py
This fixes SPARK-1043, a bug introduced in 0.9.0
where PySpark couldn't serialize strings > 64kB.
This fix was written by @tyro89 and @bouk in #512.
This commit squashes and rebases their pull request
in order to fix some merge conflicts.
This helps when an exception happens while serializing a record to
be sent to Java, which would leave the stream to Java in an inconsistent
state where PythonRDD can't read the error.
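A sketch of length-prefixed framing with a 4-byte big-endian length, which handles payloads well beyond 64 kB (2-byte length fields, as used by formats like Java's `writeUTF`, cap out at 65535 bytes); the function names are illustrative, not PySpark's actual wire protocol:

```python
import struct

def write_with_length(data: bytes, stream) -> None:
    # 4-byte big-endian signed length prefix: no 64 kB cap,
    # unlike a 2-byte length field.
    stream.write(struct.pack("!i", len(data)))
    stream.write(data)

def read_with_length(stream) -> bytes:
    (length,) = struct.unpack("!i", stream.read(4))
    return stream.read(length)
```

With framing like this, the reader always knows exactly how many bytes belong to the next record, regardless of size.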
For now, this only adds MarshalSerializer, but it lays the groundwork
for supporting other custom serializers. Many of these mechanisms
can also be used to support deserialization of different data formats
sent by Java, such as data encoded by MsgPack.
This also fixes a bug in SparkContext.union().
If we support custom serializers, the Python
worker will know what type of input to expect,
so we won't need to wrap Tuple2 and Strings into
pickled tuples and strings.
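An illustrative stand-in for what a marshal-based serializer looks like, built on Python's stdlib `marshal` module (this is a sketch, not PySpark's actual MarshalSerializer class):

```python
import marshal

class MarshalSerializerSketch:
    """Marshal is typically faster than pickle, but only supports a
    limited set of built-in Python types (ints, strings, lists, tuples,
    dicts, ...), so it trades generality for speed."""

    def dumps(self, obj) -> bytes:
        return marshal.dumps(obj)

    def loads(self, data: bytes):
        return marshal.loads(data)
```

Plugging in a serializer like this is what lets the worker know the exact wire format of its input, instead of assuming everything is a pickled object.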