The reason it wasn't working was that we were passing a bytearray to stream.write(), which is supported in Python 2.7 but not in 2.6. (The array came from NumPy when we converted data to send it over to Java.) We now convert those bytearrays to strings of bytes, which also preserves nonprintable characters.
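A minimal sketch of the conversion, assuming a BytesIO stand-in for the worker's output stream and a small NumPy array in place of the real data:

```python
from io import BytesIO

import numpy as np

stream = BytesIO()  # stand-in for the worker's output stream

# NumPy-derived data reaches the writer as a bytearray; here we
# fabricate one from a small array purely for illustration.
buf = bytearray(np.arange(4, dtype='int32').tobytes())

# stream.write(buf) raises TypeError on Python 2.6, so convert to a
# string of bytes first (bytes is str on Python 2); the conversion
# preserves nonprintable characters.
stream.write(bytes(buf))
```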
Author: Matei Zaharia <matei@databricks.com>
Closes #335 from mateiz/mllib-python-2.6 and squashes the following commits:
f26c59f [Matei Zaharia] Update docs to no longer say we need Python 2.7
a84d6af [Matei Zaharia] SPARK-1421. Make MLlib work on Python 2.6
Also clarified the comment about each file having to fit in memory.
Author: Matei Zaharia <matei@databricks.com>
Closes #327 from mateiz/py-whole-files and squashes the following commits:
9ad64a5 [Matei Zaharia] SPARK-1414. Python API for SparkContext.wholeTextFiles
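A usage sketch of the new API; the input path is a placeholder:

```python
from pyspark import SparkContext

sc = SparkContext("local", "whole-text-files-demo")

# Each element is a (filename, content) pair; unlike textFile(), the
# entire contents of each file must fit in memory.
pairs = sc.wholeTextFiles("file:///path/to/dir")  # placeholder path
for name, content in pairs.take(2):
    print(name, len(content))
```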
This was raised earlier as part of apache/incubator-spark#486.
Author: Prabin Banka <prabin.banka@imaginea.com>
Closes #76 from prabinb/python-api-zip and squashes the following commits:
b1a31a0 [Prabin Banka] Added Python RDD.zip function
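A usage sketch of the new method, with illustrative local-mode data:

```python
from pyspark import SparkContext

sc = SparkContext("local", "zip-demo")

xs = sc.parallelize(range(5))
ys = xs.map(lambda x: x * x)

# zip() pairs the i-th elements of the two RDDs; both must have the
# same number of partitions and elements per partition.
print(xs.zip(ys).collect())  # [(0, 0), (1, 1), (2, 4), (3, 9), (4, 16)]
```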
This fixes SPARK-1043, a bug introduced in 0.9.0
where PySpark couldn't serialize strings > 64kB.
This fix was written by @tyro89 and @bouk in #512.
This commit squashes and rebases their pull request
in order to fix some merge conflicts.
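A minimal reproduction sketch of the symptom, assuming a local SparkContext; before this fix the job failed, and with it the assertion passes:

```python
from pyspark import SparkContext

sc = SparkContext("local", "large-string-demo")

# A 1 MiB string, well past the old 64kB limit.
big = "x" * (1 << 20)
assert sc.parallelize([big]).first() == big
```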
For now, this only adds MarshalSerializer, but it lays the groundwork
for supporting other custom serializers. Many of these mechanisms
can also be used to support deserialization of different data formats
sent by Java, such as data encoded by MsgPack.
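A usage sketch of the new serializer, assuming a local SparkContext:

```python
from pyspark import SparkContext
from pyspark.serializers import MarshalSerializer

# marshal is faster than pickle but supports fewer Python types.
sc = SparkContext("local", "marshal-demo", serializer=MarshalSerializer())
print(sc.parallelize(range(10)).map(lambda x: 2 * x).sum())  # 90
```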
This also fixes a bug in SparkContext.union(). If we support custom
serializers, the Python worker will know what type of input to expect,
so we won't need to wrap Tuple2 and String values in pickled tuples
and strings.
Currently, PythonPartitioner determines the partition ID by hashing a
byte-array representation of PySpark's key. This PR lets
PythonPartitioner use the actual partition ID, which is required, e.g.,
for sorting via PySpark.
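A minimal sketch of what this enables, assuming a local two-partition context: sortByKey() relies on a range partitioner, so each key must land in the exact partition the planner assigned rather than one derived from hashing the pickled key bytes.

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "sort-demo")

pairs = sc.parallelize([(3, "c"), (1, "a"), (2, "b")])
print(pairs.sortByKey().collect())  # [(1, 'a'), (2, 'b'), (3, 'c')]
```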