spark-instrumented-optimizer/python/pyspark
Matei Zaharia c344ed04c7 Merge pull request #283 from tmyklebu/master
Python bindings for mllib

This pull request contains Python bindings for the regression, clustering, classification, and recommendation tools in mllib.

For each 'train' frontend exposed, there is a Scala stub in PythonMLLibAPI.scala and a Python stub in mllib.py.  The Python stub serialises the input RDD and any vector/matrix arguments into a mutually-understood format and calls the Scala stub.  The Scala stub deserialises the RDD and the vector/matrix arguments, calls the appropriate 'train' function, serialises the resulting model, and returns the serialised model.
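
The serialise/deserialise round trip described above can be sketched in Python as follows. This is only an illustration: the byte layout (an 8-byte length followed by 8-byte doubles) and the function names `serialize_double_vector` and `deserialize_double_vector` are assumptions made for this example, not the format or names the bindings actually use.

```python
import struct


def serialize_double_vector(values):
    """Pack floats as an 8-byte length followed by that many 8-byte doubles.

    Hypothetical layout, chosen only for illustration.
    """
    # "=" selects native byte order with standard sizes and no padding.
    header = struct.pack("=q", len(values))
    body = struct.pack("=%dd" % len(values), *values)
    return header + body


def deserialize_double_vector(data):
    """Inverse of serialize_double_vector; rejects buffers of the wrong size."""
    (length,) = struct.unpack_from("=q", data, 0)
    expected_size = 8 + 8 * length
    if len(data) != expected_size:
        raise ValueError("expected %d bytes, got %d" % (expected_size, len(data)))
    return list(struct.unpack_from("=%dd" % length, data, 8))


payload = serialize_double_vector([1.0, -2.5, 3.25])
restored = deserialize_double_vector(payload)
```

Both sides have to agree on every detail of such a layout (field order, field widths, byte order), which is what "mutually-understood format" requires in practice.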

ALSModel is slightly different since a MatrixFactorizationModel has RDDs inside.  The Scala stub returns a handle to a Scala MatrixFactorizationModel; prediction is done by calling the Scala predict method.
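
The handle-based design for ALSModel can be sketched like this. `StubScalaModel` is a hypothetical in-process stand-in for the JVM-side MatrixFactorizationModel, and `ModelHandle` is a hypothetical name for the Python wrapper; the real bindings go through the Java gateway rather than a local object.

```python
class StubScalaModel:
    """Hypothetical stand-in for the Scala MatrixFactorizationModel."""

    def __init__(self, user_factors, product_factors):
        # In Spark these factors live in RDDs on the JVM side, which is why
        # the model is not serialised and shipped back to Python.
        self._user_factors = user_factors
        self._product_factors = product_factors

    def predict(self, user, product):
        # A rating is the dot product of the user and product factor vectors.
        u = self._user_factors[user]
        p = self._product_factors[product]
        return sum(a * b for a, b in zip(u, p))


class ModelHandle:
    """Python-side wrapper that keeps a handle and delegates prediction."""

    def __init__(self, scala_model):
        self._scala_model = scala_model

    def predict(self, user, product):
        # No model data is copied into Python; the call is forwarded.
        return self._scala_model.predict(user, product)


stub = StubScalaModel({1: [1.0, 2.0]}, {7: [3.0, 4.0]})
model = ModelHandle(stub)
rating = model.predict(1, 7)
```

The trade-off is that every prediction crosses the Python/JVM boundary, in exchange for never having to serialise the RDDs held inside the model.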

I have tested these bindings on an x86_64 machine running Linux.  There is a risk that these bindings may fail on a bi-endian platform if the byte order Python uses differs from what java.nio.ByteBuffer considers the native byte order.
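
To illustrate the byte-order risk, the sketch below packs the same double in little-endian and big-endian order and shows that reading it with the wrong order yields a different value. The helpers `pack_doubles` and `unpack_doubles` are hypothetical and not part of the bindings. Pinning an explicit byte order on both sides is one possible mitigation, offered here as an assumption rather than as what the bindings do.

```python
import struct
import sys


def pack_doubles(values, byte_order="<"):
    """Pack floats as 8-byte doubles; "<" is little-endian, ">" is big-endian."""
    return struct.pack("%s%dd" % (byte_order, len(values)), *values)


def unpack_doubles(data, byte_order="<"):
    """Unpack a buffer of 8-byte doubles using the given byte order."""
    count = len(data) // 8
    return list(struct.unpack("%s%dd" % (byte_order, count), data))


little = pack_doubles([1.0], "<")
big = pack_doubles([1.0], ">")

# Same value, opposite byte layouts.
same_layout = little == big

# Reading little-endian bytes as big-endian silently gives a wrong number.
misread = unpack_doubles(little, ">")[0]

# The order the Python side would use if it packed with native order.
native_order = sys.byteorder  # "little" or "big"
```

On the JVM side, `java.nio.ByteOrder.nativeOrder()` reports the platform's native order, so a mismatch can only arise if the two runtimes disagree about it or if one side uses a fixed order and the other uses the native one.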
2013-12-26 01:31:06 -05:00
mllib Remove commented code in __init__.py. 2013-12-25 14:12:42 -05:00
__init__.py Split the mllib bindings into a whole bunch of modules and rename some things. 2013-12-25 00:08:05 -05:00
accumulators.py Add custom serializer support to PySpark. 2013-11-10 16:45:38 -08:00
broadcast.py Add Apache license headers and LICENSE and NOTICE files 2013-07-16 17:21:33 -07:00
cloudpickle.py Rename top-level 'pyspark' directory to 'python' 2013-01-01 15:05:00 -08:00
context.py Add collectPartition to JavaRDD interface. 2013-12-18 11:40:07 -08:00
daemon.py Add Apache license headers and LICENSE and NOTICE files 2013-07-16 17:21:33 -07:00
files.py Initial work to rename package to org.apache.spark 2013-09-01 14:13:13 -07:00
java_gateway.py Python change for move of PythonMLLibAPI. 2013-12-24 16:49:03 -05:00
join.py Change numSplits to numPartitions in PySpark. 2013-02-24 13:25:09 -08:00
rdd.py Merge pull request #276 from shivaram/collectPartition 2013-12-19 13:35:09 -08:00
rddsampler.py RDD sample() and takeSample() prototypes for PySpark 2013-08-28 16:46:13 -07:00
serializers.py The rest of the Python side of those bindings. 2013-12-19 01:29:51 -05:00
shell.py Typo: avaiable -> available 2013-12-24 17:25:04 -08:00
statcounter.py Implementing SPARK-838: Add DoubleRDDFunctions methods to PySpark 2013-08-21 17:05:58 -07:00
storagelevel.py Export StorageLevel and refactor 2013-09-07 14:41:31 -07:00
tests.py Fix UnicodeEncodeError in PySpark saveAsTextFile(). 2013-11-28 23:44:56 -08:00
worker.py FramedSerializer: _dumps => dumps, _loads => loads. 2013-11-10 17:53:25 -08:00