[WIP] SPARK-1430: Support sparse data in Python MLlib
This PR adds a SparseVector class in PySpark and updates all the regression, classification and clustering algorithms and models to support sparse data, similar to MLlib. I chose to add this class because SciPy is quite difficult to install in many environments (more so than NumPy), but I plan to add support for SciPy sparse vectors later too, and make the methods work transparently on objects of either type.
On the Scala side, we keep Python sparse vectors sparse and pass them to MLlib. We always return dense vectors from our models.
Some to-do items left:
- [x] Support SciPy's scipy.sparse matrix objects when SciPy is available. We can easily add a function to convert these to our own SparseVector.
- [x] MLlib currently uses a vector with one extra column on the left to represent what we call LabeledPoint in Scala. Do we really want this? It may get annoying once you deal with sparse data since you must add/subtract 1 to each feature index when training. We can remove this API in 1.0 and use tuples for labeling.
- [x] Explain how to use these in the Python MLlib docs.
CC @mengxr, @joshrosen
Author: Matei Zaharia <matei@databricks.com>
Closes #341 from mateiz/py-ml-update and squashes the following commits:
d52e763 [Matei Zaharia] Remove no-longer-needed slice code and handle review comments
ea5a25a [Matei Zaharia] Fix remaining uses of copyto() after merge
b9f97a3 [Matei Zaharia] Fix test
1e1bd0f [Matei Zaharia] Add MLlib logistic regression example in Python
88bc01f [Matei Zaharia] Clean up inheritance of LinearModel in Python, and expose its parametrs
37ab747 [Matei Zaharia] Fix some examples and docs due to changes in MLlib API
da0f27e [Matei Zaharia] Added a MLlib K-means example and updated docs to discuss sparse data
c48e85a [Matei Zaharia] Added some tests for passing lists as input, and added mllib/tests.py to run-tests script.
a07ba10 [Matei Zaharia] Fix some typos and calculation of initial weights
74eefe7 [Matei Zaharia] Added LabeledPoint class in Python
889dde8 [Matei Zaharia] Support scipy.sparse matrices in all our algorithms and models
ab244d1 [Matei Zaharia] Allow SparseVectors to be initialized using a dict
a5d6426 [Matei Zaharia] Add linalg.py to run-tests script
0e7a3d8 [Matei Zaharia] Keep vectors sparse in Java when reading LabeledPoints
eaee759 [Matei Zaharia] Update regression, classification and clustering models for sparse data
2abbb44 [Matei Zaharia] Further work to get linear models working with sparse data
154f45d [Matei Zaharia] Update docs, name some magic values
881fef7 [Matei Zaharia] Added a sparse vector in Python and made Java-Python format more compact
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

"""
MLlib utilities for linear algebra. For dense vectors, MLlib
uses the NumPy C{array} type, so you can simply pass NumPy arrays
around. For sparse vectors, users can construct a L{SparseVector}
object from MLlib or pass SciPy C{scipy.sparse} column vectors if
SciPy is available in their environment.
"""

import sys
import array

if sys.version >= '3':
    basestring = str
    xrange = range
    import copyreg as copy_reg
else:
    import copy_reg
import numpy as np

from pyspark.sql.types import UserDefinedType, StructField, StructType, ArrayType, DoubleType, \
    IntegerType, ByteType


__all__ = ['Vector', 'DenseVector', 'SparseVector', 'Vectors',
           'Matrix', 'DenseMatrix', 'SparseMatrix', 'Matrices']


if sys.version_info[:2] == (2, 7):
    # speed up pickling array in Python 2.7
    def fast_pickle_array(ar):
        return array.array, (ar.typecode, ar.tostring())
    copy_reg.pickle(array.array, fast_pickle_array)


# Check whether we have SciPy. MLlib works without it too, but if we have it, some methods,
# such as _dot and _serialize_double_vector, start to support scipy.sparse matrices.

try:
    import scipy.sparse
    _have_scipy = True
except ImportError:
    # No SciPy in the environment, but that's okay
    _have_scipy = False


def _convert_to_vector(l):
    if isinstance(l, Vector):
        return l
    elif type(l) in (array.array, np.array, np.ndarray, list, tuple, xrange):
        return DenseVector(l)
    elif _have_scipy and scipy.sparse.issparse(l):
        assert l.shape[1] == 1, "Expected column vector"
        csc = l.tocsc()
        return SparseVector(l.shape[0], csc.indices, csc.data)
    else:
        raise TypeError("Cannot convert type %s into Vector" % type(l))
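The dispatch above can be sketched without PySpark or SciPy: dense inputs become float arrays, and a CSC column vector reduces to its size, index array, and value array. `to_dense` and `csc_to_sparse_parts` are illustrative names, not part of the MLlib API.

```python
import array

def to_dense(l):
    # Dense branch: lists, tuples, and array.array become float lists.
    if type(l) in (array.array, list, tuple):
        return [float(x) for x in l]
    raise TypeError("Cannot convert type %s into Vector" % type(l))

def csc_to_sparse_parts(shape, indices, data):
    # Sparse branch: a CSC column vector is just (size, indices, values).
    assert shape[1] == 1, "Expected column vector"
    return shape[0], list(indices), [float(v) for v in data]

assert to_dense((1, 2)) == [1.0, 2.0]
assert csc_to_sparse_parts((4, 1), [1, 3], [1.0, 5.5]) == (4, [1, 3], [1.0, 5.5])
```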


def _vector_size(v):
    """
    Returns the size of the vector.

    >>> _vector_size([1., 2., 3.])
    3
    >>> _vector_size((1., 2., 3.))
    3
    >>> _vector_size(array.array('d', [1., 2., 3.]))
    3
    >>> _vector_size(np.zeros(3))
    3
    >>> _vector_size(np.zeros((3, 1)))
    3
    >>> _vector_size(np.zeros((1, 3)))
    Traceback (most recent call last):
        ...
    ValueError: Cannot treat an ndarray of shape (1, 3) as a vector
    """
    if isinstance(v, Vector):
        return len(v)
    elif type(v) in (array.array, list, tuple, xrange):
        return len(v)
    elif type(v) == np.ndarray:
        if v.ndim == 1 or (v.ndim == 2 and v.shape[1] == 1):
            return len(v)
        else:
            raise ValueError("Cannot treat an ndarray of shape %s as a vector" % str(v.shape))
    elif _have_scipy and scipy.sparse.issparse(v):
        assert v.shape[1] == 1, "Expected column vector"
        return v.shape[0]
    else:
        raise TypeError("Cannot treat type %s as a vector" % type(v))


def _format_float(f, digits=4):
    s = str(round(f, digits))
    if '.' in s:
        s = s[:s.index('.') + 1 + digits]
    return s
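`_format_float` rounds and then hard-truncates the decimal part; a standalone copy (a sketch, not the MLlib function itself) makes the behavior easy to check:

```python
def format_float(f, digits=4):
    # Round first, then truncate to at most `digits` decimals.
    s = str(round(f, digits))
    if '.' in s:
        s = s[:s.index('.') + 1 + digits]
    return s

assert format_float(3.14159265) == '3.1416'
assert format_float(2.0) == '2.0'
```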


class VectorUDT(UserDefinedType):
    """
    SQL user-defined type (UDT) for Vector.
    """

    @classmethod
    def sqlType(cls):
        return StructType([
            StructField("type", ByteType(), False),
            StructField("size", IntegerType(), True),
            StructField("indices", ArrayType(IntegerType(), False), True),
            StructField("values", ArrayType(DoubleType(), False), True)])

    @classmethod
    def module(cls):
        return "pyspark.mllib.linalg"

    @classmethod
    def scalaUDT(cls):
        return "org.apache.spark.mllib.linalg.VectorUDT"

    def serialize(self, obj):
        if isinstance(obj, SparseVector):
            indices = [int(i) for i in obj.indices]
            values = [float(v) for v in obj.values]
            return (0, obj.size, indices, values)
        elif isinstance(obj, DenseVector):
            values = [float(v) for v in obj]
            return (1, None, None, values)
        else:
            raise TypeError("cannot serialize %r of type %r" % (obj, type(obj)))

    def deserialize(self, datum):
        assert len(datum) == 4, \
            "VectorUDT.deserialize given row with length %d but requires 4" % len(datum)
        tpe = datum[0]
        if tpe == 0:
            return SparseVector(datum[1], datum[2], datum[3])
        elif tpe == 1:
            return DenseVector(datum[3])
        else:
            raise ValueError("do not recognize type %r" % tpe)

    def simpleString(self):
        return "vector"
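The serialized row is a 4-tuple `(type, size, indices, values)` with type tag 0 for sparse and 1 for dense. A minimal round-trip of that encoding, independent of Spark SQL (the helper names are illustrative):

```python
def encode_sparse(size, indices, values):
    # Sparse rows carry their size plus parallel index/value lists.
    return (0, size, [int(i) for i in indices], [float(v) for v in values])

def encode_dense(values):
    # Dense rows need only the values; size and indices stay null.
    return (1, None, None, [float(v) for v in values])

def decode(datum):
    assert len(datum) == 4, "requires 4 fields"
    tpe = datum[0]
    if tpe == 0:
        return ("sparse", datum[1], datum[2], datum[3])
    elif tpe == 1:
        return ("dense", datum[3])
    raise ValueError("do not recognize type %r" % tpe)

assert decode(encode_sparse(4, [1, 3], [1.0, 5.5])) == ("sparse", 4, [1, 3], [1.0, 5.5])
assert decode(encode_dense([1.0, 2.0])) == ("dense", [1.0, 2.0])
```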


class Vector(object):

    __UDT__ = VectorUDT()

    """
    Abstract class for DenseVector and SparseVector
    """
    def toArray(self):
        """
        Convert the vector into a numpy.ndarray

        :return: numpy.ndarray
        """
        raise NotImplementedError


class DenseVector(Vector):
    """
    A dense vector represented by a value array. We use a numpy array
    for storage, and arithmetic is delegated to the underlying numpy
    array.

    >>> v = Vectors.dense([1.0, 2.0])
    >>> u = Vectors.dense([3.0, 4.0])
    >>> v + u
    DenseVector([4.0, 6.0])
    >>> 2 - v
    DenseVector([1.0, 0.0])
    >>> v / 2
    DenseVector([0.5, 1.0])
    >>> v * u
    DenseVector([3.0, 8.0])
    >>> u / v
    DenseVector([3.0, 2.0])
    >>> u % 2
    DenseVector([1.0, 0.0])
    """
    def __init__(self, ar):
        if isinstance(ar, bytes):
            ar = np.frombuffer(ar, dtype=np.float64)
        elif not isinstance(ar, np.ndarray):
            ar = np.array(ar, dtype=np.float64)
        if ar.dtype != np.float64:
            ar = ar.astype(np.float64)
        self.array = ar

    @staticmethod
    def parse(s):
        """
        Parse string representation back into the DenseVector.

        >>> DenseVector.parse(' [ 0.0,1.0,2.0, 3.0]')
        DenseVector([0.0, 1.0, 2.0, 3.0])
        """
        start = s.find('[')
        if start == -1:
            raise ValueError("Array should start with '['.")
        end = s.find(']')
        if end == -1:
            raise ValueError("Array should end with ']'.")
        s = s[start + 1: end]

        try:
            values = [float(val) for val in s.split(',')]
        except ValueError:
            raise ValueError("Unable to parse values from %s" % s)
        return DenseVector(values)

    def __reduce__(self):
        return DenseVector, (self.array.tostring(),)

    def numNonzeros(self):
        return np.count_nonzero(self.array)

    def norm(self, p):
        """
        Calculate the norm of a DenseVector.

        >>> a = DenseVector([0, -1, 2, -3])
        >>> a.norm(2)
        3.7...
        >>> a.norm(1)
        6.0
        """
        return np.linalg.norm(self.array, p)

    def dot(self, other):
        """
        Compute the dot product of two Vectors. We support
        (Numpy array, list, SparseVector, or SciPy sparse)
        and a target NumPy array that is either 1- or 2-dimensional.
        Equivalent to calling numpy.dot of the two vectors.

        >>> dense = DenseVector(array.array('d', [1., 2.]))
        >>> dense.dot(dense)
        5.0
        >>> dense.dot(SparseVector(2, [0, 1], [2., 1.]))
        4.0
        >>> dense.dot(range(1, 3))
        5.0
        >>> dense.dot(np.array(range(1, 3)))
        5.0
        >>> dense.dot([1.,])
        Traceback (most recent call last):
            ...
        AssertionError: dimension mismatch
        >>> dense.dot(np.reshape([1., 2., 3., 4.], (2, 2), order='F'))
        array([  5.,  11.])
        >>> dense.dot(np.reshape([1., 2., 3.], (3, 1), order='F'))
        Traceback (most recent call last):
            ...
        AssertionError: dimension mismatch
        """
        if type(other) == np.ndarray:
            if other.ndim > 1:
                assert len(self) == other.shape[0], "dimension mismatch"
            return np.dot(self.array, other)
        elif _have_scipy and scipy.sparse.issparse(other):
            assert len(self) == other.shape[0], "dimension mismatch"
            return other.transpose().dot(self.toArray())
        else:
            assert len(self) == _vector_size(other), "dimension mismatch"
            if isinstance(other, SparseVector):
                return other.dot(self)
            elif isinstance(other, Vector):
                return np.dot(self.toArray(), other.toArray())
            else:
                return np.dot(self.toArray(), other)
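The dense-times-sparse path above only needs the sparse side's indices and values; a pure-Python sketch of both paths (the helper names are illustrative, not MLlib API):

```python
def dense_dot(xs, ys):
    # Dense path: elementwise product, then sum.
    assert len(xs) == len(ys), "dimension mismatch"
    return sum(x * y for x, y in zip(xs, ys))

def sparse_dot(size, indices, values, dense):
    # Sparse path: only touch the positions that are actually stored.
    assert size == len(dense), "dimension mismatch"
    return sum(v * dense[i] for i, v in zip(indices, values))

assert dense_dot([1.0, 2.0], [1.0, 2.0]) == 5.0
assert sparse_dot(2, [0, 1], [2.0, 1.0], [1.0, 2.0]) == 4.0
```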

    def squared_distance(self, other):
        """
        Squared distance of two Vectors.

        >>> dense1 = DenseVector(array.array('d', [1., 2.]))
        >>> dense1.squared_distance(dense1)
        0.0
        >>> dense2 = np.array([2., 1.])
        >>> dense1.squared_distance(dense2)
        2.0
        >>> dense3 = [2., 1.]
        >>> dense1.squared_distance(dense3)
        2.0
        >>> sparse1 = SparseVector(2, [0, 1], [2., 1.])
        >>> dense1.squared_distance(sparse1)
        2.0
        >>> dense1.squared_distance([1.,])
        Traceback (most recent call last):
            ...
        AssertionError: dimension mismatch
        >>> dense1.squared_distance(SparseVector(1, [0,], [1.,]))
        Traceback (most recent call last):
            ...
        AssertionError: dimension mismatch
        """
        assert len(self) == _vector_size(other), "dimension mismatch"
        if isinstance(other, SparseVector):
            return other.squared_distance(self)
        elif _have_scipy and scipy.sparse.issparse(other):
            return _convert_to_vector(other).squared_distance(self)

        if isinstance(other, Vector):
            other = other.toArray()
        elif not isinstance(other, np.ndarray):
            other = np.array(other)
        diff = self.toArray() - other
        return np.dot(diff, diff)
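For the dense case the squared distance is just the dot product of the difference vector with itself, which is what the final two lines above compute; a stdlib-only sketch:

```python
def squared_distance(xs, ys):
    # diff . diff == sum of squared componentwise differences.
    assert len(xs) == len(ys), "dimension mismatch"
    diff = [x - y for x, y in zip(xs, ys)]
    return sum(d * d for d in diff)

assert squared_distance([1.0, 2.0], [2.0, 1.0]) == 2.0
assert squared_distance([1.0, 2.0], [1.0, 2.0]) == 0.0
```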

    def toArray(self):
        return self.array

    def __getitem__(self, item):
        return self.array[item]

    def __len__(self):
        return len(self.array)

    def __str__(self):
        return "[" + ",".join([str(v) for v in self.array]) + "]"

    def __repr__(self):
        return "DenseVector([%s])" % (', '.join(_format_float(i) for i in self.array))

    def __eq__(self, other):
        return isinstance(other, DenseVector) and np.array_equal(self.array, other.array)

    def __ne__(self, other):
        return not self == other

    def __getattr__(self, item):
        return getattr(self.array, item)

    def __neg__(self):
        # Unary, so it cannot go through the binary _delegate helper below.
        return DenseVector(-self.array)

    def _delegate(op):
        def func(self, other):
            if isinstance(other, DenseVector):
                other = other.array
            return DenseVector(getattr(self.array, op)(other))
        return func

    __add__ = _delegate("__add__")
    __sub__ = _delegate("__sub__")
    __mul__ = _delegate("__mul__")
    __div__ = _delegate("__div__")
    __truediv__ = _delegate("__truediv__")
    __mod__ = _delegate("__mod__")
    __radd__ = _delegate("__radd__")
    __rsub__ = _delegate("__rsub__")
    __rmul__ = _delegate("__rmul__")
    __rdiv__ = _delegate("__rdiv__")
    __rtruediv__ = _delegate("__rtruediv__")
    __rmod__ = _delegate("__rmod__")
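The `_delegate` trick works for any binary dunder: look up the same-named method on the wrapped data and rewrap the result. A self-contained version over plain float lists (class `Vec` is a hypothetical stand-in for `DenseVector`):

```python
class Vec(object):
    def __init__(self, values):
        self.values = [float(v) for v in values]

    def _delegate(op):
        # Forward the named dunder to each element and rewrap in Vec.
        def func(self, other):
            if isinstance(other, Vec):
                other = other.values
            return Vec([getattr(float(a), op)(b) for a, b in zip(self.values, other)])
        return func

    __add__ = _delegate("__add__")
    __mul__ = _delegate("__mul__")

v = Vec([1.0, 2.0])
u = Vec([3.0, 4.0])
assert (v + u).values == [4.0, 6.0]
assert (v * u).values == [3.0, 8.0]
```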
|
|
|
|
|
2014-09-19 18:01:11 -04:00
|
|
|
|
|
|
|
class SparseVector(Vector):
|
[WIP] SPARK-1430: Support sparse data in Python MLlib
This PR adds a SparseVector class in PySpark and updates all the regression, classification and clustering algorithms and models to support sparse data, similar to MLlib. I chose to add this class because SciPy is quite difficult to install in many environments (more so than NumPy), but I plan to add support for SciPy sparse vectors later too, and make the methods work transparently on objects of either type.
On the Scala side, we keep Python sparse vectors sparse and pass them to MLlib. We always return dense vectors from our models.
Some to-do items left:
- [x] Support SciPy's scipy.sparse matrix objects when SciPy is available. We can easily add a function to convert these to our own SparseVector.
- [x] MLlib currently uses a vector with one extra column on the left to represent what we call LabeledPoint in Scala. Do we really want this? It may get annoying once you deal with sparse data since you must add/subtract 1 to each feature index when training. We can remove this API in 1.0 and use tuples for labeling.
- [x] Explain how to use these in the Python MLlib docs.
CC @mengxr, @joshrosen
Author: Matei Zaharia <matei@databricks.com>
Closes #341 from mateiz/py-ml-update and squashes the following commits:
d52e763 [Matei Zaharia] Remove no-longer-needed slice code and handle review comments
ea5a25a [Matei Zaharia] Fix remaining uses of copyto() after merge
b9f97a3 [Matei Zaharia] Fix test
1e1bd0f [Matei Zaharia] Add MLlib logistic regression example in Python
88bc01f [Matei Zaharia] Clean up inheritance of LinearModel in Python, and expose its parametrs
37ab747 [Matei Zaharia] Fix some examples and docs due to changes in MLlib API
da0f27e [Matei Zaharia] Added a MLlib K-means example and updated docs to discuss sparse data
c48e85a [Matei Zaharia] Added some tests for passing lists as input, and added mllib/tests.py to run-tests script.
a07ba10 [Matei Zaharia] Fix some typos and calculation of initial weights
74eefe7 [Matei Zaharia] Added LabeledPoint class in Python
889dde8 [Matei Zaharia] Support scipy.sparse matrices in all our algorithms and models
ab244d1 [Matei Zaharia] Allow SparseVectors to be initialized using a dict
a5d6426 [Matei Zaharia] Add linalg.py to run-tests script
0e7a3d8 [Matei Zaharia] Keep vectors sparse in Java when reading LabeledPoints
eaee759 [Matei Zaharia] Update regression, classification and clustering models for sparse data
2abbb44 [Matei Zaharia] Further work to get linear models working with sparse data
154f45d [Matei Zaharia] Update docs, name some magic values
881fef7 [Matei Zaharia] Added a sparse vector in Python and made Java-Python format more compact
2014-04-15 23:33:24 -04:00
|
|
|
"""
|
|
|
|
A simple sparse vector class for passing data to MLlib. Users may
|
|
|
|
alternatively pass SciPy's {scipy.sparse} data types.
|
|
|
|
"""
|
|
|
|
def __init__(self, size, *args):
|
|
|
|
"""
|
|
|
|
Create a sparse vector, using either a dictionary, a list of
|
|
|
|
(index, value) pairs, or two separate arrays of indices and
|
|
|
|
values (sorted by index).
|
|
|
|
|
2014-10-07 21:09:27 -04:00
|
|
|
:param size: Size of the vector.
|
|
|
|
:param args: Non-zero entries, as a dictionary, list of tupes,
|
[WIP] SPARK-1430: Support sparse data in Python MLlib
This PR adds a SparseVector class in PySpark and updates all the regression, classification and clustering algorithms and models to support sparse data, similar to MLlib. I chose to add this class because SciPy is quite difficult to install in many environments (more so than NumPy), but I plan to add support for SciPy sparse vectors later too, and make the methods work transparently on objects of either type.
On the Scala side, we keep Python sparse vectors sparse and pass them to MLlib. We always return dense vectors from our models.
Some to-do items left:
- [x] Support SciPy's scipy.sparse matrix objects when SciPy is available. We can easily add a function to convert these to our own SparseVector.
- [x] MLlib currently uses a vector with one extra column on the left to represent what we call LabeledPoint in Scala. Do we really want this? It may get annoying once you deal with sparse data since you must add/subtract 1 to each feature index when training. We can remove this API in 1.0 and use tuples for labeling.
- [x] Explain how to use these in the Python MLlib docs.
CC @mengxr, @joshrosen
Author: Matei Zaharia <matei@databricks.com>
Closes #341 from mateiz/py-ml-update and squashes the following commits:
d52e763 [Matei Zaharia] Remove no-longer-needed slice code and handle review comments
ea5a25a [Matei Zaharia] Fix remaining uses of copyto() after merge
b9f97a3 [Matei Zaharia] Fix test
1e1bd0f [Matei Zaharia] Add MLlib logistic regression example in Python
88bc01f [Matei Zaharia] Clean up inheritance of LinearModel in Python, and expose its parametrs
37ab747 [Matei Zaharia] Fix some examples and docs due to changes in MLlib API
da0f27e [Matei Zaharia] Added a MLlib K-means example and updated docs to discuss sparse data
c48e85a [Matei Zaharia] Added some tests for passing lists as input, and added mllib/tests.py to run-tests script.
a07ba10 [Matei Zaharia] Fix some typos and calculation of initial weights
74eefe7 [Matei Zaharia] Added LabeledPoint class in Python
889dde8 [Matei Zaharia] Support scipy.sparse matrices in all our algorithms and models
ab244d1 [Matei Zaharia] Allow SparseVectors to be initialized using a dict
a5d6426 [Matei Zaharia] Add linalg.py to run-tests script
0e7a3d8 [Matei Zaharia] Keep vectors sparse in Java when reading LabeledPoints
eaee759 [Matei Zaharia] Update regression, classification and clustering models for sparse data
2abbb44 [Matei Zaharia] Further work to get linear models working with sparse data
154f45d [Matei Zaharia] Update docs, name some magic values
881fef7 [Matei Zaharia] Added a sparse vector in Python and made Java-Python format more compact
2014-04-15 23:33:24 -04:00
|
|
|
or two sorted lists containing indices and values.
|
|
|
|
|
2015-04-16 19:20:57 -04:00
|
|
|
>>> SparseVector(4, {1: 1.0, 3: 5.5})
|
|
|
|
SparseVector(4, {1: 1.0, 3: 5.5})
|
|
|
|
>>> SparseVector(4, [(1, 1.0), (3, 5.5)])
|
|
|
|
SparseVector(4, {1: 1.0, 3: 5.5})
|
|
|
|
>>> SparseVector(4, [1, 3], [1.0, 5.5])
|
|
|
|
SparseVector(4, {1: 1.0, 3: 5.5})
|
[WIP] SPARK-1430: Support sparse data in Python MLlib
This PR adds a SparseVector class in PySpark and updates all the regression, classification and clustering algorithms and models to support sparse data, similar to MLlib. I chose to add this class because SciPy is quite difficult to install in many environments (more so than NumPy), but I plan to add support for SciPy sparse vectors later too, and make the methods work transparently on objects of either type.
On the Scala side, we keep Python sparse vectors sparse and pass them to MLlib. We always return dense vectors from our models.
Some to-do items left:
- [x] Support SciPy's scipy.sparse matrix objects when SciPy is available. We can easily add a function to convert these to our own SparseVector.
- [x] MLlib currently uses a vector with one extra column on the left to represent what we call LabeledPoint in Scala. Do we really want this? It may get annoying once you deal with sparse data since you must add/subtract 1 to each feature index when training. We can remove this API in 1.0 and use tuples for labeling.
- [x] Explain how to use these in the Python MLlib docs.
CC @mengxr, @joshrosen
Author: Matei Zaharia <matei@databricks.com>
Closes #341 from mateiz/py-ml-update and squashes the following commits:
d52e763 [Matei Zaharia] Remove no-longer-needed slice code and handle review comments
ea5a25a [Matei Zaharia] Fix remaining uses of copyto() after merge
b9f97a3 [Matei Zaharia] Fix test
1e1bd0f [Matei Zaharia] Add MLlib logistic regression example in Python
88bc01f [Matei Zaharia] Clean up inheritance of LinearModel in Python, and expose its parametrs
37ab747 [Matei Zaharia] Fix some examples and docs due to changes in MLlib API
da0f27e [Matei Zaharia] Added a MLlib K-means example and updated docs to discuss sparse data
c48e85a [Matei Zaharia] Added some tests for passing lists as input, and added mllib/tests.py to run-tests script.
a07ba10 [Matei Zaharia] Fix some typos and calculation of initial weights
74eefe7 [Matei Zaharia] Added LabeledPoint class in Python
889dde8 [Matei Zaharia] Support scipy.sparse matrices in all our algorithms and models
ab244d1 [Matei Zaharia] Allow SparseVectors to be initialized using a dict
a5d6426 [Matei Zaharia] Add linalg.py to run-tests script
0e7a3d8 [Matei Zaharia] Keep vectors sparse in Java when reading LabeledPoints
eaee759 [Matei Zaharia] Update regression, classification and clustering models for sparse data
2abbb44 [Matei Zaharia] Further work to get linear models working with sparse data
154f45d [Matei Zaharia] Update docs, name some magic values
881fef7 [Matei Zaharia] Added a sparse vector in Python and made Java-Python format more compact
        """
        self.size = int(size)
        assert 1 <= len(args) <= 2, "must pass either 2 or 3 arguments"
        if len(args) == 1:
            pairs = args[0]
            if type(pairs) == dict:
                pairs = pairs.items()
            pairs = sorted(pairs)
            self.indices = np.array([p[0] for p in pairs], dtype=np.int32)
            self.values = np.array([p[1] for p in pairs], dtype=np.float64)
        else:
            if isinstance(args[0], bytes):
                assert isinstance(args[1], bytes), "values should be string too"
                if args[0]:
                    self.indices = np.frombuffer(args[0], np.int32)
                    self.values = np.frombuffer(args[1], np.float64)
                else:
                    # np.frombuffer() doesn't work well with empty string in older versions
                    self.indices = np.array([], dtype=np.int32)
                    self.values = np.array([], dtype=np.float64)
            else:
                self.indices = np.array(args[0], dtype=np.int32)
                self.values = np.array(args[1], dtype=np.float64)
            assert len(self.indices) == len(self.values), "index and value arrays not same length"
            for i in xrange(len(self.indices) - 1):
                if self.indices[i] >= self.indices[i + 1]:
                    raise TypeError("indices array must be sorted")
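The single-argument branch above normalizes either a dict or a list of (index, value) pairs into sorted parallel index/value arrays. A minimal standalone sketch of that normalization (illustrative only, not the pyspark code itself):

```python
import numpy as np

def normalize_pairs(pairs):
    # Mirror of SparseVector's single-argument handling: accept a dict or a
    # list of (index, value) pairs and build sorted parallel arrays.
    if isinstance(pairs, dict):
        pairs = pairs.items()
    pairs = sorted(pairs)
    indices = np.array([p[0] for p in pairs], dtype=np.int32)
    values = np.array([p[1] for p in pairs], dtype=np.float64)
    return indices, values

# A dict's items come back sorted by index regardless of insertion order.
inds, vals = normalize_pairs({3: 4.0, 1: 3.0})
```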

    def numNonzeros(self):
        return np.count_nonzero(self.values)

    def norm(self, p):
        """
        Calculate the norm of a SparseVector.

        >>> a = SparseVector(4, [0, 1], [3., -4.])
        >>> a.norm(1)
        7.0
        >>> a.norm(2)
        5.0
        """
        return np.linalg.norm(self.values, p)
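Note that norm() runs over the stored nonzero values only; for p > 0 this matches the norm of the full dense vector, because zero entries contribute nothing to a p-norm. A quick NumPy check of that equivalence (not pyspark code):

```python
import numpy as np

# Dense vector and its sparse nonzero values (indices 0 and 1, size 4).
dense = np.array([3.0, -4.0, 0.0, 0.0])
nonzeros = np.array([3.0, -4.0])

# Zero entries add nothing to a p-norm, so both forms agree.
assert np.linalg.norm(dense, 1) == np.linalg.norm(nonzeros, 1) == 7.0
assert np.linalg.norm(dense, 2) == np.linalg.norm(nonzeros, 2) == 5.0
```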

    def __reduce__(self):
        return (
            SparseVector,
            (self.size, self.indices.tostring(), self.values.tostring()))
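__reduce__ pickles the vector compactly as raw index/value bytes, which the constructor's bytes branch restores via np.frombuffer. The encoding round-trip can be sketched with plain NumPy (using tobytes(), the modern name for the tostring() alias used above):

```python
import numpy as np

indices = np.array([1, 3], dtype=np.int32)
values = np.array([3.0, 4.0], dtype=np.float64)

# Serialize to raw bytes and restore — the core of SparseVector pickling.
ind_bytes, val_bytes = indices.tobytes(), values.tobytes()
assert list(np.frombuffer(ind_bytes, np.int32)) == [1, 3]
assert list(np.frombuffer(val_bytes, np.float64)) == [3.0, 4.0]
```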

    @staticmethod
    def parse(s):
        """
        Parse a string representation back into the SparseVector.

        >>> SparseVector.parse(' (4, [0,1 ],[ 4.0,5.0] )')
        SparseVector(4, {0: 4.0, 1: 5.0})
        """
        start = s.find('(')
        if start == -1:
            raise ValueError("Tuple should start with '('")
        end = s.find(')')
        if end == -1:
            raise ValueError("Tuple should end with ')'")
        s = s[start + 1: end].strip()

        size = s[: s.find(',')]
        try:
            size = int(size)
        except ValueError:
            raise ValueError("Cannot parse size %s." % size)

        ind_start = s.find('[')
        if ind_start == -1:
            raise ValueError("Indices array should start with '['.")
        ind_end = s.find(']')
        if ind_end == -1:
            raise ValueError("Indices array should end with ']'.")
        new_s = s[ind_start + 1: ind_end]
        ind_list = new_s.split(',')
        try:
            indices = [int(ind) for ind in ind_list]
        except ValueError:
            raise ValueError("Unable to parse indices from %s." % new_s)
        s = s[ind_end + 1:].strip()

        val_start = s.find('[')
        if val_start == -1:
            raise ValueError("Values array should start with '['.")
        val_end = s.find(']')
        if val_end == -1:
            raise ValueError("Values array should end with ']'.")
        val_list = s[val_start + 1: val_end].split(',')
        try:
            values = [float(val) for val in val_list]
        except ValueError:
            raise ValueError("Unable to parse values from %s." % s)
        return SparseVector(size, indices, values)
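For comparison, the grammar parse() accepts — "(size, [indices], [values])" with arbitrary whitespace — can be sketched in a few lines with a regex. This is a hypothetical helper, not the pyspark implementation; the method above deliberately uses successive find() calls so each error message can name the missing delimiter:

```python
import re

def parse_sparse(s):
    # Extract (size, indices, values) from the same textual form
    # SparseVector.parse handles, e.g. ' (4, [0,1 ],[ 4.0,5.0] )'.
    m = re.match(r"\s*\(\s*(\d+)\s*,\s*\[([^\]]*)\]\s*,\s*\[([^\]]*)\]\s*\)\s*$", s)
    if m is None:
        raise ValueError("Cannot parse %r" % (s,))
    size = int(m.group(1))
    indices = [int(x) for x in m.group(2).split(',')]
    values = [float(x) for x in m.group(3).split(',')]
    return size, indices, values

assert parse_sparse(' (4, [0,1 ],[ 4.0,5.0] )') == (4, [0, 1], [4.0, 5.0])
```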

    def dot(self, other):
        """
        Dot product with a SparseVector or 1- or 2-dimensional Numpy array.

        >>> a = SparseVector(4, [1, 3], [3.0, 4.0])
        >>> a.dot(a)
        25.0
        >>> a.dot(array.array('d', [1., 2., 3., 4.]))
        22.0
        >>> b = SparseVector(4, [2, 4], [1.0, 2.0])
        >>> a.dot(b)
        0.0
        >>> a.dot(np.array([[1, 1], [2, 2], [3, 3], [4, 4]]))
        array([ 22.,  22.])
        >>> a.dot([1., 2., 3.])
        Traceback (most recent call last):
            ...
        AssertionError: dimension mismatch
        >>> a.dot(np.array([1., 2.]))
        Traceback (most recent call last):
            ...
        AssertionError: dimension mismatch
        >>> a.dot(DenseVector([1., 2.]))
        Traceback (most recent call last):
            ...
        AssertionError: dimension mismatch
        >>> a.dot(np.zeros((3, 2)))
        Traceback (most recent call last):
            ...
        AssertionError: dimension mismatch
        """
        if type(other) == np.ndarray:
            if other.ndim == 2:
                results = [self.dot(other[:, i]) for i in xrange(other.shape[1])]
                return np.array(results)
            elif other.ndim > 2:
                raise ValueError("Cannot call dot with %d-dimensional array" % other.ndim)

        assert len(self) == _vector_size(other), "dimension mismatch"

        if type(other) in (np.ndarray, array.array, DenseVector):
            result = 0.0
            for i in xrange(len(self.indices)):
                result += self.values[i] * other[self.indices[i]]
            return result

        elif type(other) is SparseVector:
            result = 0.0
            i, j = 0, 0
            while i < len(self.indices) and j < len(other.indices):
                if self.indices[i] == other.indices[j]:
                    result += self.values[i] * other.values[j]
                    i += 1
                    j += 1
                elif self.indices[i] < other.indices[j]:
                    i += 1
                else:
                    j += 1
            return result

        else:
            return self.dot(_convert_to_vector(other))
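The SparseVector/SparseVector branch is a sorted-merge join over the two index arrays, costing O(nnz_a + nnz_b) rather than O(size). A standalone check of that scheme (a sketch, independent of pyspark):

```python
def merge_dot(ind_a, val_a, ind_b, val_b):
    # Walk both sorted index lists with two pointers, multiplying values
    # only where the indices coincide — the same loop dot() uses above.
    result, i, j = 0.0, 0, 0
    while i < len(ind_a) and j < len(ind_b):
        if ind_a[i] == ind_b[j]:
            result += val_a[i] * val_b[j]
            i += 1
            j += 1
        elif ind_a[i] < ind_b[j]:
            i += 1
        else:
            j += 1
    return result

# Matches the doctest above: shared indices 1 and 3 give 3*3 + 4*4 = 25.
assert merge_dot([1, 3], [3.0, 4.0], [1, 3], [3.0, 4.0]) == 25.0
# Disjoint index sets contribute nothing.
assert merge_dot([1, 3], [3.0, 4.0], [0, 2], [1.0, 2.0]) == 0.0
```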
[WIP] SPARK-1430: Support sparse data in Python MLlib
This PR adds a SparseVector class in PySpark and updates all the regression, classification and clustering algorithms and models to support sparse data, similar to MLlib. I chose to add this class because SciPy is quite difficult to install in many environments (more so than NumPy), but I plan to add support for SciPy sparse vectors later too, and make the methods work transparently on objects of either type.
On the Scala side, we keep Python sparse vectors sparse and pass them to MLlib. We always return dense vectors from our models.
Some to-do items left:
- [x] Support SciPy's scipy.sparse matrix objects when SciPy is available. We can easily add a function to convert these to our own SparseVector.
- [x] MLlib currently uses a vector with one extra column on the left to represent what we call LabeledPoint in Scala. Do we really want this? It may get annoying once you deal with sparse data since you must add/subtract 1 to each feature index when training. We can remove this API in 1.0 and use tuples for labeling.
- [x] Explain how to use these in the Python MLlib docs.
CC @mengxr, @joshrosen
Author: Matei Zaharia <matei@databricks.com>
Closes #341 from mateiz/py-ml-update and squashes the following commits:
d52e763 [Matei Zaharia] Remove no-longer-needed slice code and handle review comments
ea5a25a [Matei Zaharia] Fix remaining uses of copyto() after merge
b9f97a3 [Matei Zaharia] Fix test
1e1bd0f [Matei Zaharia] Add MLlib logistic regression example in Python
88bc01f [Matei Zaharia] Clean up inheritance of LinearModel in Python, and expose its parametrs
37ab747 [Matei Zaharia] Fix some examples and docs due to changes in MLlib API
da0f27e [Matei Zaharia] Added a MLlib K-means example and updated docs to discuss sparse data
c48e85a [Matei Zaharia] Added some tests for passing lists as input, and added mllib/tests.py to run-tests script.
a07ba10 [Matei Zaharia] Fix some typos and calculation of initial weights
74eefe7 [Matei Zaharia] Added LabeledPoint class in Python
889dde8 [Matei Zaharia] Support scipy.sparse matrices in all our algorithms and models
ab244d1 [Matei Zaharia] Allow SparseVectors to be initialized using a dict
a5d6426 [Matei Zaharia] Add linalg.py to run-tests script
0e7a3d8 [Matei Zaharia] Keep vectors sparse in Java when reading LabeledPoints
eaee759 [Matei Zaharia] Update regression, classification and clustering models for sparse data
2abbb44 [Matei Zaharia] Further work to get linear models working with sparse data
154f45d [Matei Zaharia] Update docs, name some magic values
881fef7 [Matei Zaharia] Added a sparse vector in Python and made Java-Python format more compact
2014-04-15 23:33:24 -04:00
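Commit ab244d1 above lets a SparseVector be initialized from a dict. A minimal plain-Python sketch of that normalization, with no PySpark dependency (`pairs_from_dict` is a hypothetical helper name, not PySpark API): the dict entries are sorted by index and split into the parallel indices/values arrays the class stores.

```python
# Hypothetical helper (not part of PySpark): normalize a {index: value} dict
# into the sorted parallel (indices, values) lists a SparseVector stores.
def pairs_from_dict(d):
    items = sorted(d.items())               # sort entries by feature index
    indices = [int(i) for i, _ in items]
    values = [float(v) for _, v in items]
    return indices, values

indices, values = pairs_from_dict({3: 5.5, 1: 1.0})
# indices == [1, 3], values == [1.0, 5.5]
```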
    def squared_distance(self, other):
        """
        Squared distance from a SparseVector or 1-dimensional NumPy array.

        >>> a = SparseVector(4, [1, 3], [3.0, 4.0])
        >>> a.squared_distance(a)
        0.0
        >>> a.squared_distance(array.array('d', [1., 2., 3., 4.]))
        11.0
        >>> a.squared_distance(np.array([1., 2., 3., 4.]))
        11.0
        >>> b = SparseVector(4, [2], [1.0])
        >>> a.squared_distance(b)
        26.0
        >>> b.squared_distance(a)
        26.0
        >>> b.squared_distance([1., 2.])
        Traceback (most recent call last):
            ...
        AssertionError: dimension mismatch
        >>> b.squared_distance(SparseVector(3, [1], [1.0]))
        Traceback (most recent call last):
            ...
        AssertionError: dimension mismatch
        """
        assert len(self) == _vector_size(other), "dimension mismatch"
        if type(other) in (list, array.array, DenseVector, np.ndarray):
            if isinstance(other, np.ndarray) and other.ndim != 1:
                raise ValueError("Cannot call squared_distance with %d-dimensional array" %
                                 other.ndim)
            result = 0.0
            j = 0   # index into our own array
            for i in xrange(len(other)):
                if j < len(self.indices) and self.indices[j] == i:
                    diff = self.values[j] - other[i]
                    result += diff * diff
                    j += 1
                else:
                    result += other[i] * other[i]
            return result

        elif type(other) is SparseVector:
            result = 0.0
            i, j = 0, 0
            while i < len(self.indices) and j < len(other.indices):
                if self.indices[i] == other.indices[j]:
                    diff = self.values[i] - other.values[j]
                    result += diff * diff
                    i += 1
                    j += 1
                elif self.indices[i] < other.indices[j]:
                    result += self.values[i] * self.values[i]
                    i += 1
                else:
                    result += other.values[j] * other.values[j]
                    j += 1
            while i < len(self.indices):
                result += self.values[i] * self.values[i]
                i += 1
            while j < len(other.indices):
                result += other.values[j] * other.values[j]
                j += 1
            return result
        else:
            return self.squared_distance(_convert_to_vector(other))
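The sparse-sparse branch of `squared_distance` is a two-pointer merge over the two sorted index arrays, so it costs O(nnz_a + nnz_b) instead of O(size). A standalone sketch of the same merge, with no PySpark dependency (`sparse_squared_distance` is an illustrative name, not PySpark API):

```python
# Illustrative two-pointer merge over two sparse vectors given as sorted
# (indices, values) pairs; mirrors the sparse-sparse branch above.
def sparse_squared_distance(ia, va, ib, vb):
    result, i, j = 0.0, 0, 0
    while i < len(ia) and j < len(ib):
        if ia[i] == ib[j]:                  # index present in both vectors
            d = va[i] - vb[j]
            result += d * d
            i += 1
            j += 1
        elif ia[i] < ib[j]:                 # index only in the first vector
            result += va[i] * va[i]
            i += 1
        else:                               # index only in the second vector
            result += vb[j] * vb[j]
            j += 1
    # drain whichever side still has entries
    result += sum(v * v for v in va[i:])
    result += sum(v * v for v in vb[j:])
    return result

# dense a = [0, 3, 0, 4], dense b = [0, 0, 1, 0] -> 9 + 1 + 16 = 26.0
d = sparse_squared_distance([1, 3], [3.0, 4.0], [2], [1.0])
```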
[SPARK-2850] [SPARK-2626] [mllib] MLlib stats examples + small fixes
Added examples for statistical summarization:
* Scala: StatisticalSummary.scala
** Tests: correlation, MultivariateOnlineSummarizer
* python: statistical_summary.py
** Tests: correlation (since MultivariateOnlineSummarizer has no Python API)
Added examples for random and sampled RDDs:
* Scala: RandomAndSampledRDDs.scala
* python: random_and_sampled_rdds.py
* Both test:
** RandomRDDGenerators.normalRDD, normalVectorRDD
** RDD.sample, takeSample, sampleByKey
Added sc.stop() to all examples.
CorrelationSuite.scala
* Added 1 test for RDDs with only 1 value
RowMatrix.scala
* numCols(): Added check for numRows = 0, with error message.
* computeCovariance(): Added check for numRows <= 1, with error message.
Python SparseVector (pyspark/mllib/linalg.py)
* Added toDense() function
python/run-tests script
* Added stat.py (doc test)
CC: mengxr dorx Main changes were examples to show usage across APIs.
Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
Closes #1878 from jkbradley/mllib-stats-api-check and squashes the following commits:
ea5c047 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
dafebe2 [Joseph K. Bradley] Bug fixes for examples SampledRDDs.scala and sampled_rdds.py: Check for division by 0 and for missing key in maps.
8d1e555 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
60c72d9 [Joseph K. Bradley] Fixed stat.py doc test to work for Python versions printing nan or NaN.
b20d90a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
4e5d15e [Joseph K. Bradley] Changed pyspark/mllib/stat.py doc tests to use NaN instead of nan.
32173b7 [Joseph K. Bradley] Stats examples update.
c8c20dc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
cf70b07 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
0b7cec3 [Joseph K. Bradley] Small updates based on code review. Renamed statistical_summary.py to correlations.py
ab48f6e [Joseph K. Bradley] RowMatrix.scala * numCols(): Added check for numRows = 0, with error message. * computeCovariance(): Added check for numRows <= 1, with error message.
65e4ebc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
8195c78 [Joseph K. Bradley] Added examples for random and sampled RDDs: * Scala: RandomAndSampledRDDs.scala * python: random_and_sampled_rdds.py * Both test: ** RandomRDDGenerators.normalRDD, normalVectorRDD ** RDD.sample, takeSample, sampleByKey
064985b [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
ee918e9 [Joseph K. Bradley] Added examples for statistical summarization: * Scala: StatisticalSummary.scala ** Tests: correlation, MultivariateOnlineSummarizer * python: statistical_summary.py ** Tests: correlation (since MultivariateOnlineSummarizer has no Python API)
2014-08-18 21:01:39 -04:00
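The `toDense()` addition mentioned in the commit above relies on the same scatter pattern as `toArray`: allocate a zero array and write the stored values at the stored indices with NumPy fancy indexing. A hedged sketch of that pattern, assuming NumPy is available (`dense_from_sparse` is an illustrative name, not PySpark API):

```python
import numpy as np

# Illustrative helper (not PySpark API): materialize a sparse
# (size, indices, values) triple as a dense 1-D float64 array.
def dense_from_sparse(size, indices, values):
    arr = np.zeros((size,), dtype=np.float64)
    arr[indices] = values                   # scatter the nonzero entries
    return arr

dense = dense_from_sparse(4, [1, 3], [3.0, 4.0])
# dense.tolist() == [0.0, 3.0, 0.0, 4.0]
```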
    def toArray(self):
        """
        Returns a copy of this SparseVector as a 1-dimensional NumPy array.
        """
        arr = np.zeros((self.size,), dtype=np.float64)
        arr[self.indices] = self.values
        return arr

    def __len__(self):
        return self.size
    def __str__(self):
        inds = "[" + ",".join([str(i) for i in self.indices]) + "]"
        vals = "[" + ",".join([str(v) for v in self.values]) + "]"
        return "(" + ",".join((str(self.size), inds, vals)) + ")"
    def __repr__(self):
        inds = self.indices
        vals = self.values
        entries = ", ".join(["{0}: {1}".format(inds[i], _format_float(vals[i]))
                             for i in xrange(len(inds))])
        return "SparseVector({0}, {{{1}}})".format(self.size, entries)
|
|
|
|
|
|
|
|
def __eq__(self, other):
|
|
|
|
"""
|
|
|
|
Test SparseVectors for equality.
|
|
|
|
|
|
|
|
>>> v1 = SparseVector(4, [(1, 1.0), (3, 5.5)])
|
|
|
|
>>> v2 = SparseVector(4, [(1, 1.0), (3, 5.5)])
|
|
|
|
>>> v1 == v2
|
|
|
|
True
|
|
|
|
>>> v1 != v2
|
|
|
|
False
|
|
|
|
"""
|
|
|
|
return (isinstance(other, self.__class__)
|
2014-05-25 20:15:01 -04:00
|
|
|
and other.size == self.size
|
2014-11-24 19:37:14 -05:00
|
|
|
and np.array_equal(other.indices, self.indices)
|
|
|
|
and np.array_equal(other.values, self.values))
|
    def __getitem__(self, index):
        inds = self.indices
        vals = self.values
        if not isinstance(index, int):
            raise TypeError(
                "Indices must be of type integer, got type %s" % type(index))
        if index < 0:
            index += self.size
        if index >= self.size or index < 0:
            raise ValueError("Index %d out of bounds." % index)

        insert_index = np.searchsorted(inds, index)
        if insert_index >= inds.size:
            # index is past the last stored entry, so it is an implicit zero;
            # without this guard, inds[insert_index] would raise IndexError.
            return 0.
        row_ind = inds[insert_index]
        if row_ind == index:
            return vals[insert_index]
        return 0.
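The lookup above relies on `self.indices` staying sorted so that a binary search can find an entry. A standalone NumPy sketch of the same lookup logic (the `sparse_get` helper is hypothetical, for illustration only, not part of this class):

```python
import numpy as np

def sparse_get(size, indices, values, index):
    # Mirror of SparseVector.__getitem__: binary-search the sorted index
    # array; return the stored value on a hit and 0.0 on a miss.
    if index < 0:
        index += size
    if index < 0 or index >= size:
        raise ValueError("Index %d out of bounds." % index)
    pos = np.searchsorted(indices, index)
    if pos < len(indices) and indices[pos] == index:
        return float(values[pos])
    return 0.0

inds = np.array([1, 3])
vals = np.array([1.0, 5.5])
print(sparse_get(4, inds, vals, 3))   # stored entry
print(sparse_get(4, inds, vals, 2))   # implicit zero
print(sparse_get(4, inds, vals, -1))  # negative index wraps to 3
```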
    def __ne__(self, other):
        return not self.__eq__(other)


class Vectors(object):
    """
    Factory methods for working with vectors. Note that dense vectors
    are simply represented as NumPy array objects, so there is no need
    to convert them for use in MLlib. For sparse vectors, the factory
    methods in this class create an MLlib-compatible type, or users
    can pass in SciPy's C{scipy.sparse} column vectors.
    """

    @staticmethod
    def sparse(size, *args):
        """
        Create a sparse vector, using either a dictionary, a list of
        (index, value) pairs, or two separate arrays of indices and
        values (sorted by index).

        :param size: Size of the vector.
        :param args: Non-zero entries, as a dictionary, list of tuples,
                     or two sorted lists containing indices and values.

        >>> Vectors.sparse(4, {1: 1.0, 3: 5.5})
        SparseVector(4, {1: 1.0, 3: 5.5})
        >>> Vectors.sparse(4, [(1, 1.0), (3, 5.5)])
        SparseVector(4, {1: 1.0, 3: 5.5})
        >>> Vectors.sparse(4, [1, 3], [1.0, 5.5])
        SparseVector(4, {1: 1.0, 3: 5.5})
        """
        return SparseVector(size, *args)

    @staticmethod
    def dense(elements):
        """
        Create a dense vector of 64-bit floats from a Python list. Always
        returns a NumPy array.

        >>> Vectors.dense([1, 2, 3])
        DenseVector([1.0, 2.0, 3.0])
        """
        return DenseVector(elements)
    @staticmethod
    def stringify(vector):
        """
        Converts a vector into a string, which can be recognized by
        Vectors.parse().

        >>> Vectors.stringify(Vectors.sparse(2, [1], [1.0]))
        '(2,[1],[1.0])'
        >>> Vectors.stringify(Vectors.dense([0.0, 1.0]))
        '[0.0,1.0]'
        """
        return str(vector)

    @staticmethod
    def squared_distance(v1, v2):
        """
        Squared distance between two vectors.
        v1 and v2 can be of type SparseVector, DenseVector, np.ndarray
        or array.array.

        >>> a = Vectors.sparse(4, [(0, 1), (3, 4)])
        >>> b = Vectors.dense([2, 5, 4, 1])
        >>> a.squared_distance(b)
        51.0
        """
        v1, v2 = _convert_to_vector(v1), _convert_to_vector(v2)
        return v1.squared_distance(v2)

    @staticmethod
    def norm(vector, p):
        """
        Find the p-norm of the given vector.
        """
        return _convert_to_vector(vector).norm(p)

    @staticmethod
    def parse(s):
        """Parse a string representation back into the Vector.

        >>> Vectors.parse('[2,1,2 ]')
        DenseVector([2.0, 1.0, 2.0])
        >>> Vectors.parse(' ( 100, [0], [2])')
        SparseVector(100, {0: 2.0})
        """
        if s.find('(') == -1 and s.find('[') != -1:
            return DenseVector.parse(s)
        elif s.find('(') != -1:
            return SparseVector.parse(s)
        else:
            raise ValueError(
                "Cannot find tokens '[' or '(' from the input string.")

    @staticmethod
    def zeros(size):
        return DenseVector(np.zeros(size))
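`Vectors.norm` delegates to the underlying vector's `norm(p)`; for dense data the results agree with NumPy's `np.linalg.norm`, as this small standalone sketch (illustrative only, not the MLlib code path) checks:

```python
import numpy as np

v = np.array([3.0, -4.0])
# Common p-norms of the same vector, computed with NumPy directly.
euclidean = np.linalg.norm(v, 2)    # sqrt(9 + 16) = 5.0
l1 = np.linalg.norm(v, 1)           # |3| + |-4| = 7.0
linf = np.linalg.norm(v, np.inf)    # max(|3|, |-4|) = 4.0
print(euclidean, l1, linf)
```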


class Matrix(object):
    """
    Represents a local matrix.
    """

    def __init__(self, numRows, numCols, isTransposed=False):
        self.numRows = numRows
        self.numCols = numCols
        self.isTransposed = isTransposed

    def toArray(self):
        """
        Returns its elements in a NumPy ndarray.
        """
        raise NotImplementedError

    @staticmethod
    def _convert_to_array(array_like, dtype):
        """
        Convert Matrix attributes which are array-like or buffer to array.
        """
        if isinstance(array_like, bytes):
            return np.frombuffer(array_like, dtype=dtype)
        return np.asarray(array_like, dtype=dtype)


class DenseMatrix(Matrix):
    """
    Column-major dense matrix.
    """
    def __init__(self, numRows, numCols, values, isTransposed=False):
        Matrix.__init__(self, numRows, numCols, isTransposed)
        values = self._convert_to_array(values, np.float64)
        assert len(values) == numRows * numCols
        self.values = values

    def __reduce__(self):
        return DenseMatrix, (
            self.numRows, self.numCols, self.values.tostring(),
            int(self.isTransposed))

    def toArray(self):
        """
        Return a numpy.ndarray

        >>> m = DenseMatrix(2, 2, range(4))
        >>> m.toArray()
        array([[ 0.,  2.],
               [ 1.,  3.]])
        """
        if self.isTransposed:
            return np.asfortranarray(
                self.values.reshape((self.numRows, self.numCols)))
        else:
            return self.values.reshape((self.numRows, self.numCols), order='F')
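`DenseMatrix` stores `values` in column-major (Fortran) order, which the `order='F'` reshape in `toArray` unpacks. A minimal NumPy sketch of that layout, independent of this class:

```python
import numpy as np

# Column-major storage: values [0, 1, 2, 3] for a 2x2 matrix fill the
# first column top-to-bottom, then the second column.
num_rows, num_cols = 2, 2
values = np.arange(4, dtype=np.float64)
m = values.reshape((num_rows, num_cols), order='F')
print(m)
# Entry (i, j) lives at flat offset i + j * num_rows.
assert m[0, 1] == values[0 + 1 * num_rows]
```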
    def toSparse(self):
        """Convert to SparseMatrix"""
        if self.isTransposed:
            values = np.ravel(self.toArray(), order='F')
        else:
            values = self.values
        indices = np.nonzero(values)[0]
        colCounts = np.bincount(indices // self.numRows)
        colPtrs = np.cumsum(np.hstack(
            (0, colCounts, np.zeros(self.numCols - colCounts.size))))
        values = values[indices]
        rowIndices = indices % self.numRows

        return SparseMatrix(self.numRows, self.numCols, colPtrs, rowIndices, values)
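The `bincount`/`cumsum` step in `toSparse` builds the CSC column pointers from the flat column-major positions of the nonzeros. The same construction on a concrete 2x3 example (a standalone sketch with made-up data):

```python
import numpy as np

# 2x3 matrix in column-major order: columns are [0, 5], [0, 0], [7, 8].
num_rows, num_cols = 2, 3
values = np.array([0.0, 5.0, 0.0, 0.0, 7.0, 8.0])
nz = np.nonzero(values)[0]                 # flat positions [1, 4, 5]
col_counts = np.bincount(nz // num_rows)   # nonzeros per column: [1, 0, 2]
col_ptrs = np.cumsum(np.hstack(
    (0, col_counts, np.zeros(num_cols - col_counts.size))))
row_indices = nz % num_rows
print(col_ptrs)      # [0. 1. 1. 3.]
print(row_indices)   # [1 0 1]
print(values[nz])    # [5. 7. 8.]
```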
    def __getitem__(self, indices):
        i, j = indices
        if i < 0 or i >= self.numRows:
            raise ValueError("Row index %d is out of range [0, %d)"
                             % (i, self.numRows))
        if j < 0 or j >= self.numCols:
            raise ValueError("Column index %d is out of range [0, %d)"
                             % (j, self.numCols))

        if self.isTransposed:
            return self.values[i * self.numCols + j]
        else:
            return self.values[i + j * self.numRows]
    def __eq__(self, other):
        if (not isinstance(other, DenseMatrix) or
                self.numRows != other.numRows or
                self.numCols != other.numCols):
            return False

        self_values = np.ravel(self.toArray(), order='F')
        other_values = np.ravel(other.toArray(), order='F')
        return all(self_values == other_values)


class SparseMatrix(Matrix):
    """Sparse Matrix stored in CSC format."""
    def __init__(self, numRows, numCols, colPtrs, rowIndices, values,
                 isTransposed=False):
        Matrix.__init__(self, numRows, numCols, isTransposed)
        self.colPtrs = self._convert_to_array(colPtrs, np.int32)
        self.rowIndices = self._convert_to_array(rowIndices, np.int32)
        self.values = self._convert_to_array(values, np.float64)

        if self.isTransposed:
            if self.colPtrs.size != numRows + 1:
                raise ValueError("Expected colPtrs of size %d, got %d."
                                 % (numRows + 1, self.colPtrs.size))
        else:
            if self.colPtrs.size != numCols + 1:
                raise ValueError("Expected colPtrs of size %d, got %d."
                                 % (numCols + 1, self.colPtrs.size))
        if self.rowIndices.size != self.values.size:
            raise ValueError("Expected rowIndices of length %d, got %d."
                             % (self.values.size, self.rowIndices.size))

    def __reduce__(self):
        return SparseMatrix, (
            self.numRows, self.numCols, self.colPtrs.tostring(),
            self.rowIndices.tostring(), self.values.tostring(),
            int(self.isTransposed))
    def __getitem__(self, indices):
        i, j = indices
        if i < 0 or i >= self.numRows:
            raise ValueError("Row index %d is out of range [0, %d)"
                             % (i, self.numRows))
        if j < 0 or j >= self.numCols:
            raise ValueError("Column index %d is out of range [0, %d)"
                             % (j, self.numCols))

        # If a CSR matrix is given, then the row index should be searched
        # for in colPtrs, and the column index should be searched for in the
        # corresponding slice obtained from rowIndices.
        if self.isTransposed:
            j, i = i, j

        colStart = self.colPtrs[j]
        colEnd = self.colPtrs[j + 1]
        nz = self.rowIndices[colStart: colEnd]
        ind = np.searchsorted(nz, i) + colStart
        if ind < colEnd and self.rowIndices[ind] == i:
            return self.values[ind]
        else:
            return 0.0
|
|
|
|
|
|
|
def toArray(self):
|
|
|
|
"""
|
|
|
|
Return an numpy.ndarray
|
|
|
|
"""
|
|
|
|
A = np.zeros((self.numRows, self.numCols), dtype=np.float64, order='F')
|
|
|
|
for k in xrange(self.colPtrs.size - 1):
|
|
|
|
startptr = self.colPtrs[k]
|
|
|
|
endptr = self.colPtrs[k + 1]
|
|
|
|
if self.isTransposed:
|
|
|
|
A[k, self.rowIndices[startptr:endptr]] = self.values[startptr:endptr]
|
|
|
|
else:
|
|
|
|
A[self.rowIndices[startptr:endptr], k] = self.values[startptr:endptr]
|
|
|
|
return A
|
|
|
|
|
|
|
|
def toDense(self):
|
2015-04-21 17:36:50 -04:00
|
|
|
densevals = np.ravel(self.toArray(), order='F')
|
2015-04-10 02:10:13 -04:00
|
|
|
return DenseMatrix(self.numRows, self.numCols, densevals)
|
|
|
|
|
|
|
|
# TODO: More efficient implementation:
|
|
|
|
def __eq__(self, other):
|
2015-05-05 10:53:11 -04:00
|
|
|
return np.all(self.toArray() == other.toArray())
|
2015-04-10 02:10:13 -04:00
|
|
|
|
|
|
|
|
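The CSC lookup in `__getitem__` can be exercised standalone. The sketch below reimplements the same binary search over a small hand-built matrix; the arrays and the `csc_get` helper are illustrative, not part of the Spark API:

```python
import numpy as np

# A 3x3 matrix in CSC form (column-major compressed):
#   [[1, 0, 4],
#    [0, 3, 5],
#    [2, 0, 6]]
colPtrs = np.array([0, 2, 3, 6], dtype=np.int32)       # column k spans values[colPtrs[k]:colPtrs[k+1]]
rowIndices = np.array([0, 2, 1, 0, 1, 2], dtype=np.int32)
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])


def csc_get(i, j):
    """Same binary-search lookup as SparseMatrix.__getitem__ (non-transposed)."""
    colStart, colEnd = colPtrs[j], colPtrs[j + 1]
    nz = rowIndices[colStart:colEnd]          # sorted row indices of column j
    ind = np.searchsorted(nz, i) + colStart   # candidate position in values
    if ind < colEnd and rowIndices[ind] == i:
        return values[ind]
    return 0.0                                # implicit zero


print(csc_get(2, 0))  # 2.0, a stored entry
print(csc_get(1, 0))  # 0.0, an implicit zero
```

Note that absent entries cost one `searchsorted` over a single column slice, so a lookup is O(log nnz(column)) rather than O(nnz).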
[SPARK-3964] [MLlib] [PySpark] add Hypothesis test Python API
```
pyspark.mllib.stat.Statistics.chiSqTest(observed, expected=None)
:: Experimental ::
If `observed` is Vector, conduct Pearson's chi-squared goodness
of fit test of the observed data against the expected distribution,
or against the uniform distribution (by default), with each category
having an expected frequency of `1 / len(observed)`.
(Note: `observed` cannot contain negative values)
If `observed` is matrix, conduct Pearson's independence test on the
input contingency matrix, which cannot contain negative entries or
columns or rows that sum up to 0.
If `observed` is an RDD of LabeledPoint, conduct Pearson's independence
test for every feature against the label across the input RDD.
For each feature, the (feature, label) pairs are converted into a
contingency matrix for which the chi-squared statistic is computed.
All label and feature values must be categorical.
:param observed: it could be a vector containing the observed categorical
counts/relative frequencies, or the contingency matrix
(containing either counts or relative frequencies),
or an RDD of LabeledPoint containing the labeled dataset
with categorical features. Real-valued features will be
treated as categorical for each distinct value.
:param expected: Vector containing the expected categorical counts/relative
frequencies. `expected` is rescaled if the `expected` sum
differs from the `observed` sum.
:return: ChiSquaredTest object containing the test statistic, degrees
of freedom, p-value, the method used, and the null hypothesis.
```
Author: Davies Liu <davies@databricks.com>
Closes #3091 from davies/his and squashes the following commits:
145d16c [Davies Liu] address comments
0ab0764 [Davies Liu] fix float
5097d54 [Davies Liu] add Hypothesis test Python API
2014-11-05 00:35:52 -05:00
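As a sanity check on the goodness-of-fit formula described above, here is a plain-Python sketch of the Pearson statistic with the uniform-expected default; `chi_sq_statistic` is a hypothetical helper for illustration, not the Spark API:

```python
def chi_sq_statistic(observed, expected=None):
    """Pearson's chi-squared statistic: sum((O - E)^2 / E) over categories."""
    if expected is None:
        # Uniform default: each category expects an equal share of the total.
        total = sum(observed)
        expected = [float(total) / len(observed)] * len(observed)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


stat = chi_sq_statistic([4, 6, 5, 5])
# total = 20, expected = 5 per category:
# (4-5)^2/5 + (6-5)^2/5 + 0 + 0 = 0.4
print(stat)  # 0.4
```

The real `chiSqTest` additionally derives degrees of freedom and a p-value from this statistic, and rescales `expected` when its sum differs from the observed sum.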
class Matrices(object):
    @staticmethod
    def dense(numRows, numCols, values):
        """
        Create a DenseMatrix
        """
        return DenseMatrix(numRows, numCols, values)

    @staticmethod
    def sparse(numRows, numCols, colPtrs, rowIndices, values):
        """
        Create a SparseMatrix
        """
        return SparseMatrix(numRows, numCols, colPtrs, rowIndices, values)
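Both factories take column-major data. A minimal NumPy check, with hypothetical values, that the same 2x2 matrix round-trips through both layouts:

```python
import numpy as np

# The 2x2 matrix [[9, 0], [0, 8]] expressed the way each factory expects it.
# Dense: values listed column by column; sparse: CSC arrays.
dense_vals = [9.0, 0.0, 0.0, 8.0]                       # columns concatenated
dense = np.array(dense_vals).reshape((2, 2), order='F')  # column-major reshape

colPtrs, rowIndices, values = [0, 1, 2], [0, 1], [9.0, 8.0]
sparse = np.zeros((2, 2))
for k in range(2):                                       # expand CSC to dense
    for p in range(colPtrs[k], colPtrs[k + 1]):
        sparse[rowIndices[p], k] = values[p]

assert (dense == sparse).all()
```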
def _test():
    import doctest
    (failure_count, test_count) = doctest.testmod(optionflags=doctest.ELLIPSIS)
    if failure_count:
        exit(-1)


if __name__ == "__main__":
    _test()
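The `_test` hook above runs every docstring example in the module via `doctest.testmod`. The same machinery can be pointed at a single function; `double` below is a hypothetical example, and `DocTestFinder`/`DocTestRunner` are used so the sketch does not depend on module-level scanning:

```python
import doctest


def double(x):
    """
    >>> double(21)
    42
    """
    return 2 * x


# Extract the docstring examples from one object and run them, mirroring
# what doctest.testmod(optionflags=doctest.ELLIPSIS) does for a whole module.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(optionflags=doctest.ELLIPSIS)
for test in finder.find(double, "double"):
    runner.run(test)

assert runner.failures == 0  # the >>> example above matched its output
```

The `ELLIPSIS` flag lets expected output in docstrings use `...` to elide variable parts, which is why `_test` enables it for the whole module.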