[WIP] SPARK-1430: Support sparse data in Python MLlib
This PR adds a SparseVector class in PySpark and updates all the regression, classification and clustering algorithms and models to support sparse data, similar to MLlib. I chose to add this class because SciPy is quite difficult to install in many environments (more so than NumPy), but I plan to add support for SciPy sparse vectors later too, and make the methods work transparently on objects of either type.
On the Scala side, we keep Python sparse vectors sparse and pass them to MLlib. We always return dense vectors from our models.
Some to-do items left:
- [x] Support SciPy's scipy.sparse matrix objects when SciPy is available. We can easily add a function to convert these to our own SparseVector.
- [x] MLlib currently uses a vector with one extra column on the left to represent what we call LabeledPoint in Scala. Do we really want this? It may get annoying once you deal with sparse data since you must add/subtract 1 to each feature index when training. We can remove this API in 1.0 and use tuples for labeling.
- [x] Explain how to use these in the Python MLlib docs.
CC @mengxr, @joshrosen
Author: Matei Zaharia <matei@databricks.com>
Closes #341 from mateiz/py-ml-update and squashes the following commits:
d52e763 [Matei Zaharia] Remove no-longer-needed slice code and handle review comments
ea5a25a [Matei Zaharia] Fix remaining uses of copyto() after merge
b9f97a3 [Matei Zaharia] Fix test
1e1bd0f [Matei Zaharia] Add MLlib logistic regression example in Python
88bc01f [Matei Zaharia] Clean up inheritance of LinearModel in Python, and expose its parameters
37ab747 [Matei Zaharia] Fix some examples and docs due to changes in MLlib API
da0f27e [Matei Zaharia] Added a MLlib K-means example and updated docs to discuss sparse data
c48e85a [Matei Zaharia] Added some tests for passing lists as input, and added mllib/tests.py to run-tests script.
a07ba10 [Matei Zaharia] Fix some typos and calculation of initial weights
74eefe7 [Matei Zaharia] Added LabeledPoint class in Python
889dde8 [Matei Zaharia] Support scipy.sparse matrices in all our algorithms and models
ab244d1 [Matei Zaharia] Allow SparseVectors to be initialized using a dict
a5d6426 [Matei Zaharia] Add linalg.py to run-tests script
0e7a3d8 [Matei Zaharia] Keep vectors sparse in Java when reading LabeledPoints
eaee759 [Matei Zaharia] Update regression, classification and clustering models for sparse data
2abbb44 [Matei Zaharia] Further work to get linear models working with sparse data
154f45d [Matei Zaharia] Update docs, name some magic values
881fef7 [Matei Zaharia] Added a sparse vector in Python and made Java-Python format more compact
2014-04-15 23:33:24 -04:00
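The PR description above is about adding a SparseVector that stores only `(size, indices, values)`. As a rough standalone sketch of why that layout pays off (plain Python, hypothetical helper name, not the actual PySpark API), a dot product only has to touch the nonzero entries:

```python
# Illustrative only: mimics the (size, indices, values) layout the real
# SparseVector uses; `sparse_dot` is a hypothetical name for this sketch.
def sparse_dot(size, indices, values, dense):
    """Dot product of a sparse vector with a dense sequence of length `size`."""
    assert len(dense) == size, "dimension mismatch"
    # Only iterate over the stored nonzeros, not all `size` positions.
    return sum(v * dense[i] for i, v in zip(indices, values))

# A vector of size 5 with nonzeros at positions 1 and 3:
result = sparse_dot(5, [1, 3], [2.0, 4.0], [1.0, 1.0, 1.0, 1.0, 1.0])
```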
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

"""
MLlib utilities for linear algebra. For dense vectors, MLlib
uses the NumPy C{array} type, so you can simply pass NumPy arrays
around. For sparse vectors, users can construct a L{SparseVector}
object from MLlib or pass SciPy C{scipy.sparse} column vectors if
SciPy is available in their environment.
"""

import sys
import array
import struct

if sys.version >= '3':
    basestring = str
    xrange = range
    import copyreg as copy_reg
    long = int
else:
    from itertools import izip as zip
    import copy_reg

import numpy as np

[SPARK-9656][MLLIB][PYTHON] Add missing methods to PySpark's Distributed Linear Algebra Classes
This PR adds the remaining group of methods to PySpark's distributed linear algebra classes as follows:
* `RowMatrix` <sup>**[1]**</sup>
1. `computeGramianMatrix`
2. `computeCovariance`
3. `computeColumnSummaryStatistics`
4. `columnSimilarities`
5. `tallSkinnyQR` <sup>**[2]**</sup>
* `IndexedRowMatrix` <sup>**[3]**</sup>
1. `computeGramianMatrix`
* `CoordinateMatrix`
1. `transpose`
* `BlockMatrix`
1. `validate`
2. `cache`
3. `persist`
4. `transpose`
**[1]**: Note: `multiply`, `computeSVD`, and `computePrincipalComponents` are already part of PR #7963 for SPARK-6227.
**[2]**: Implementing `tallSkinnyQR` uncovered a bug with our PySpark `RowMatrix` constructor. As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark. Although we are attempting to construct a `RowMatrix` from an `RDD[Vector]` in [PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115), the `Vector` type is erased, resulting in an `RDD[Object]`. Thus, when calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` in which an `Object` cannot be cast to a Spark `Vector`. As noted in the aforementioned dev list thread, this issue was also encountered with `DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a `Vector` type. Thus, this PR currently contains that fix applied to the `createRowMatrix` helper function in `PythonMLlibAPI`. `IndexedRowMatrix` and `CoordinateMatrix` do not appear to have this issue likely due to their related helper functions in `PythonMLlibAPI` creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types. However, this fix may be out of scope for this single PR, and it may be better suited in a separate JIRA/PR. Therefore, I have marked this PR as WIP and am open to discussion.
**[3]**: Note: `multiply` and `computeSVD` are already part of PR #7963 for SPARK-6227.
Author: Mike Dusenberry <mwdusenb@us.ibm.com>
Closes #9441 from dusenberrymw/SPARK-9656_Add_Missing_Methods_to_PySpark_Distributed_Linear_Algebra.
2016-04-27 13:48:05 -04:00
from pyspark import since
from pyspark.ml import linalg as newlinalg
from pyspark.sql.types import UserDefinedType, StructField, StructType, ArrayType, DoubleType, \
    IntegerType, ByteType, BooleanType

__all__ = ['Vector', 'DenseVector', 'SparseVector', 'Vectors',
           'Matrix', 'DenseMatrix', 'SparseMatrix', 'Matrices',
           'QRDecomposition']


if sys.version_info[:2] == (2, 7):
    # speed up pickling array in Python 2.7
    def fast_pickle_array(ar):
        return array.array, (ar.typecode, ar.tostring())
    copy_reg.pickle(array.array, fast_pickle_array)


# Check whether we have SciPy. MLlib works without it too, but if we have it, some methods,
# such as _dot and _serialize_double_vector, start to support scipy.sparse matrices.
try:
    import scipy.sparse
    _have_scipy = True
except ImportError:
    # No SciPy in environment, but that's okay
    _have_scipy = False


def _convert_to_vector(l):
    if isinstance(l, Vector):
        return l
    elif type(l) in (array.array, np.array, np.ndarray, list, tuple, xrange):
        return DenseVector(l)
    elif _have_scipy and scipy.sparse.issparse(l):
        assert l.shape[1] == 1, "Expected column vector"
[SPARK-20214][ML] Make sure converted csc matrix has sorted indices
## What changes were proposed in this pull request?
`_convert_to_vector` converts a scipy sparse matrix to csc matrix for initializing `SparseVector`. However, it doesn't guarantee the converted csc matrix has sorted indices and so a failure happens when you do something like that:
from scipy.sparse import lil_matrix
lil = lil_matrix((4, 1))
lil[1, 0] = 1
lil[3, 0] = 2
_convert_to_vector(lil.todok())
File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 78, in _convert_to_vector
return SparseVector(l.shape[0], csc.indices, csc.data)
File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 556, in __init__
% (self.indices[i], self.indices[i + 1]))
TypeError: Indices 3 and 1 are not strictly increasing
A simple test can confirm that `dok_matrix.tocsc()` won't guarantee sorted indices:
>>> from scipy.sparse import lil_matrix
>>> lil = lil_matrix((4, 1))
>>> lil[1, 0] = 1
>>> lil[3, 0] = 2
>>> dok = lil.todok()
>>> csc = dok.tocsc()
>>> csc.has_sorted_indices
0
>>> csc.indices
array([3, 1], dtype=int32)
I checked the source codes of scipy. The only way to guarantee it is `csc_matrix.tocsr()` and `csr_matrix.tocsc()`.
## How was this patch tested?
Existing tests.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #17532 from viirya/make-sure-sorted-indices.
2017-04-05 20:46:44 -04:00
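The fix the commit above describes boils down to restoring the strictly-increasing-indices invariant that SparseVector's constructor checks. A SciPy-free sketch of that invariant, with a hypothetical helper name:

```python
def sort_sparse_entries(indices, values):
    """Reorder (indices, values) so indices are strictly increasing,
    as SparseVector's constructor requires."""
    pairs = sorted(zip(indices, values))
    idx = [i for i, _ in pairs]
    vals = [v for _, v in pairs]
    # Strictly increasing also rules out duplicate indices.
    assert all(a < b for a, b in zip(idx, idx[1:])), "duplicate index"
    return idx, vals

# The unsorted indices from the reproduction above, [3, 1], get reordered:
idx, vals = sort_sparse_entries([3, 1], [2.0, 1.0])
```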
        # Make sure the converted csc_matrix has sorted indices.
        csc = l.tocsc()
        if not csc.has_sorted_indices:
            csc.sort_indices()
        return SparseVector(l.shape[0], csc.indices, csc.data)
    else:
        raise TypeError("Cannot convert type %s into Vector" % type(l))


def _vector_size(v):
    """
    Returns the size of the vector.

    >>> _vector_size([1., 2., 3.])
    3
    >>> _vector_size((1., 2., 3.))
    3
    >>> _vector_size(array.array('d', [1., 2., 3.]))
    3
    >>> _vector_size(np.zeros(3))
    3
    >>> _vector_size(np.zeros((3, 1)))
    3
    >>> _vector_size(np.zeros((1, 3)))
    Traceback (most recent call last):
        ...
    ValueError: Cannot treat an ndarray of shape (1, 3) as a vector
    """
    if isinstance(v, Vector):
        return len(v)
    elif type(v) in (array.array, list, tuple, xrange):
        return len(v)
    elif type(v) == np.ndarray:
        if v.ndim == 1 or (v.ndim == 2 and v.shape[1] == 1):
            return len(v)
        else:
            raise ValueError("Cannot treat an ndarray of shape %s as a vector" % str(v.shape))
    elif _have_scipy and scipy.sparse.issparse(v):
        assert v.shape[1] == 1, "Expected column vector"
        return v.shape[0]
    else:
        raise TypeError("Cannot treat type %s as a vector" % type(v))


def _format_float(f, digits=4):
    s = str(round(f, digits))
    if '.' in s:
        s = s[:s.index('.') + 1 + digits]
    return s


def _format_float_list(l):
    return [_format_float(x) for x in l]


def _double_to_long_bits(value):
    if np.isnan(value):
        value = float('nan')
    # pack double into 64 bits, then unpack as long int
    return struct.unpack('Q', struct.pack('d', value))[0]


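`_double_to_long_bits` mirrors Java's `Double.doubleToLongBits`: it reinterprets the 8 bytes of an IEEE 754 double as an unsigned 64-bit integer (the original also canonicalizes NaN first so all NaNs get the same bits). A minimal round-trip sketch with explicit little-endian formats:

```python
import struct

def double_to_long_bits(value):
    # Pack the double into 8 bytes, reinterpret as an unsigned 64-bit int.
    # (Sketch only: skips the NaN canonicalization the real helper does.)
    return struct.unpack('<Q', struct.pack('<d', value))[0]

bits = double_to_long_bits(1.0)   # IEEE 754 pattern for 1.0
back = struct.unpack('<d', struct.pack('<Q', bits))[0]  # recovers the double
```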
class VectorUDT(UserDefinedType):
    """
    SQL user-defined type (UDT) for Vector.
    """

    @classmethod
    def sqlType(cls):
        return StructType([
            StructField("type", ByteType(), False),
            StructField("size", IntegerType(), True),
            StructField("indices", ArrayType(IntegerType(), False), True),
            StructField("values", ArrayType(DoubleType(), False), True)])

    @classmethod
    def module(cls):
        return "pyspark.mllib.linalg"

    @classmethod
    def scalaUDT(cls):
        return "org.apache.spark.mllib.linalg.VectorUDT"

    def serialize(self, obj):
        if isinstance(obj, SparseVector):
            indices = [int(i) for i in obj.indices]
            values = [float(v) for v in obj.values]
            return (0, obj.size, indices, values)
        elif isinstance(obj, DenseVector):
            values = [float(v) for v in obj]
            return (1, None, None, values)
        else:
            raise TypeError("cannot serialize %r of type %r" % (obj, type(obj)))

    def deserialize(self, datum):
        assert len(datum) == 4, \
            "VectorUDT.deserialize given row with length %d but requires 4" % len(datum)
        tpe = datum[0]
        if tpe == 0:
            return SparseVector(datum[1], datum[2], datum[3])
        elif tpe == 1:
            return DenseVector(datum[3])
        else:
            raise ValueError("do not recognize type %r" % tpe)

    def simpleString(self):
        return "vector"


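VectorUDT flattens both vector kinds into one 4-field row, `(type, size, indices, values)`: type tag 0 is sparse (all four fields used), type tag 1 is dense (only values used). A plain-Python sketch of that encoding, with a hypothetical function name and no Spark dependency:

```python
def encode_vector(size=None, indices=None, values=None):
    """Mimic VectorUDT.serialize: tag 0 = sparse row, tag 1 = dense row."""
    if indices is not None:
        # Sparse: keep size, indices and values; cast like the real UDT does.
        return (0, size, [int(i) for i in indices], [float(v) for v in values])
    # Dense: size and indices are null, only the values array is stored.
    return (1, None, None, [float(v) for v in values])

sparse_row = encode_vector(size=4, indices=[1, 3], values=[5, 7])
dense_row = encode_vector(values=[1, 2, 3])
```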
class MatrixUDT(UserDefinedType):
    """
    SQL user-defined type (UDT) for Matrix.
    """

    @classmethod
    def sqlType(cls):
        return StructType([
            StructField("type", ByteType(), False),
            StructField("numRows", IntegerType(), False),
            StructField("numCols", IntegerType(), False),
            StructField("colPtrs", ArrayType(IntegerType(), False), True),
            StructField("rowIndices", ArrayType(IntegerType(), False), True),
            StructField("values", ArrayType(DoubleType(), False), True),
            StructField("isTransposed", BooleanType(), False)])

    @classmethod
    def module(cls):
        return "pyspark.mllib.linalg"

    @classmethod
    def scalaUDT(cls):
        return "org.apache.spark.mllib.linalg.MatrixUDT"

    def serialize(self, obj):
        if isinstance(obj, SparseMatrix):
            colPtrs = [int(i) for i in obj.colPtrs]
            rowIndices = [int(i) for i in obj.rowIndices]
            values = [float(v) for v in obj.values]
            return (0, obj.numRows, obj.numCols, colPtrs,
                    rowIndices, values, bool(obj.isTransposed))
        elif isinstance(obj, DenseMatrix):
            values = [float(v) for v in obj.values]
            return (1, obj.numRows, obj.numCols, None, None, values,
                    bool(obj.isTransposed))
        else:
            raise TypeError("cannot serialize type %r" % (type(obj)))

    def deserialize(self, datum):
        assert len(datum) == 7, \
            "MatrixUDT.deserialize given row with length %d but requires 7" % len(datum)
        tpe = datum[0]
        if tpe == 0:
            return SparseMatrix(*datum[1:])
        elif tpe == 1:
            return DenseMatrix(datum[1], datum[2], datum[5], datum[6])
        else:
            raise ValueError("do not recognize type %r" % tpe)

    def simpleString(self):
        return "matrix"


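The sparse rows above use compressed-sparse-column (CSC) layout: `colPtrs[j]` to `colPtrs[j + 1]` delimit the entries of column `j` inside `rowIndices` and `values`. A small pure-Python decoder (illustrative only, hypothetical name) makes that concrete:

```python
def csc_to_dense(numRows, numCols, colPtrs, rowIndices, values):
    """Expand compressed-sparse-column storage into a row-major nested list."""
    dense = [[0.0] * numCols for _ in range(numRows)]
    for j in range(numCols):
        # Entries of column j live at positions colPtrs[j]..colPtrs[j+1]-1.
        for k in range(colPtrs[j], colPtrs[j + 1]):
            dense[rowIndices[k]][j] = values[k]
    return dense

# 2x2 matrix with 9 at (0, 0) and 8 at (1, 1):
m = csc_to_dense(2, 2, [0, 1, 2], [0, 1], [9.0, 8.0])
```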
class Vector(object):

    __UDT__ = VectorUDT()

    """
    Abstract class for DenseVector and SparseVector
    """
    def toArray(self):
        """
        Convert the vector into a numpy.ndarray

        :return: numpy.ndarray
        """
        raise NotImplementedError

    def asML(self):
        """
        Convert this vector to the new mllib-local representation.
        This does NOT copy the data; it copies references.

        :return: :py:class:`pyspark.ml.linalg.Vector`
        """
        raise NotImplementedError


class DenseVector(Vector):
    """
    A dense vector represented by a value array. We use numpy array for
    storage and arithmetics will be delegated to the underlying numpy
    array.

    >>> v = Vectors.dense([1.0, 2.0])
    >>> u = Vectors.dense([3.0, 4.0])
    >>> v + u
    DenseVector([4.0, 6.0])
    >>> 2 - v
    DenseVector([1.0, 0.0])
    >>> v / 2
    DenseVector([0.5, 1.0])
    >>> v * u
    DenseVector([3.0, 8.0])
    >>> u / v
    DenseVector([3.0, 2.0])
    >>> u % 2
    DenseVector([1.0, 0.0])
    """
    def __init__(self, ar):
        if isinstance(ar, bytes):
            ar = np.frombuffer(ar, dtype=np.float64)
        elif not isinstance(ar, np.ndarray):
            ar = np.array(ar, dtype=np.float64)
        if ar.dtype != np.float64:
            ar = ar.astype(np.float64)
        self.array = ar

    @staticmethod
    def parse(s):
        """
        Parse string representation back into the DenseVector.

        >>> DenseVector.parse(' [ 0.0,1.0,2.0, 3.0]')
        DenseVector([0.0, 1.0, 2.0, 3.0])
        """
        start = s.find('[')
        if start == -1:
            raise ValueError("Array should start with '['.")
        end = s.find(']')
        if end == -1:
            raise ValueError("Array should end with ']'.")
        s = s[start + 1: end]

        try:
            values = [float(val) for val in s.split(',') if val]
        except ValueError:
            raise ValueError("Unable to parse values from %s" % s)
        return DenseVector(values)

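`DenseVector.parse` is a simple bracket-and-commas scanner. The same logic as a standalone function (hypothetical name, returning a plain list so the snippet has no dependencies):

```python
def parse_dense(s):
    """Find the bracketed region, split on commas, convert to floats."""
    start, end = s.find('['), s.find(']')
    if start == -1 or end == -1:
        raise ValueError("Expected a '[...]'-wrapped list of numbers")
    # `if val` skips empty fragments, e.g. from a trailing comma.
    return [float(val) for val in s[start + 1:end].split(',') if val]

vals = parse_dense(' [ 0.0,1.0,2.0, 3.0]')
```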
    def __reduce__(self):
        return DenseVector, (self.array.tostring(),)

    def numNonzeros(self):
        """
        Number of nonzero elements. This scans all active values and counts the nonzeros.
        """
        return np.count_nonzero(self.array)

    def norm(self, p):
        """
        Calculates the norm of a DenseVector.

        >>> a = DenseVector([0, -1, 2, -3])
        >>> a.norm(2)
        3.7...
        >>> a.norm(1)
        6.0
        """
        return np.linalg.norm(self.array, p)

    def dot(self, other):
        """
        Compute the dot product of two Vectors. We support
        (Numpy array, list, SparseVector, or SciPy sparse)
        and a target NumPy array that is either 1- or 2-dimensional.
        Equivalent to calling numpy.dot of the two vectors.

        >>> dense = DenseVector(array.array('d', [1., 2.]))
        >>> dense.dot(dense)
        5.0
        >>> dense.dot(SparseVector(2, [0, 1], [2., 1.]))
        4.0
        >>> dense.dot(range(1, 3))
        5.0
        >>> dense.dot(np.array(range(1, 3)))
        5.0
        >>> dense.dot([1.,])
        Traceback (most recent call last):
            ...
        AssertionError: dimension mismatch
        >>> dense.dot(np.reshape([1., 2., 3., 4.], (2, 2), order='F'))
        array([ 5., 11.])
        >>> dense.dot(np.reshape([1., 2., 3.], (3, 1), order='F'))
        Traceback (most recent call last):
            ...
        AssertionError: dimension mismatch
        """
        if type(other) == np.ndarray:
            if other.ndim > 1:
                assert len(self) == other.shape[0], "dimension mismatch"
            return np.dot(self.array, other)
        elif _have_scipy and scipy.sparse.issparse(other):
            assert len(self) == other.shape[0], "dimension mismatch"
            return other.transpose().dot(self.toArray())
        else:
            assert len(self) == _vector_size(other), "dimension mismatch"
            if isinstance(other, SparseVector):
                return other.dot(self)
            elif isinstance(other, Vector):
                return np.dot(self.toArray(), other.toArray())
            else:
                return np.dot(self.toArray(), other)

def squared_distance(self, other):
|
|
|
|
"""
|
|
|
|
Squared distance of two Vectors.
|
|
|
|
|
|
|
|
>>> dense1 = DenseVector(array.array('d', [1., 2.]))
|
|
|
|
>>> dense1.squared_distance(dense1)
|
|
|
|
0.0
|
|
|
|
>>> dense2 = np.array([2., 1.])
|
|
|
|
>>> dense1.squared_distance(dense2)
|
|
|
|
2.0
|
|
|
|
>>> dense3 = [2., 1.]
|
|
|
|
>>> dense1.squared_distance(dense3)
|
|
|
|
2.0
|
|
|
|
>>> sparse1 = SparseVector(2, [0, 1], [2., 1.])
|
|
|
|
>>> dense1.squared_distance(sparse1)
|
|
|
|
2.0
|
2014-09-30 20:10:36 -04:00
|
|
|
>>> dense1.squared_distance([1.,])
|
|
|
|
Traceback (most recent call last):
|
|
|
|
...
|
|
|
|
AssertionError: dimension mismatch
|
|
|
|
>>> dense1.squared_distance(SparseVector(1, [0,], [1.,]))
|
|
|
|
Traceback (most recent call last):
|
|
|
|
...
|
|
|
|
AssertionError: dimension mismatch
|
2014-09-19 18:01:11 -04:00
|
|
|
"""
|
2014-09-30 20:10:36 -04:00
|
|
|
assert len(self) == _vector_size(other), "dimension mismatch"
|
2014-09-19 18:01:11 -04:00
|
|
|
if isinstance(other, SparseVector):
|
|
|
|
return other.squared_distance(self)
|
|
|
|
elif _have_scipy and scipy.sparse.issparse(other):
|
|
|
|
return _convert_to_vector(other).squared_distance(self)
|
|
|
|
|
|
|
|
if isinstance(other, Vector):
|
|
|
|
other = other.toArray()
|
|
|
|
elif not isinstance(other, np.ndarray):
|
|
|
|
other = np.array(other)
|
|
|
|
diff = self.toArray() - other
|
|
|
|
return np.dot(diff, diff)
|
|
|
|
|
|
|
|
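Once both operands are dense, the computation above reduces to a sum of squared element-wise differences. A minimal pure-Python sketch (no NumPy, assuming two plain lists of equal length):

```python
def squared_distance(xs, ys):
    # Mirrors the dense branch of DenseVector.squared_distance:
    # sum of (x - y)^2 over all components.
    assert len(xs) == len(ys), "dimension mismatch"
    return sum((x - y) ** 2 for x, y in zip(xs, ys))

# squared_distance([1., 2.], [2., 1.]) -> 2.0, as in the doctest above
```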
    def toArray(self):
        """
        Returns a numpy.ndarray.
        """
        return self.array

    def asML(self):
        """
        Convert this vector to the new mllib-local representation.
        This does NOT copy the data; it copies references.

        :return: :py:class:`pyspark.ml.linalg.DenseVector`

        .. versionadded:: 2.0.0
        """
        return newlinalg.DenseVector(self.array)

    @property
    def values(self):
        """
        Returns the underlying array of values.
        """
        return self.array

    def __getitem__(self, item):
        return self.array[item]

    def __len__(self):
        return len(self.array)

    def __str__(self):
        return "[" + ",".join([str(v) for v in self.array]) + "]"

    def __repr__(self):
        return "DenseVector([%s])" % (', '.join(_format_float(i) for i in self.array))

    def __eq__(self, other):
        if isinstance(other, DenseVector):
            return np.array_equal(self.array, other.array)
        elif isinstance(other, SparseVector):
            if len(self) != other.size:
                return False
            return Vectors._equals(list(xrange(len(self))), self.array,
                                   other.indices, other.values)
        return False

    def __ne__(self, other):
        return not self == other

    def __hash__(self):
        size = len(self)
        result = 31 + size
        nnz = 0
        i = 0
        while i < size and nnz < 128:
            if self.array[i] != 0:
                result = 31 * result + i
                bits = _double_to_long_bits(self.array[i])
                result = 31 * result + (bits ^ (bits >> 32))
                nnz += 1
            i += 1
        return result

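The hash above mixes in the index and the IEEE-754 bit pattern of each of the first 128 nonzero values, so an equal dense and sparse vector hash identically. A standalone sketch of the same scheme, using `struct` as a stand-in for the module's `_double_to_long_bits` helper:

```python
import struct

def double_to_long_bits(value):
    # IEEE-754 bit pattern of a double, as a signed 64-bit integer
    # (stand-in for this module's _double_to_long_bits helper).
    return struct.unpack('>q', struct.pack('>d', value))[0]

def dense_hash(values):
    result = 31 + len(values)
    nnz = 0
    for i, v in enumerate(values):
        if nnz >= 128:  # only the first 128 nonzeros contribute
            break
        if v != 0:
            result = 31 * result + i
            bits = double_to_long_bits(v)
            result = 31 * result + (bits ^ (bits >> 32))
            nnz += 1
    return result
```

Because only nonzero entries feed the mix, a vector's hash does not depend on how its zeros are stored.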
    def __getattr__(self, item):
        return getattr(self.array, item)

    def __neg__(self):
        return DenseVector(-self.array)

    def _delegate(op):
        def func(self, other):
            if isinstance(other, DenseVector):
                other = other.array
            return DenseVector(getattr(self.array, op)(other))
        return func

    __add__ = _delegate("__add__")
    __sub__ = _delegate("__sub__")
    __mul__ = _delegate("__mul__")
    __div__ = _delegate("__div__")
    __truediv__ = _delegate("__truediv__")
    __mod__ = _delegate("__mod__")
    __radd__ = _delegate("__radd__")
    __rsub__ = _delegate("__rsub__")
    __rmul__ = _delegate("__rmul__")
    __rdiv__ = _delegate("__rdiv__")
    __rtruediv__ = _delegate("__rtruediv__")
    __rmod__ = _delegate("__rmod__")



class SparseVector(Vector):
"""
|
|
|
|
A simple sparse vector class for passing data to MLlib. Users may
|
|
|
|
alternatively pass SciPy's {scipy.sparse} data types.
|
|
|
|
"""
|
|
|
|
    def __init__(self, size, *args):
        """
        Create a sparse vector, using either a dictionary, a list of
        (index, value) pairs, or two separate arrays of indices and
        values (sorted by index).

        :param size: Size of the vector.
        :param args: Active entries, as a dictionary {index: value, ...},
          a list of tuples [(index, value), ...], or a list of strictly
          increasing indices and a list of corresponding values
          [index, ...], [value, ...]. Inactive entries are treated as zeros.

        >>> SparseVector(4, {1: 1.0, 3: 5.5})
        SparseVector(4, {1: 1.0, 3: 5.5})
        >>> SparseVector(4, [(1, 1.0), (3, 5.5)])
        SparseVector(4, {1: 1.0, 3: 5.5})
        >>> SparseVector(4, [1, 3], [1.0, 5.5])
        SparseVector(4, {1: 1.0, 3: 5.5})
        """
        self.size = int(size)
        """ Size of the vector. """
        assert 1 <= len(args) <= 2, "must pass either 2 or 3 arguments"
        if len(args) == 1:
            pairs = args[0]
            if type(pairs) == dict:
                pairs = pairs.items()
            pairs = sorted(pairs)
            self.indices = np.array([p[0] for p in pairs], dtype=np.int32)
            """ A list of indices corresponding to active entries. """
            self.values = np.array([p[1] for p in pairs], dtype=np.float64)
            """ A list of values corresponding to active entries. """
        else:
            if isinstance(args[0], bytes):
                assert isinstance(args[1], bytes), "values should be bytes too"
                if args[0]:
                    self.indices = np.frombuffer(args[0], np.int32)
                    self.values = np.frombuffer(args[1], np.float64)
                else:
                    # np.frombuffer() doesn't work well with empty strings in older versions
                    self.indices = np.array([], dtype=np.int32)
                    self.values = np.array([], dtype=np.float64)
            else:
                self.indices = np.array(args[0], dtype=np.int32)
                self.values = np.array(args[1], dtype=np.float64)
            assert len(self.indices) == len(self.values), "index and value arrays not same length"
        for i in xrange(len(self.indices) - 1):
            if self.indices[i] >= self.indices[i + 1]:
                raise TypeError(
                    "Indices %s and %s are not strictly increasing"
                    % (self.indices[i], self.indices[i + 1]))

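Setting aside the NumPy arrays and the bytes fast path, the constructor's normalization of a dict or pair-list into sorted parallel arrays can be sketched in pure Python (a simplified, hypothetical stand-in for `SparseVector.__init__`):

```python
def to_index_value_arrays(pairs):
    # Accepts {index: value, ...} or [(index, value), ...] and returns
    # sorted parallel lists of indices and values, as SparseVector does.
    if isinstance(pairs, dict):
        pairs = pairs.items()
    pairs = sorted(pairs)
    indices = [int(p[0]) for p in pairs]
    values = [float(p[1]) for p in pairs]
    # The real constructor raises TypeError when indices are not
    # strictly increasing; an assert suffices for this sketch.
    assert all(a < b for a, b in zip(indices, indices[1:])), \
        "indices are not strictly increasing"
    return indices, values
```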
    def numNonzeros(self):
        """
        Number of nonzero elements. This scans all active values and counts nonzeros.
        """
        return np.count_nonzero(self.values)

    def norm(self, p):
        """
        Calculates the norm of a SparseVector.

        >>> a = SparseVector(4, [0, 1], [3., -4.])
        >>> a.norm(1)
        7.0
        >>> a.norm(2)
        5.0
        """
        return np.linalg.norm(self.values, p)

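Because inactive entries are zero and contribute nothing to a p-norm, `norm` only needs the active values. A pure-Python sketch of the same computation for finite p >= 1:

```python
def sparse_norm(values, p):
    # p-norm over the active (stored) values only; zeros never contribute,
    # so the vector's size and indices are irrelevant here.
    return sum(abs(v) ** p for v in values) ** (1.0 / p)
```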
    def __reduce__(self):
        return (
            SparseVector,
            (self.size, self.indices.tostring(), self.values.tostring()))

    @staticmethod
    def parse(s):
        """
        Parse a string representation back into the SparseVector.

        >>> SparseVector.parse(' (4, [0,1 ],[ 4.0,5.0] )')
        SparseVector(4, {0: 4.0, 1: 5.0})
        """
        start = s.find('(')
        if start == -1:
            raise ValueError("Tuple should start with '('.")
        end = s.find(')')
        if end == -1:
            raise ValueError("Tuple should end with ')'.")
        s = s[start + 1: end].strip()

        size = s[: s.find(',')]
        try:
            size = int(size)
        except ValueError:
            raise ValueError("Cannot parse size %s." % size)

        ind_start = s.find('[')
        if ind_start == -1:
            raise ValueError("Indices array should start with '['.")
        ind_end = s.find(']')
        if ind_end == -1:
            raise ValueError("Indices array should end with ']'.")
        new_s = s[ind_start + 1: ind_end]
        ind_list = new_s.split(',')
        try:
            indices = [int(ind) for ind in ind_list if ind]
        except ValueError:
            raise ValueError("Unable to parse indices from %s." % new_s)
        s = s[ind_end + 1:].strip()

        val_start = s.find('[')
        if val_start == -1:
            raise ValueError("Values array should start with '['.")
        val_end = s.find(']')
        if val_end == -1:
            raise ValueError("Values array should end with ']'.")
        val_list = s[val_start + 1: val_end].split(',')
        try:
            values = [float(val) for val in val_list if val]
        except ValueError:
            raise ValueError("Unable to parse values from %s." % s)
        return SparseVector(size, indices, values)
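The index-scanning above pulls out three pieces in order: the size and two bracketed lists. An equivalent, more compact regex-based sketch (a hypothetical simplification that drops the per-piece error messages):

```python
import re

def parse_sparse(s):
    # Matches ' (size, [i0,i1,...], [v0,v1,...]) ' with optional whitespace;
    # returns the raw (size, indices, values) triple rather than a vector.
    m = re.search(r'\(\s*(\d+)\s*,\s*\[([^\]]*)\]\s*,\s*\[([^\]]*)\]\s*\)', s)
    if m is None:
        raise ValueError("Cannot parse %r as a sparse vector." % s)
    size = int(m.group(1))
    indices = [int(i) for i in m.group(2).split(',') if i.strip()]
    values = [float(v) for v in m.group(3).split(',') if v.strip()]
    return size, indices, values
```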

    def dot(self, other):
        """
        Dot product with a SparseVector or 1- or 2-dimensional NumPy array.

        >>> a = SparseVector(4, [1, 3], [3.0, 4.0])
        >>> a.dot(a)
        25.0
        >>> a.dot(array.array('d', [1., 2., 3., 4.]))
        22.0
        >>> b = SparseVector(4, [2], [1.0])
        >>> a.dot(b)
        0.0
        >>> a.dot(np.array([[1, 1], [2, 2], [3, 3], [4, 4]]))
        array([ 22.,  22.])
        >>> a.dot([1., 2., 3.])
        Traceback (most recent call last):
            ...
        AssertionError: dimension mismatch
        >>> a.dot(np.array([1., 2.]))
        Traceback (most recent call last):
            ...
        AssertionError: dimension mismatch
        >>> a.dot(DenseVector([1., 2.]))
        Traceback (most recent call last):
            ...
        AssertionError: dimension mismatch
        >>> a.dot(np.zeros((3, 2)))
        Traceback (most recent call last):
            ...
        AssertionError: dimension mismatch
        """
        if isinstance(other, np.ndarray):
            if other.ndim not in [2, 1]:
                raise ValueError("Cannot call dot with %d-dimensional array" % other.ndim)
            assert len(self) == other.shape[0], "dimension mismatch"
            return np.dot(self.values, other[self.indices])

        assert len(self) == _vector_size(other), "dimension mismatch"

        if isinstance(other, DenseVector):
            return np.dot(other.array[self.indices], self.values)

        elif isinstance(other, SparseVector):
            # Find the indices common to both vectors.
            self_cmind = np.in1d(self.indices, other.indices, assume_unique=True)
            self_values = self.values[self_cmind]
            if self_values.size == 0:
                return 0.0
            else:
                other_cmind = np.in1d(other.indices, self.indices, assume_unique=True)
                return np.dot(self_values, other.values[other_cmind])

        else:
            return self.dot(_convert_to_vector(other))
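The sparse-sparse branch above can be illustrated in isolation. This is a sketch with made-up `(indices, values)` data, relying on the guarantee that a SparseVector's stored indices are sorted and unique (which is what makes `assume_unique=True` safe):

```python
import numpy as np

# Two size-4 sparse vectors as (indices, values) pairs; indices are sorted
# and unique, mirroring SparseVector's internal representation.
a_ind, a_val = np.array([1, 3]), np.array([3.0, 4.0])
b_ind, b_val = np.array([1, 2, 3]), np.array([1.0, 5.0, 2.0])

# Mask each vector's stored entries down to the indices both share, so the
# product only touches positions that are nonzero in both vectors.
a_mask = np.in1d(a_ind, b_ind, assume_unique=True)
b_mask = np.in1d(b_ind, a_ind, assume_unique=True)
dot = float(np.dot(a_val[a_mask], b_val[b_mask]))  # 3*1 + 4*2 = 11.0
```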

    def squared_distance(self, other):
        """
        Squared distance from a SparseVector or 1-dimensional NumPy array.

        >>> a = SparseVector(4, [1, 3], [3.0, 4.0])
        >>> a.squared_distance(a)
        0.0
        >>> a.squared_distance(array.array('d', [1., 2., 3., 4.]))
        11.0
        >>> a.squared_distance(np.array([1., 2., 3., 4.]))
        11.0
        >>> b = SparseVector(4, [2], [1.0])
        >>> a.squared_distance(b)
        26.0
        >>> b.squared_distance(a)
        26.0
        >>> b.squared_distance([1., 2.])
        Traceback (most recent call last):
            ...
        AssertionError: dimension mismatch
        >>> b.squared_distance(SparseVector(3, [1,], [1.0,]))
        Traceback (most recent call last):
            ...
        AssertionError: dimension mismatch
        """
|
2014-09-30 20:10:36 -04:00
|
|
|
assert len(self) == _vector_size(other), "dimension mismatch"
|
2015-07-03 18:49:32 -04:00
|
|
|
|
|
|
|
if isinstance(other, np.ndarray) or isinstance(other, DenseVector):
|
|
|
|
if isinstance(other, np.ndarray) and other.ndim != 1:
|
[WIP] SPARK-1430: Support sparse data in Python MLlib
This PR adds a SparseVector class in PySpark and updates all the regression, classification and clustering algorithms and models to support sparse data, similar to MLlib. I chose to add this class because SciPy is quite difficult to install in many environments (more so than NumPy), but I plan to add support for SciPy sparse vectors later too, and make the methods work transparently on objects of either type.
On the Scala side, we keep Python sparse vectors sparse and pass them to MLlib. We always return dense vectors from our models.
                raise Exception("Cannot call squared_distance with %d-dimensional array" %
                                other.ndim)
        if isinstance(other, DenseVector):
            other = other.array
            sparse_ind = np.zeros(other.size, dtype=bool)
            sparse_ind[self.indices] = True
            dist = other[sparse_ind] - self.values
            result = np.dot(dist, dist)

            other_ind = other[~sparse_ind]
            result += np.dot(other_ind, other_ind)
            return result
        elif isinstance(other, SparseVector):
            result = 0.0
            i, j = 0, 0
            while i < len(self.indices) and j < len(other.indices):
                if self.indices[i] == other.indices[j]:
                    diff = self.values[i] - other.values[j]
                    result += diff * diff
                    i += 1
                    j += 1
                elif self.indices[i] < other.indices[j]:
                    result += self.values[i] * self.values[i]
                    i += 1
                else:
                    result += other.values[j] * other.values[j]
                    j += 1
            while i < len(self.indices):
                result += self.values[i] * self.values[i]
                i += 1
            while j < len(other.indices):
                result += other.values[j] * other.values[j]
                j += 1
            return result
        else:
            return self.squared_distance(_convert_to_vector(other))
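The sparse-sparse branch of `squared_distance` is a two-pointer merge over the two sorted index arrays: matching indices contribute a squared difference, and an index present in only one vector is matched against an implicit zero. A minimal standalone sketch of the same computation (hypothetical helper name, plain Python lists instead of NumPy arrays):

```python
def sparse_squared_distance(ind_a, val_a, ind_b, val_b):
    """Two-pointer merge over sorted index arrays, mirroring the loop above."""
    result, i, j = 0.0, 0, 0
    while i < len(ind_a) and j < len(ind_b):
        if ind_a[i] == ind_b[j]:
            diff = val_a[i] - val_b[j]
            result += diff * diff
            i += 1
            j += 1
        elif ind_a[i] < ind_b[j]:
            result += val_a[i] * val_a[i]  # index present only in a
            i += 1
        else:
            result += val_b[j] * val_b[j]  # index present only in b
            j += 1
    # Trailing entries on either side are matched against implicit zeros.
    result += sum(v * v for v in val_a[i:])
    result += sum(v * v for v in val_b[j:])
    return result
```

Because both index arrays are sorted, the merge runs in O(nnz_a + nnz_b) time without materializing dense copies.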
[SPARK-2850] [SPARK-2626] [mllib] MLlib stats examples + small fixes
Added examples for statistical summarization:
* Scala: StatisticalSummary.scala
** Tests: correlation, MultivariateOnlineSummarizer
* python: statistical_summary.py
** Tests: correlation (since MultivariateOnlineSummarizer has no Python API)
Added examples for random and sampled RDDs:
* Scala: RandomAndSampledRDDs.scala
* python: random_and_sampled_rdds.py
* Both test:
** RandomRDDGenerators.normalRDD, normalVectorRDD
** RDD.sample, takeSample, sampleByKey
Added sc.stop() to all examples.
CorrelationSuite.scala
* Added 1 test for RDDs with only 1 value
RowMatrix.scala
* numCols(): Added check for numRows = 0, with error message.
* computeCovariance(): Added check for numRows <= 1, with error message.
Python SparseVector (pyspark/mllib/linalg.py)
* Added toDense() function
python/run-tests script
* Added stat.py (doc test)
CC: mengxr dorx Main changes were examples to show usage across APIs.
Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
Closes #1878 from jkbradley/mllib-stats-api-check and squashes the following commits:
ea5c047 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
dafebe2 [Joseph K. Bradley] Bug fixes for examples SampledRDDs.scala and sampled_rdds.py: Check for division by 0 and for missing key in maps.
8d1e555 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
60c72d9 [Joseph K. Bradley] Fixed stat.py doc test to work for Python versions printing nan or NaN.
b20d90a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
4e5d15e [Joseph K. Bradley] Changed pyspark/mllib/stat.py doc tests to use NaN instead of nan.
32173b7 [Joseph K. Bradley] Stats examples update.
c8c20dc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
cf70b07 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
0b7cec3 [Joseph K. Bradley] Small updates based on code review. Renamed statistical_summary.py to correlations.py
ab48f6e [Joseph K. Bradley] RowMatrix.scala * numCols(): Added check for numRows = 0, with error message. * computeCovariance(): Added check for numRows <= 1, with error message.
65e4ebc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
8195c78 [Joseph K. Bradley] Added examples for random and sampled RDDs: * Scala: RandomAndSampledRDDs.scala * python: random_and_sampled_rdds.py * Both test: ** RandomRDDGenerators.normalRDD, normalVectorRDD ** RDD.sample, takeSample, sampleByKey
064985b [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
ee918e9 [Joseph K. Bradley] Added examples for statistical summarization: * Scala: StatisticalSummary.scala ** Tests: correlation, MultivariateOnlineSummarizer * python: statistical_summary.py ** Tests: correlation (since MultivariateOnlineSummarizer has no Python API)
2014-08-18 21:01:39 -04:00
    def toArray(self):
        """
        Returns a copy of this SparseVector as a 1-dimensional NumPy array.
        """
        arr = np.zeros((self.size,), dtype=np.float64)
        arr[self.indices] = self.values
        return arr
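`toArray` works by scattering the stored (index, value) pairs into a zero-filled dense buffer via NumPy fancy indexing. A plain-Python sketch of the same scatter, with a hypothetical helper name and lists in place of NumPy arrays:

```python
def to_array(size, indices, values):
    """Densify a sparse vector: scatter values at their indices over zeros."""
    arr = [0.0] * size          # implicit zeros for all coordinates
    for i, v in zip(indices, values):
        arr[i] = v              # same effect as arr[self.indices] = self.values
    return arr
```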

    def asML(self):
        """
        Convert this vector to the new mllib-local representation.
        This does NOT copy the data; it copies references.

        :return: :py:class:`pyspark.ml.linalg.SparseVector`

        .. versionadded:: 2.0.0
        """
        return newlinalg.SparseVector(self.size, self.indices, self.values)

    def __len__(self):
        return self.size
    def __str__(self):
        inds = "[" + ",".join([str(i) for i in self.indices]) + "]"
        vals = "[" + ",".join([str(v) for v in self.values]) + "]"
        return "(" + ",".join((str(self.size), inds, vals)) + ")"
    def __repr__(self):
        inds = self.indices
        vals = self.values
        entries = ", ".join(["{0}: {1}".format(inds[i], _format_float(vals[i]))
                             for i in xrange(len(inds))])
        return "SparseVector({0}, {{{1}}})".format(self.size, entries)

    def __eq__(self, other):
        if isinstance(other, SparseVector):
            return other.size == self.size and np.array_equal(other.indices, self.indices) \
                and np.array_equal(other.values, self.values)
        elif isinstance(other, DenseVector):
            if self.size != len(other):
                return False
            return Vectors._equals(self.indices, self.values, list(xrange(len(other))), other.array)
        return False
    def __getitem__(self, index):
        inds = self.indices
        vals = self.values
        if not isinstance(index, int):
            raise TypeError(
                "Indices must be of type integer, got type %s" % type(index))

        if index >= self.size or index < -self.size:
            raise IndexError("Index %d out of bounds." % index)
        if index < 0:
            index += self.size

        if (inds.size == 0) or (index > inds.item(-1)):
            return 0.

        insert_index = np.searchsorted(inds, index)
        row_ind = inds[insert_index]
        if row_ind == index:
            return vals[insert_index]
        return 0.
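The lookup in `__getitem__` relies on the indices array being sorted: `np.searchsorted` finds the would-be insertion point, and a stored value is returned only when an exact index match sits there; otherwise the coordinate is an implicit zero. A standalone sketch of the same logic (hypothetical helper name), using the stdlib `bisect` in place of `np.searchsorted`:

```python
import bisect

def sparse_get(size, indices, values, index):
    """Read one coordinate of a sparse vector; indices must be sorted."""
    if index >= size or index < -size:
        raise IndexError("Index %d out of bounds." % index)
    if index < 0:
        index += size                      # normalize negative indexing
    if not indices or index > indices[-1]:
        return 0.0                         # past the last stored index
    pos = bisect.bisect_left(indices, index)  # same role as np.searchsorted
    if indices[pos] == index:
        return values[pos]
    return 0.0
```

The early `index > indices[-1]` check also guards the `indices[pos]` access, since `bisect_left` can otherwise return `len(indices)`.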
    def __ne__(self, other):
        return not self.__eq__(other)

    def __hash__(self):
        result = 31 + self.size
        nnz = 0
        i = 0
        while i < len(self.values) and nnz < 128:
            if self.values[i] != 0:
                result = 31 * result + int(self.indices[i])
                bits = _double_to_long_bits(self.values[i])
                result = 31 * result + (bits ^ (bits >> 32))
                nnz += 1
            i += 1
        return result
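`__hash__` uses a Java-style 31-based combine over at most the first 128 nonzeros, folding each value's IEEE-754 bit pattern so the hash matches across representations. A sketch of the `_double_to_long_bits` helper it depends on, under the assumption that it is the usual `struct` round-trip (analogous to `java.lang.Double.doubleToLongBits`):

```python
import struct

def double_to_long_bits(value):
    # Reinterpret the 8 bytes of an IEEE-754 double as a signed 64-bit int.
    return struct.unpack("<q", struct.pack("<d", value))[0]
```

Hashing the bit pattern rather than the float itself avoids surprises from `-0.0` and keeps the combine purely integer arithmetic.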
|
|
|
|
|
[WIP] SPARK-1430: Support sparse data in Python MLlib
This PR adds a SparseVector class in PySpark and updates all the regression, classification and clustering algorithms and models to support sparse data, similar to MLlib. I chose to add this class because SciPy is quite difficult to install in many environments (more so than NumPy), but I plan to add support for SciPy sparse vectors later too, and make the methods work transparently on objects of either type.
On the Scala side, we keep Python sparse vectors sparse and pass them to MLlib. We always return dense vectors from our models.
Some to-do items left:
- [x] Support SciPy's scipy.sparse matrix objects when SciPy is available. We can easily add a function to convert these to our own SparseVector.
- [x] MLlib currently uses a vector with one extra column on the left to represent what we call LabeledPoint in Scala. Do we really want this? It may get annoying once you deal with sparse data since you must add/subtract 1 to each feature index when training. We can remove this API in 1.0 and use tuples for labeling.
- [x] Explain how to use these in the Python MLlib docs.
CC @mengxr, @joshrosen
Author: Matei Zaharia <matei@databricks.com>
Closes #341 from mateiz/py-ml-update and squashes the following commits:
d52e763 [Matei Zaharia] Remove no-longer-needed slice code and handle review comments
ea5a25a [Matei Zaharia] Fix remaining uses of copyto() after merge
b9f97a3 [Matei Zaharia] Fix test
1e1bd0f [Matei Zaharia] Add MLlib logistic regression example in Python
88bc01f [Matei Zaharia] Clean up inheritance of LinearModel in Python, and expose its parametrs
37ab747 [Matei Zaharia] Fix some examples and docs due to changes in MLlib API
da0f27e [Matei Zaharia] Added a MLlib K-means example and updated docs to discuss sparse data
c48e85a [Matei Zaharia] Added some tests for passing lists as input, and added mllib/tests.py to run-tests script.
a07ba10 [Matei Zaharia] Fix some typos and calculation of initial weights
74eefe7 [Matei Zaharia] Added LabeledPoint class in Python
889dde8 [Matei Zaharia] Support scipy.sparse matrices in all our algorithms and models
ab244d1 [Matei Zaharia] Allow SparseVectors to be initialized using a dict
a5d6426 [Matei Zaharia] Add linalg.py to run-tests script
0e7a3d8 [Matei Zaharia] Keep vectors sparse in Java when reading LabeledPoints
eaee759 [Matei Zaharia] Update regression, classification and clustering models for sparse data
2abbb44 [Matei Zaharia] Further work to get linear models working with sparse data
154f45d [Matei Zaharia] Update docs, name some magic values
881fef7 [Matei Zaharia] Added a sparse vector in Python and made Java-Python format more compact
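The PR above lets a SparseVector be built from a dict, a list of (index, value) pairs, or two parallel arrays. A stdlib-only sketch of normalizing those three input shapes into sorted parallel arrays (the helper name `normalize_sparse_args` is hypothetical, not part of PySpark):

```python
def normalize_sparse_args(size, *args):
    """Normalize the three SparseVector input shapes into (size, indices, values)
    with indices sorted ascending. Illustrative helper only."""
    if len(args) == 1:
        # A dict {index: value} or a list of (index, value) pairs.
        pairs = sorted(args[0].items()) if isinstance(args[0], dict) else sorted(args[0])
    else:
        # Two parallel sequences: indices and values.
        indices, values = args
        pairs = sorted(zip(indices, values))
    indices = [int(i) for i, _ in pairs]
    values = [float(v) for _, v in pairs]
    assert all(0 <= i < size for i in indices), "index out of range"
    return size, indices, values
```

All three call shapes below yield the same canonical form, which is why the factory method can simply forward `*args` to the SparseVector constructor.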
class Vectors(object):

    """
    Factory methods for working with vectors.

    .. note:: Dense vectors are simply represented as NumPy array objects,
        so there is no need to convert them for use in MLlib. For sparse vectors,
        the factory methods in this class create an MLlib-compatible type, or users
        can pass in SciPy's C{scipy.sparse} column vectors.
    """

    @staticmethod
    def sparse(size, *args):
        """
        Create a sparse vector, using either a dictionary, a list of
        (index, value) pairs, or two separate arrays of indices and
        values (sorted by index).

        :param size: Size of the vector.
        :param args: Non-zero entries, as a dictionary, list of tuples,
                     or two sorted lists containing indices and values.

        >>> Vectors.sparse(4, {1: 1.0, 3: 5.5})
        SparseVector(4, {1: 1.0, 3: 5.5})
        >>> Vectors.sparse(4, [(1, 1.0), (3, 5.5)])
        SparseVector(4, {1: 1.0, 3: 5.5})
        >>> Vectors.sparse(4, [1, 3], [1.0, 5.5])
        SparseVector(4, {1: 1.0, 3: 5.5})
        """
        return SparseVector(size, *args)

    @staticmethod
    def dense(*elements):
        """
        Create a dense vector of 64-bit floats from a Python list or numbers.

        >>> Vectors.dense([1, 2, 3])
        DenseVector([1.0, 2.0, 3.0])
        >>> Vectors.dense(1.0, 2.0)
        DenseVector([1.0, 2.0])
        """
        if len(elements) == 1 and not isinstance(elements[0], (float, int, long)):
            # it's a list, numpy.array or other iterable object.
            elements = elements[0]
        return DenseVector(elements)

    @staticmethod
    def fromML(vec):
        """
        Convert a vector from the new mllib-local representation.
        This does NOT copy the data; it copies references.

        :param vec: a :py:class:`pyspark.ml.linalg.Vector`
        :return: a :py:class:`pyspark.mllib.linalg.Vector`

        .. versionadded:: 2.0.0
        """
        if isinstance(vec, newlinalg.DenseVector):
            return DenseVector(vec.array)
        elif isinstance(vec, newlinalg.SparseVector):
            return SparseVector(vec.size, vec.indices, vec.values)
        else:
            raise TypeError("Unsupported vector type %s" % type(vec))

    @staticmethod
    def stringify(vector):
        """
        Converts a vector into a string, which can be recognized by
        Vectors.parse().

        >>> Vectors.stringify(Vectors.sparse(2, [1], [1.0]))
        '(2,[1],[1.0])'
        >>> Vectors.stringify(Vectors.dense([0.0, 1.0]))
        '[0.0,1.0]'
        """
        return str(vector)

    @staticmethod
    def squared_distance(v1, v2):
        """
        Squared distance between two vectors.
        v1 and v2 can be of type SparseVector, DenseVector, np.ndarray
        or array.array.

        >>> a = Vectors.sparse(4, [(0, 1), (3, 4)])
        >>> b = Vectors.dense([2, 5, 4, 1])
        >>> a.squared_distance(b)
        51.0
        """
        v1, v2 = _convert_to_vector(v1), _convert_to_vector(v2)
        return v1.squared_distance(v2)
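A minimal sketch of what `squared_distance` computes for a sparse/dense pair, matching the doctest's numbers. This is stdlib-only illustration, not the MLlib implementation (which avoids materializing a dict):

```python
def squared_distance_sparse_dense(size, indices, values, dense):
    # Active sparse entries contribute (value - dense[i])**2;
    # inactive positions contribute dense[i]**2 (the sparse side is 0 there).
    active = dict(zip(indices, values))
    return sum((active.get(i, 0.0) - dense[i]) ** 2 for i in range(size))
```

For `a = (4, [0, 3], [1.0, 4.0])` and `b = [2, 5, 4, 1]` the per-slot differences are 1, 5, 4, 3, giving 1 + 25 + 16 + 9 = 51.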

    @staticmethod
    def norm(vector, p):
        """
        Find norm of the given vector.
        """
        return _convert_to_vector(vector).norm(p)

    @staticmethod
    def parse(s):
        """Parse a string representation back into the Vector.

        >>> Vectors.parse('[2,1,2 ]')
        DenseVector([2.0, 1.0, 2.0])
        >>> Vectors.parse(' ( 100, [0], [2])')
        SparseVector(100, {0: 2.0})
        """
        if s.find('(') == -1 and s.find('[') != -1:
            return DenseVector.parse(s)
        elif s.find('(') != -1:
            return SparseVector.parse(s)
        else:
            raise ValueError(
                "Cannot find tokens '[' or '(' from the input string.")
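The dispatch above only picks the format; the per-type `parse` methods do the real work. A stdlib-only sketch of both formats in one hypothetical helper (`parse_vector_string` is not a PySpark API):

```python
import re

def parse_vector_string(s):
    """Parse the '[v0,v1,...]' dense and '(size,[indices],[values])' sparse
    string forms used by Vectors.stringify/parse. Illustrative only."""
    s = s.strip()
    if s.startswith('['):  # dense: '[0.0,1.0]'
        body = s.strip('[] ')
        return [float(x) for x in body.split(',')] if body else []
    if s.startswith('('):  # sparse: '(2,[1],[1.0])'
        m = re.match(r'\(\s*(\d+)\s*,\s*\[([^\]]*)\]\s*,\s*\[([^\]]*)\]\s*\)', s)
        if m is None:
            raise ValueError("Cannot parse: %r" % s)
        size = int(m.group(1))
        indices = [int(x) for x in m.group(2).split(',')] if m.group(2).strip() else []
        values = [float(x) for x in m.group(3).split(',')] if m.group(3).strip() else []
        return size, indices, values
    raise ValueError("Cannot find tokens '[' or '(' from the input string.")
```

It tolerates the stray whitespace shown in the doctests (`'[2,1,2 ]'`, `' ( 100, [0], [2])'`).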

    @staticmethod
    def zeros(size):
        return DenseVector(np.zeros(size))

    @staticmethod
    def _equals(v1_indices, v1_values, v2_indices, v2_values):
        """
        Check equality between sparse/dense vectors,
        v1_indices and v2_indices are assumed to be strictly increasing.
        """
        v1_size = len(v1_values)
        v2_size = len(v2_values)
        k1 = 0
        k2 = 0
        all_equal = True
        while all_equal:
            while k1 < v1_size and v1_values[k1] == 0:
                k1 += 1
            while k2 < v2_size and v2_values[k2] == 0:
                k2 += 1

            if k1 >= v1_size or k2 >= v2_size:
                return k1 >= v1_size and k2 >= v2_size

            all_equal = v1_indices[k1] == v2_indices[k2] and v1_values[k1] == v2_values[k2]
            k1 += 1
            k2 += 1
        return all_equal
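The point of the two-pointer walk above is that explicitly stored zeros are skipped, so two vectors compare equal even when one stores a zero entry the other omits. A self-contained restatement of the same logic (hypothetical standalone function, for illustration):

```python
def sparse_equals(v1_indices, v1_values, v2_indices, v2_values):
    """Two-pointer comparison of sparse representations, skipping stored zeros,
    mirroring Vectors._equals. Indices are assumed strictly increasing."""
    k1, k2 = 0, 0
    while True:
        # Advance each pointer past explicitly stored zero values.
        while k1 < len(v1_values) and v1_values[k1] == 0:
            k1 += 1
        while k2 < len(v2_values) and v2_values[k2] == 0:
            k2 += 1
        # Equal iff both sides run out of active entries together.
        if k1 >= len(v1_values) or k2 >= len(v2_values):
            return k1 >= len(v1_values) and k2 >= len(v2_values)
        # Active entries must match in both index and value.
        if v1_indices[k1] != v2_indices[k2] or v1_values[k1] != v2_values[k2]:
            return False
        k1 += 1
        k2 += 1
```

For example, `([0, 1, 3], [1.0, 0.0, 2.0])` equals `([0, 3], [1.0, 2.0])`: the stored zero at index 1 is skipped.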


class Matrix(object):

    __UDT__ = MatrixUDT()

    """
    Represents a local matrix.
    """
    def __init__(self, numRows, numCols, isTransposed=False):
        self.numRows = numRows
        self.numCols = numCols
        self.isTransposed = isTransposed

    def toArray(self):
        """
        Returns its elements in a NumPy ndarray.
        """
        raise NotImplementedError

    def asML(self):
        """
        Convert this matrix to the new mllib-local representation.
        This does NOT copy the data; it copies references.
        """
        raise NotImplementedError

    @staticmethod
    def _convert_to_array(array_like, dtype):
        """
        Convert Matrix attributes which are array-like or buffer to array.
        """
        if isinstance(array_like, bytes):
            return np.frombuffer(array_like, dtype=dtype)
        return np.asarray(array_like, dtype=dtype)


class DenseMatrix(Matrix):
    """
    Column-major dense matrix.
    """
    def __init__(self, numRows, numCols, values, isTransposed=False):
        Matrix.__init__(self, numRows, numCols, isTransposed)
        values = self._convert_to_array(values, np.float64)
        assert len(values) == numRows * numCols
        self.values = values

    def __reduce__(self):
        return DenseMatrix, (
            self.numRows, self.numCols, self.values.tostring(),
            int(self.isTransposed))

    def __str__(self):
        """
        Pretty printing of a DenseMatrix

        >>> dm = DenseMatrix(2, 2, range(4))
        >>> print(dm)
        DenseMatrix([[ 0., 2.],
                     [ 1., 3.]])
        >>> dm = DenseMatrix(2, 2, range(4), isTransposed=True)
        >>> print(dm)
        DenseMatrix([[ 0., 1.],
                     [ 2., 3.]])
        """
        # Inspired by __repr__ in scipy matrices.
        array_lines = repr(self.toArray()).splitlines()

        # We need to adjust six spaces which is the difference in number
        # of letters between "DenseMatrix" and "array"
        x = '\n'.join([(" " * 6 + line) for line in array_lines[1:]])
        return array_lines[0].replace("array", "DenseMatrix") + "\n" + x

    def __repr__(self):
        """
        Representation of a DenseMatrix

        >>> dm = DenseMatrix(2, 2, range(4))
        >>> dm
        DenseMatrix(2, 2, [0.0, 1.0, 2.0, 3.0], False)
        """
        # If the number of values is less than seventeen then return as it is.
        # Else return first eight values and last eight values.
        if len(self.values) < 17:
            entries = _format_float_list(self.values)
        else:
            entries = (
                _format_float_list(self.values[:8]) +
                ["..."] +
                _format_float_list(self.values[-8:])
            )

        entries = ", ".join(entries)
        return "DenseMatrix({0}, {1}, [{2}], {3})".format(
            self.numRows, self.numCols, entries, self.isTransposed)

    def toArray(self):
        """
        Return a numpy.ndarray

        >>> m = DenseMatrix(2, 2, range(4))
        >>> m.toArray()
        array([[ 0., 2.],
               [ 1., 3.]])
        """
        if self.isTransposed:
            return np.asfortranarray(
                self.values.reshape((self.numRows, self.numCols)))
        else:
            return self.values.reshape((self.numRows, self.numCols), order='F')

    def toSparse(self):
        """Convert to SparseMatrix"""
        if self.isTransposed:
            values = np.ravel(self.toArray(), order='F')
        else:
            values = self.values
        indices = np.nonzero(values)[0]
        colCounts = np.bincount(indices // self.numRows)
        colPtrs = np.cumsum(np.hstack(
            (0, colCounts, np.zeros(self.numCols - colCounts.size))))
        values = values[indices]
        rowIndices = indices % self.numRows

        return SparseMatrix(self.numRows, self.numCols, colPtrs, rowIndices, values)
|
|
|
|
|
2016-06-30 20:52:15 -04:00
|
|
|
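The CSC construction in `toSparse` can be sketched without pyspark: take the nonzero positions of the column-major flat array, count nonzeros per column with `bincount`, and turn the counts into column pointers with a cumulative sum. The example matrix below is an assumption chosen for illustration.

```python
import numpy as np

# A 3x2 matrix in column-major form: col0 = [1, 0, 2], col1 = [0, 3, 0].
numRows, numCols = 3, 2
values = np.array([1.0, 0.0, 2.0, 0.0, 3.0, 0.0])

indices = np.nonzero(values)[0]               # flat positions of nonzeros
colCounts = np.bincount(indices // numRows)   # nonzeros in each column
colPtrs = np.cumsum(np.hstack(
    (0, colCounts, np.zeros(numCols - colCounts.size))))
rowIndices = indices % numRows                # row of each nonzero
nzValues = values[indices]
```

For this input the CSC triplet comes out as `colPtrs = [0, 2, 3]`, `rowIndices = [0, 2, 1]`, `nzValues = [1., 2., 3.]`: column 0 owns the slice `0:2`, column 1 the slice `2:3`.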
    def asML(self):
        """
        Convert this matrix to the new mllib-local representation.
        This does NOT copy the data; it copies references.

        :return: :py:class:`pyspark.ml.linalg.DenseMatrix`

        .. versionadded:: 2.0.0
        """
        return newlinalg.DenseMatrix(self.numRows, self.numCols, self.values, self.isTransposed)
    def __getitem__(self, indices):
        i, j = indices
        if i < 0 or i >= self.numRows:
            raise IndexError("Row index %d is out of range [0, %d)"
                             % (i, self.numRows))
        if j >= self.numCols or j < 0:
            raise IndexError("Column index %d is out of range [0, %d)"
                             % (j, self.numCols))

        if self.isTransposed:
            return self.values[i * self.numCols + j]
        else:
            return self.values[i + j * self.numRows]
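The flat-index arithmetic in `__getitem__` is worth seeing in isolation: entry `(i, j)` lives at `i + j * numRows` in column-major storage, or at `i * numCols + j` when the values are stored row-major. This is a minimal sketch with an assumed 2x3 example, not pyspark code.

```python
import numpy as np

# Column-major 2x3 storage of [0..5]: the matrix is [[0., 2., 4.], [1., 3., 5.]].
numRows, numCols = 2, 3
values = np.arange(6, dtype=np.float64)

def get(i, j, isTransposed=False):
    # Same index arithmetic as DenseMatrix.__getitem__.
    if isTransposed:
        return values[i * numCols + j]
    return values[i + j * numRows]
```

`get(0, 1)` returns `2.0` (second column, first row) and `get(1, 2)` returns `5.0`.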
    def __eq__(self, other):
        if (not isinstance(other, DenseMatrix) or
                self.numRows != other.numRows or
                self.numCols != other.numCols):
            return False

        self_values = np.ravel(self.toArray(), order='F')
        other_values = np.ravel(other.toArray(), order='F')
        return all(self_values == other_values)
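The equality check above compares column-major flattenings so that two matrices with the same entries compare equal regardless of how they are stored internally. A small sketch of that idea, with assumed example arrays:

```python
import numpy as np

# Same entries, different memory layouts: flattening both in Fortran order
# makes the comparison layout-independent.
a = np.array([[0.0, 2.0], [1.0, 3.0]])  # C-ordered
b = np.asfortranarray(a)                # F-ordered copy of the same values
same = bool(np.all(np.ravel(a, order='F') == np.ravel(b, order='F')))
```

Here `same` is `True` even though `a` and `b` store their elements in different orders in memory.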
[WIP] SPARK-1430: Support sparse data in Python MLlib
This PR adds a SparseVector class in PySpark and updates all the regression, classification and clustering algorithms and models to support sparse data, similar to MLlib. I chose to add this class because SciPy is quite difficult to install in many environments (more so than NumPy), but I plan to add support for SciPy sparse vectors later too, and make the methods work transparently on objects of either type.
On the Scala side, we keep Python sparse vectors sparse and pass them to MLlib. We always return dense vectors from our models.
Some to-do items left:
- [x] Support SciPy's scipy.sparse matrix objects when SciPy is available. We can easily add a function to convert these to our own SparseVector.
- [x] MLlib currently uses a vector with one extra column on the left to represent what we call LabeledPoint in Scala. Do we really want this? It may get annoying once you deal with sparse data since you must add/subtract 1 to each feature index when training. We can remove this API in 1.0 and use tuples for labeling.
- [x] Explain how to use these in the Python MLlib docs.
CC @mengxr, @joshrosen
Author: Matei Zaharia <matei@databricks.com>
Closes #341 from mateiz/py-ml-update and squashes the following commits:
d52e763 [Matei Zaharia] Remove no-longer-needed slice code and handle review comments
ea5a25a [Matei Zaharia] Fix remaining uses of copyto() after merge
b9f97a3 [Matei Zaharia] Fix test
1e1bd0f [Matei Zaharia] Add MLlib logistic regression example in Python
88bc01f [Matei Zaharia] Clean up inheritance of LinearModel in Python, and expose its parametrs
37ab747 [Matei Zaharia] Fix some examples and docs due to changes in MLlib API
da0f27e [Matei Zaharia] Added a MLlib K-means example and updated docs to discuss sparse data
c48e85a [Matei Zaharia] Added some tests for passing lists as input, and added mllib/tests.py to run-tests script.
a07ba10 [Matei Zaharia] Fix some typos and calculation of initial weights
74eefe7 [Matei Zaharia] Added LabeledPoint class in Python
889dde8 [Matei Zaharia] Support scipy.sparse matrices in all our algorithms and models
ab244d1 [Matei Zaharia] Allow SparseVectors to be initialized using a dict
a5d6426 [Matei Zaharia] Add linalg.py to run-tests script
0e7a3d8 [Matei Zaharia] Keep vectors sparse in Java when reading LabeledPoints
eaee759 [Matei Zaharia] Update regression, classification and clustering models for sparse data
2abbb44 [Matei Zaharia] Further work to get linear models working with sparse data
154f45d [Matei Zaharia] Update docs, name some magic values
881fef7 [Matei Zaharia] Added a sparse vector in Python and made Java-Python format more compact
2014-04-15 23:33:24 -04:00

class SparseMatrix(Matrix):
    """Sparse Matrix stored in CSC format."""
    def __init__(self, numRows, numCols, colPtrs, rowIndices, values,
                 isTransposed=False):
        Matrix.__init__(self, numRows, numCols, isTransposed)
        self.colPtrs = self._convert_to_array(colPtrs, np.int32)
        self.rowIndices = self._convert_to_array(rowIndices, np.int32)
        self.values = self._convert_to_array(values, np.float64)

        if self.isTransposed:
            if self.colPtrs.size != numRows + 1:
                raise ValueError("Expected colPtrs of size %d, got %d."
                                 % (numRows + 1, self.colPtrs.size))
        else:
            if self.colPtrs.size != numCols + 1:
                raise ValueError("Expected colPtrs of size %d, got %d."
                                 % (numCols + 1, self.colPtrs.size))
        if self.rowIndices.size != self.values.size:
            raise ValueError("Expected rowIndices of length %d, got %d."
                             % (self.values.size, self.rowIndices.size))
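To make the CSC layout concrete, here is a standalone walk back to dense form: column `j`'s nonzeros occupy the slice `colPtrs[j]:colPtrs[j+1]` of `rowIndices` and `values`. The example triplet matches the one used in this class's doctests.

```python
import numpy as np

# CSC triplet for the 2x2 matrix [[2., 0.], [3., 4.]].
numRows, numCols = 2, 2
colPtrs = [0, 2, 3]
rowIndices = [0, 1, 1]
values = [2.0, 3.0, 4.0]

dense = np.zeros((numRows, numCols))
for j in range(numCols):
    # Nonzeros of column j live in the half-open slice colPtrs[j]:colPtrs[j+1].
    for k in range(colPtrs[j], colPtrs[j + 1]):
        dense[rowIndices[k], j] = values[k]
```

The loop scatters `2.0` and `3.0` into column 0 and `4.0` into column 1, recovering `[[2., 0.], [3., 4.]]`.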
    def __str__(self):
        """
        Pretty printing of a SparseMatrix

        >>> sm1 = SparseMatrix(2, 2, [0, 2, 3], [0, 1, 1], [2, 3, 4])
        >>> print(sm1)
        2 X 2 CSCMatrix
        (0,0) 2.0
        (1,0) 3.0
        (1,1) 4.0
        >>> sm1 = SparseMatrix(2, 2, [0, 2, 3], [0, 1, 1], [2, 3, 4], True)
        >>> print(sm1)
        2 X 2 CSRMatrix
        (0,0) 2.0
        (0,1) 3.0
        (1,1) 4.0
        """
        spstr = "{0} X {1} ".format(self.numRows, self.numCols)
        if self.isTransposed:
            spstr += "CSRMatrix\n"
        else:
            spstr += "CSCMatrix\n"

        cur_col = 0
        smlist = []

        # Display at most the first 16 values.
        if len(self.values) <= 16:
            zipindval = zip(self.rowIndices, self.values)
        else:
            zipindval = zip(self.rowIndices[:16], self.values[:16])
        for i, (rowInd, value) in enumerate(zipindval):
            if self.colPtrs[cur_col + 1] <= i:
                cur_col += 1
            if self.isTransposed:
                smlist.append('({0},{1}) {2}'.format(
                    cur_col, rowInd, _format_float(value)))
            else:
                smlist.append('({0},{1}) {2}'.format(
                    rowInd, cur_col, _format_float(value)))
        spstr += "\n".join(smlist)

        if len(self.values) > 16:
            spstr += "\n.." * 2
        return spstr
    def __repr__(self):
        """
        Representation of a SparseMatrix

        >>> sm1 = SparseMatrix(2, 2, [0, 2, 3], [0, 1, 1], [2, 3, 4])
        >>> sm1
        SparseMatrix(2, 2, [0, 2, 3], [0, 1, 1], [2.0, 3.0, 4.0], False)
        """
        rowIndices = list(self.rowIndices)
        colPtrs = list(self.colPtrs)

        if len(self.values) <= 16:
            values = _format_float_list(self.values)
        else:
            values = (
                _format_float_list(self.values[:8]) +
                ["..."] +
                _format_float_list(self.values[-8:])
            )
            rowIndices = rowIndices[:8] + ["..."] + rowIndices[-8:]

        if len(self.colPtrs) > 16:
            colPtrs = colPtrs[:8] + ["..."] + colPtrs[-8:]

        values = ", ".join(values)
        rowIndices = ", ".join([str(ind) for ind in rowIndices])
        colPtrs = ", ".join([str(ptr) for ptr in colPtrs])
        return "SparseMatrix({0}, {1}, [{2}], [{3}], [{4}], {5})".format(
            self.numRows, self.numCols, colPtrs, rowIndices,
            values, self.isTransposed)
    def __reduce__(self):
        return SparseMatrix, (
            self.numRows, self.numCols, self.colPtrs.tostring(),
            self.rowIndices.tostring(), self.values.tostring(),
            int(self.isTransposed))
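`__reduce__` above pickles each index/value array as raw bytes and lets the constructor rebuild them. The round-trip idea can be sketched directly with NumPy (`tostring()` is the older spelling of `tobytes()`; `frombuffer` reverses it given the right dtype):

```python
import numpy as np

# Serialize an int32 array to raw bytes and restore it, as the pickle
# protocol for SparseMatrix does for colPtrs/rowIndices/values.
colPtrs = np.array([0, 2, 3], dtype=np.int32)
raw = colPtrs.tobytes()
restored = np.frombuffer(raw, dtype=np.int32)
```

The dtype must match on both sides; decoding int32 bytes as float64, for instance, would silently produce garbage.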
    def __getitem__(self, indices):
        i, j = indices
        if i < 0 or i >= self.numRows:
            raise IndexError("Row index %d is out of range [0, %d)"
                             % (i, self.numRows))
        if j < 0 or j >= self.numCols:
            raise IndexError("Column index %d is out of range [0, %d)"
                             % (j, self.numCols))

        # If a CSR matrix is given, then the row index should be searched
        # for in colPtrs, and the column index should be searched for in the
        # corresponding slice obtained from rowIndices.
        if self.isTransposed:
            j, i = i, j

        colStart = self.colPtrs[j]
        colEnd = self.colPtrs[j + 1]
        nz = self.rowIndices[colStart: colEnd]
        ind = np.searchsorted(nz, i) + colStart
        if ind < colEnd and self.rowIndices[ind] == i:
            return self.values[ind]
        else:
            return 0.0
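The lookup above is a binary search within one column's slice of `rowIndices`; any position without a stored entry is an implicit zero. A standalone sketch, using the same CSC triplet as the doctests in this class:

```python
import numpy as np

# CSC data for [[2., 0.], [3., 4.]].
colPtrs = np.array([0, 2, 3])
rowIndices = np.array([0, 1, 1])
values = np.array([2.0, 3.0, 4.0])

def lookup(i, j):
    # Binary-search row i inside column j's slice; absent entries are 0.0.
    colStart, colEnd = colPtrs[j], colPtrs[j + 1]
    nz = rowIndices[colStart:colEnd]
    ind = np.searchsorted(nz, i) + colStart
    if ind < colEnd and rowIndices[ind] == i:
        return values[ind]
    return 0.0
```

`lookup(0, 0)` finds the stored `2.0`, `lookup(1, 1)` finds `4.0`, and `lookup(0, 1)` falls through to the implicit `0.0`.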
    def toArray(self):
        """
        Return a numpy.ndarray
        """
        A = np.zeros((self.numRows, self.numCols), dtype=np.float64, order='F')
        for k in xrange(self.colPtrs.size - 1):
            startptr = self.colPtrs[k]
            endptr = self.colPtrs[k + 1]
            if self.isTransposed:
                A[k, self.rowIndices[startptr:endptr]] = self.values[startptr:endptr]
            else:
                A[self.rowIndices[startptr:endptr], k] = self.values[startptr:endptr]
        return A
    def toDense(self):
        """Convert to DenseMatrix"""
        densevals = np.ravel(self.toArray(), order='F')
        return DenseMatrix(self.numRows, self.numCols, densevals)
    def asML(self):
        """
        Convert this matrix to the new mllib-local representation.
        This does NOT copy the data; it copies references.

        :return: :py:class:`pyspark.ml.linalg.SparseMatrix`

        .. versionadded:: 2.0.0
        """
        return newlinalg.SparseMatrix(self.numRows, self.numCols, self.colPtrs, self.rowIndices,
                                      self.values, self.isTransposed)
    # TODO: More efficient implementation.
    def __eq__(self, other):
        return np.all(self.toArray() == other.toArray())
[SPARK-3964] [MLlib] [PySpark] add Hypothesis test Python API
```
pyspark.mllib.stat.Statistics.chiSqTest(observed, expected=None)
:: Experimental ::
If `observed` is a Vector, conduct Pearson's chi-squared goodness
of fit test of the observed data against the expected distribution,
or against the uniform distribution (by default), with each category
having an expected frequency of `1 / len(observed)`.
(Note: `observed` cannot contain negative values)
If `observed` is a matrix, conduct Pearson's independence test on the
input contingency matrix, which cannot contain negative entries or
columns or rows that sum up to 0.
If `observed` is an RDD of LabeledPoint, conduct Pearson's independence
test for every feature against the label across the input RDD.
For each feature, the (feature, label) pairs are converted into a
contingency matrix for which the chi-squared statistic is computed.
All label and feature values must be categorical.
:param observed: it could be a vector containing the observed categorical
counts/relative frequencies, or the contingency matrix
(containing either counts or relative frequencies),
or an RDD of LabeledPoint containing the labeled dataset
with categorical features. Real-valued features will be
treated as categorical for each distinct value.
:param expected: Vector containing the expected categorical counts/relative
frequencies. `expected` is rescaled if the `expected` sum
differs from the `observed` sum.
:return: ChiSquaredTest object containing the test statistic, degrees
of freedom, p-value, the method used, and the null hypothesis.
```
Author: Davies Liu <davies@databricks.com>
Closes #3091 from davies/his and squashes the following commits:
145d16c [Davies Liu] address comments
0ab0764 [Davies Liu] fix float
5097d54 [Davies Liu] add Hypothesis test Python API
2014-11-05 00:35:52 -05:00

class Matrices(object):
    @staticmethod
    def dense(numRows, numCols, values):
        """
        Create a DenseMatrix
        """
        return DenseMatrix(numRows, numCols, values)

    @staticmethod
    def sparse(numRows, numCols, colPtrs, rowIndices, values):
        """
        Create a SparseMatrix
        """
        return SparseMatrix(numRows, numCols, colPtrs, rowIndices, values)
    @staticmethod
    def fromML(mat):
        """
        Convert a matrix from the new mllib-local representation.
        This does NOT copy the data; it copies references.

        :param mat: a :py:class:`pyspark.ml.linalg.Matrix`
        :return: a :py:class:`pyspark.mllib.linalg.Matrix`

        .. versionadded:: 2.0.0
        """
        if isinstance(mat, newlinalg.DenseMatrix):
            return DenseMatrix(mat.numRows, mat.numCols, mat.values, mat.isTransposed)
        elif isinstance(mat, newlinalg.SparseMatrix):
            return SparseMatrix(mat.numRows, mat.numCols, mat.colPtrs, mat.rowIndices,
                                mat.values, mat.isTransposed)
        else:
            raise TypeError("Unsupported matrix type %s" % type(mat))
[SPARK-9656][MLLIB][PYTHON] Add missing methods to PySpark's Distributed Linear Algebra Classes
This PR adds the remaining group of methods to PySpark's distributed linear algebra classes as follows:
* `RowMatrix` <sup>**[1]**</sup>
1. `computeGramianMatrix`
2. `computeCovariance`
3. `computeColumnSummaryStatistics`
4. `columnSimilarities`
5. `tallSkinnyQR` <sup>**[2]**</sup>
* `IndexedRowMatrix` <sup>**[3]**</sup>
1. `computeGramianMatrix`
* `CoordinateMatrix`
1. `transpose`
* `BlockMatrix`
1. `validate`
2. `cache`
3. `persist`
4. `transpose`
**[1]**: Note: `multiply`, `computeSVD`, and `computePrincipalComponents` are already part of PR #7963 for SPARK-6227.
**[2]**: Implementing `tallSkinnyQR` uncovered a bug with our PySpark `RowMatrix` constructor. As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark. Although we are attempting to construct a `RowMatrix` from an `RDD[Vector]` in [PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115), the `Vector` type is erased, resulting in an `RDD[Object]`. Thus, when calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` in which an `Object` cannot be cast to a Spark `Vector`. As noted in the aforementioned dev list thread, this issue was also encountered with `DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a `Vector` type. Thus, this PR currently contains that fix applied to the `createRowMatrix` helper function in `PythonMLlibAPI`. `IndexedRowMatrix` and `CoordinateMatrix` do not appear to have this issue likely due to their related helper functions in `PythonMLlibAPI` creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types. However, this fix may be out of scope for this single PR, and it may be better suited in a separate JIRA/PR. Therefore, I have marked this PR as WIP and am open to discussion.
**[3]**: Note: `multiply` and `computeSVD` are already part of PR #7963 for SPARK-6227.
Author: Mike Dusenberry <mwdusenb@us.ibm.com>
Closes #9441 from dusenberrymw/SPARK-9656_Add_Missing_Methods_to_PySpark_Distributed_Linear_Algebra.
2016-04-27 13:48:05 -04:00

class QRDecomposition(object):
    """
    Represents QR factors.
    """
    def __init__(self, Q, R):
        self._Q = Q
        self._R = R

    @property
    @since('2.0.0')
    def Q(self):
        """
        An orthogonal matrix Q in a QR decomposition.
        May be null if not computed.
        """
        return self._Q

    @property
    @since('2.0.0')
    def R(self):
        """
        An upper triangular matrix R in a QR decomposition.
        """
        return self._R
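What a `QRDecomposition` holds can be illustrated with plain NumPy (the distributed analogue is produced by `RowMatrix.tallSkinnyQR`, per the PR description above): `A = Q @ R` with `Q` having orthonormal columns and `R` upper triangular.

```python
import numpy as np

# Reduced QR factorization of a tall-skinny 3x2 matrix.
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
Q, R = np.linalg.qr(A)  # Q is 3x2 with orthonormal columns, R is 2x2 upper triangular
```

The factors satisfy `Q @ R == A` (up to rounding), `Q.T @ Q` is the identity, and `R` equals its own upper triangle.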

def _test():
    import doctest
[SPARK-24740][PYTHON][ML] Make PySpark's tests compatible with NumPy 1.14+
## What changes were proposed in this pull request?
This PR proposes to make PySpark's tests compatible with NumPy 0.14+
NumPy 0.14.x introduced rather radical changes about its string representation.
For example, the tests below are failed:
```
**********************************************************************
File "/.../spark/python/pyspark/ml/linalg/__init__.py", line 895, in __main__.DenseMatrix.__str__
Failed example:
print(dm)
Expected:
DenseMatrix([[ 0., 2.],
[ 1., 3.]])
Got:
DenseMatrix([[0., 2.],
[1., 3.]])
**********************************************************************
File "/.../spark/python/pyspark/ml/linalg/__init__.py", line 899, in __main__.DenseMatrix.__str__
Failed example:
print(dm)
Expected:
DenseMatrix([[ 0., 1.],
[ 2., 3.]])
Got:
DenseMatrix([[0., 1.],
[2., 3.]])
**********************************************************************
File "/.../spark/python/pyspark/ml/linalg/__init__.py", line 939, in __main__.DenseMatrix.toArray
Failed example:
m.toArray()
Expected:
array([[ 0., 2.],
[ 1., 3.]])
Got:
array([[0., 2.],
[1., 3.]])
**********************************************************************
File "/.../spark/python/pyspark/ml/linalg/__init__.py", line 324, in __main__.DenseVector.dot
Failed example:
dense.dot(np.reshape([1., 2., 3., 4.], (2, 2), order='F'))
Expected:
array([ 5., 11.])
Got:
array([ 5., 11.])
**********************************************************************
File "/.../spark/python/pyspark/ml/linalg/__init__.py", line 567, in __main__.SparseVector.dot
Failed example:
a.dot(np.array([[1, 1], [2, 2], [3, 3], [4, 4]]))
Expected:
array([ 22., 22.])
Got:
array([22., 22.])
```
See [release note](https://docs.scipy.org/doc/numpy-1.14.0/release.html#compatibility-notes).
## How was this patch tested?
Manually tested:
```
$ ./run-tests --python-executables=python3.6,python2.7 --modules=pyspark-ml,pyspark-mllib
Running PySpark tests. Output is in /.../spark/python/unit-tests.log
Will test against the following Python executables: ['python3.6', 'python2.7']
Will test the following Python modules: ['pyspark-ml', 'pyspark-mllib']
Starting test(python2.7): pyspark.mllib.tests
Starting test(python2.7): pyspark.ml.classification
Starting test(python3.6): pyspark.mllib.tests
Starting test(python2.7): pyspark.ml.clustering
Finished test(python2.7): pyspark.ml.clustering (54s)
Starting test(python2.7): pyspark.ml.evaluation
Finished test(python2.7): pyspark.ml.classification (74s)
Starting test(python2.7): pyspark.ml.feature
Finished test(python2.7): pyspark.ml.evaluation (27s)
Starting test(python2.7): pyspark.ml.fpm
Finished test(python2.7): pyspark.ml.fpm (0s)
Starting test(python2.7): pyspark.ml.image
Finished test(python2.7): pyspark.ml.image (17s)
Starting test(python2.7): pyspark.ml.linalg.__init__
Finished test(python2.7): pyspark.ml.linalg.__init__ (1s)
Starting test(python2.7): pyspark.ml.recommendation
Finished test(python2.7): pyspark.ml.feature (76s)
Starting test(python2.7): pyspark.ml.regression
Finished test(python2.7): pyspark.ml.recommendation (69s)
Starting test(python2.7): pyspark.ml.stat
Finished test(python2.7): pyspark.ml.regression (45s)
Starting test(python2.7): pyspark.ml.tests
Finished test(python2.7): pyspark.ml.stat (28s)
Starting test(python2.7): pyspark.ml.tuning
Finished test(python2.7): pyspark.ml.tuning (20s)
Starting test(python2.7): pyspark.mllib.classification
Finished test(python2.7): pyspark.mllib.classification (31s)
Starting test(python2.7): pyspark.mllib.clustering
Finished test(python2.7): pyspark.mllib.tests (260s)
Starting test(python2.7): pyspark.mllib.evaluation
Finished test(python3.6): pyspark.mllib.tests (266s)
Starting test(python2.7): pyspark.mllib.feature
Finished test(python2.7): pyspark.mllib.evaluation (21s)
Starting test(python2.7): pyspark.mllib.fpm
Finished test(python2.7): pyspark.mllib.feature (38s)
Starting test(python2.7): pyspark.mllib.linalg.__init__
Finished test(python2.7): pyspark.mllib.linalg.__init__ (1s)
Starting test(python2.7): pyspark.mllib.linalg.distributed
Finished test(python2.7): pyspark.mllib.fpm (34s)
Starting test(python2.7): pyspark.mllib.random
Finished test(python2.7): pyspark.mllib.clustering (64s)
Starting test(python2.7): pyspark.mllib.recommendation
Finished test(python2.7): pyspark.mllib.random (15s)
Starting test(python2.7): pyspark.mllib.regression
Finished test(python2.7): pyspark.mllib.linalg.distributed (47s)
Starting test(python2.7): pyspark.mllib.stat.KernelDensity
Finished test(python2.7): pyspark.mllib.stat.KernelDensity (0s)
Starting test(python2.7): pyspark.mllib.stat._statistics
Finished test(python2.7): pyspark.mllib.recommendation (40s)
Starting test(python2.7): pyspark.mllib.tree
Finished test(python2.7): pyspark.mllib.regression (38s)
Starting test(python2.7): pyspark.mllib.util
Finished test(python2.7): pyspark.mllib.stat._statistics (19s)
Starting test(python3.6): pyspark.ml.classification
Finished test(python2.7): pyspark.mllib.tree (26s)
Starting test(python3.6): pyspark.ml.clustering
Finished test(python2.7): pyspark.mllib.util (27s)
Starting test(python3.6): pyspark.ml.evaluation
Finished test(python3.6): pyspark.ml.evaluation (30s)
Starting test(python3.6): pyspark.ml.feature
Finished test(python2.7): pyspark.ml.tests (234s)
Starting test(python3.6): pyspark.ml.fpm
Finished test(python3.6): pyspark.ml.fpm (1s)
Starting test(python3.6): pyspark.ml.image
Finished test(python3.6): pyspark.ml.clustering (55s)
Starting test(python3.6): pyspark.ml.linalg.__init__
Finished test(python3.6): pyspark.ml.linalg.__init__ (0s)
Starting test(python3.6): pyspark.ml.recommendation
Finished test(python3.6): pyspark.ml.classification (71s)
Starting test(python3.6): pyspark.ml.regression
Finished test(python3.6): pyspark.ml.image (18s)
Starting test(python3.6): pyspark.ml.stat
Finished test(python3.6): pyspark.ml.stat (37s)
Starting test(python3.6): pyspark.ml.tests
Finished test(python3.6): pyspark.ml.regression (59s)
Starting test(python3.6): pyspark.ml.tuning
Finished test(python3.6): pyspark.ml.feature (93s)
Starting test(python3.6): pyspark.mllib.classification
Finished test(python3.6): pyspark.ml.recommendation (83s)
Starting test(python3.6): pyspark.mllib.clustering
Finished test(python3.6): pyspark.ml.tuning (29s)
Starting test(python3.6): pyspark.mllib.evaluation
Finished test(python3.6): pyspark.mllib.evaluation (26s)
Starting test(python3.6): pyspark.mllib.feature
Finished test(python3.6): pyspark.mllib.classification (43s)
Starting test(python3.6): pyspark.mllib.fpm
Finished test(python3.6): pyspark.mllib.clustering (81s)
Starting test(python3.6): pyspark.mllib.linalg.__init__
Finished test(python3.6): pyspark.mllib.linalg.__init__ (2s)
Starting test(python3.6): pyspark.mllib.linalg.distributed
Finished test(python3.6): pyspark.mllib.fpm (48s)
Starting test(python3.6): pyspark.mllib.random
Finished test(python3.6): pyspark.mllib.feature (54s)
Starting test(python3.6): pyspark.mllib.recommendation
Finished test(python3.6): pyspark.mllib.random (18s)
Starting test(python3.6): pyspark.mllib.regression
Finished test(python3.6): pyspark.mllib.linalg.distributed (55s)
Starting test(python3.6): pyspark.mllib.stat.KernelDensity
Finished test(python3.6): pyspark.mllib.stat.KernelDensity (1s)
Starting test(python3.6): pyspark.mllib.stat._statistics
Finished test(python3.6): pyspark.mllib.recommendation (51s)
Starting test(python3.6): pyspark.mllib.tree
Finished test(python3.6): pyspark.mllib.regression (45s)
Starting test(python3.6): pyspark.mllib.util
Finished test(python3.6): pyspark.mllib.stat._statistics (21s)
Finished test(python3.6): pyspark.mllib.tree (27s)
Finished test(python3.6): pyspark.mllib.util (27s)
Finished test(python3.6): pyspark.ml.tests (264s)
```
Author: hyukjinkwon <gurwls223@apache.org>
Closes #21715 from HyukjinKwon/SPARK-24740.
2018-07-06 23:39:29 -04:00
    import numpy
    try:
        # NumPy 1.14+ changed its string format.
        numpy.set_printoptions(legacy='1.13')
    except TypeError:
        pass
[WIP] SPARK-1430: Support sparse data in Python MLlib
This PR adds a SparseVector class in PySpark and updates all the regression, classification and clustering algorithms and models to support sparse data, similar to MLlib. I chose to add this class because SciPy is quite difficult to install in many environments (more so than NumPy), but I plan to add support for SciPy sparse vectors later too, and make the methods work transparently on objects of either type.
On the Scala side, we keep Python sparse vectors sparse and pass them to MLlib. We always return dense vectors from our models.
Some to-do items left:
- [x] Support SciPy's scipy.sparse matrix objects when SciPy is available. We can easily add a function to convert these to our own SparseVector.
- [x] MLlib currently uses a vector with one extra column on the left to represent what we call LabeledPoint in Scala. Do we really want this? It may get annoying once you deal with sparse data since you must add/subtract 1 to each feature index when training. We can remove this API in 1.0 and use tuples for labeling.
- [x] Explain how to use these in the Python MLlib docs.
CC @mengxr, @joshrosen
Author: Matei Zaharia <matei@databricks.com>
Closes #341 from mateiz/py-ml-update and squashes the following commits:
d52e763 [Matei Zaharia] Remove no-longer-needed slice code and handle review comments
ea5a25a [Matei Zaharia] Fix remaining uses of copyto() after merge
b9f97a3 [Matei Zaharia] Fix test
1e1bd0f [Matei Zaharia] Add MLlib logistic regression example in Python
88bc01f [Matei Zaharia] Clean up inheritance of LinearModel in Python, and expose its parametrs
37ab747 [Matei Zaharia] Fix some examples and docs due to changes in MLlib API
da0f27e [Matei Zaharia] Added a MLlib K-means example and updated docs to discuss sparse data
c48e85a [Matei Zaharia] Added some tests for passing lists as input, and added mllib/tests.py to run-tests script.
a07ba10 [Matei Zaharia] Fix some typos and calculation of initial weights
74eefe7 [Matei Zaharia] Added LabeledPoint class in Python
889dde8 [Matei Zaharia] Support scipy.sparse matrices in all our algorithms and models
ab244d1 [Matei Zaharia] Allow SparseVectors to be initialized using a dict
a5d6426 [Matei Zaharia] Add linalg.py to run-tests script
0e7a3d8 [Matei Zaharia] Keep vectors sparse in Java when reading LabeledPoints
eaee759 [Matei Zaharia] Update regression, classification and clustering models for sparse data
2abbb44 [Matei Zaharia] Further work to get linear models working with sparse data
154f45d [Matei Zaharia] Update docs, name some magic values
881fef7 [Matei Zaharia] Added a sparse vector in Python and made Java-Python format more compact
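The "more compact" Java-Python format mentioned in the last commit is not spelled out in this log. As a purely hypothetical sketch of the idea, a sparse vector can be serialized as a small header plus parallel int/double arrays instead of a full dense array of doubles (the layout and function names below are illustrative, not Spark's actual wire format):

```python
import struct

def serialize_sparse(size, indices, values):
    # Hypothetical layout: [int32 size][int32 nnz][int32 indices...][float64 values...]
    nnz = len(indices)
    return (struct.pack(">ii", size, nnz)
            + struct.pack(">%di" % nnz, *indices)
            + struct.pack(">%dd" % nnz, *values))

def deserialize_sparse(data):
    size, nnz = struct.unpack(">ii", data[:8])
    indices = struct.unpack(">%di" % nnz, data[8:8 + 4 * nnz])
    values = struct.unpack(">%dd" % nnz, data[8 + 4 * nnz:])
    return size, list(indices), list(values)


blob = serialize_sparse(1000, [3, 17], [1.0, -2.5])
print(len(blob))  # 32 bytes here, vs. 8000 for a dense float64 array of size 1000
```

Storing only the nonzeros is what makes the format pay off: the cost scales with the number of stored entries rather than the vector's logical size.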
2014-04-15 23:33:24 -04:00
    (failure_count, test_count) = doctest.testmod(optionflags=doctest.ELLIPSIS)
    if failure_count:
        sys.exit(-1)
if __name__ == "__main__":
    _test()