Matei Zaharia 63ca581d9c [WIP] SPARK-1430: Support sparse data in Python MLlib

This PR adds a SparseVector class in PySpark and updates all the regression, classification and clustering algorithms and models to support sparse data, similar to MLlib. I chose to add this class because SciPy is quite difficult to install in many environments (more so than NumPy), but I plan to add support for SciPy sparse vectors later too, and make the methods work transparently on objects of either type.

On the Scala side, we keep Python sparse vectors sparse and pass them to MLlib. We always return dense vectors from our models.

Some to-do items left:
- [x] Support SciPy's scipy.sparse matrix objects when SciPy is available. We can easily add a function to convert these to our own SparseVector.
- [x] MLlib currently uses a vector with one extra column on the left to represent what we call LabeledPoint in Scala. Do we really want this? It may get annoying once you deal with sparse data since you must add/subtract 1 to each feature index when training. We can remove this API in 1.0 and use tuples for labeling.
- [x] Explain how to use these in the Python MLlib docs.

CC @mengxr, @joshrosen

Author: Matei Zaharia <matei@databricks.com>

Closes #341 from mateiz/py-ml-update and squashes the following commits:

d52e763 [Matei Zaharia] Remove no-longer-needed slice code and handle review comments
ea5a25a [Matei Zaharia] Fix remaining uses of copyto() after merge
b9f97a3 [Matei Zaharia] Fix test
1e1bd0f [Matei Zaharia] Add MLlib logistic regression example in Python
88bc01f [Matei Zaharia] Clean up inheritance of LinearModel in Python, and expose its parametrs
37ab747 [Matei Zaharia] Fix some examples and docs due to changes in MLlib API
da0f27e [Matei Zaharia] Added a MLlib K-means example and updated docs to discuss sparse data
c48e85a [Matei Zaharia] Added some tests for passing lists as input, and added mllib/tests.py to run-tests script.
a07ba10 [Matei Zaharia] Fix some typos and calculation of initial weights
74eefe7 [Matei Zaharia] Added LabeledPoint class in Python
889dde8 [Matei Zaharia] Support scipy.sparse matrices in all our algorithms and models
ab244d1 [Matei Zaharia] Allow SparseVectors to be initialized using a dict
a5d6426 [Matei Zaharia] Add linalg.py to run-tests script
0e7a3d8 [Matei Zaharia] Keep vectors sparse in Java when reading LabeledPoints
eaee759 [Matei Zaharia] Update regression, classification and clustering models for sparse data
2abbb44 [Matei Zaharia] Further work to get linear models working with sparse data
154f45d [Matei Zaharia] Update docs, name some magic values
881fef7 [Matei Zaharia] Added a sparse vector in Python and made Java-Python format more compact

2014-04-15 20:33:24 -07:00

2.9 KiB

layout	title
global	Machine Learning Library (MLlib)

MLlib is a Spark implementation of some common machine learning (ML) functionality, as well as associated tests and data generators. MLlib currently supports four common types of machine learning problem settings: classification, regression, clustering, and collaborative filtering. It also provides an underlying gradient descent optimization primitive and several linear algebra methods.

Available Methods

The following links provide a detailed explanation of the methods and usage examples for each of them:

- Classification and Regression
  - Binary Classification
    - SVM (L1 and L2 regularized)
    - Logistic Regression (L1 and L2 regularized)
  - Linear Regression
    - Least Squares
    - Lasso
    - Ridge Regression
  - Decision Tree (for classification and regression)
- Clustering
  - k-Means
- Collaborative Filtering
  - Matrix Factorization using Alternating Least Squares
- Optimization
  - Gradient Descent and Stochastic Gradient Descent
- Linear Algebra
  - Singular Value Decomposition
  - Principal Component Analysis

Data Types

Most MLlib algorithms operate on RDDs containing vectors. In Java and Scala, the Vector class is used to represent vectors. You can create either dense or sparse vectors using the Vectors factory.

In Python, MLlib can take the following vector types:

- NumPy arrays
- Standard Python lists (e.g. [1, 2, 3])
- The MLlib SparseVector class
- SciPy sparse matrices

For efficiency, we recommend using NumPy arrays over lists, and using the CSC format for SciPy matrices, or MLlib's own SparseVector class.
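
To make the dense versus sparse distinction concrete, here is a minimal pure-Python sketch of what a sparse representation stores and why it can save work. It is an illustrative stand-in only: `to_sparse` and `sparse_dot` are hypothetical helpers, not part of MLlib, and MLlib's own SparseVector class should be used in real code.

```python
# Illustrative stand-in only: this is NOT MLlib's SparseVector class.
def to_sparse(dense):
    """Return (size, {index: value}), keeping only the non-zero entries."""
    return len(dense), {i: v for i, v in enumerate(dense) if v != 0}


def sparse_dot(sparse, weights):
    """Dot product of a (size, {index: value}) pair with a dense list of weights."""
    size, entries = sparse
    if size != len(weights):
        raise ValueError("dimension mismatch")
    # Only the stored non-zero entries contribute to the sum.
    return sum(value * weights[index] for index, value in entries.items())


dense = [0.0, 3.0, 0.0, 0.0, 1.5]
weights = [1.0, 2.0, 3.0, 4.0, 5.0]
sparse = to_sparse(dense)

# Both forms describe the same vector, so the two results agree.
dense_result = sum(x * w for x, w in zip(dense, weights))
sparse_result = sparse_dot(sparse, weights)
```

The sparse form touches two entries instead of five here; the gap grows with the number of zero features.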

Several other simple data types are used throughout the library, e.g. the LabeledPoint class (Java/Scala, Python) for labeled data.
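
The commit message at the top of this page notes that the earlier layout stored the label as an extra leading column, which shifts every feature index by one. The sketch below illustrates that shift and the alternative of keeping the label separate from the features. `LabeledExample` and `from_leading_label_row` are hypothetical names used for illustration; they are not MLlib's LabeledPoint API.

```python
from collections import namedtuple

# Hypothetical stand-in for illustration; MLlib provides its own LabeledPoint class.
LabeledExample = namedtuple("LabeledExample", ["label", "features"])


def from_leading_label_row(row):
    """Split a row whose first column is the label into (label, features)."""
    return LabeledExample(label=row[0], features=list(row[1:]))


row = [1.0, 0.0, 2.5, 0.0]  # label 1.0 followed by three features
example = from_leading_label_row(row)

# In the combined row, feature 1 sits at column 2.
# Once the label is split off, the same value is simply features[1].
column_in_row = 2
same_value = row[column_in_row] == example.features[column_in_row - 1]
```

Keeping the label separate means feature indices in sparse data can be used as-is, without adding or subtracting 1.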

Dependencies

MLlib uses the jblas linear algebra library, which itself depends on native Fortran routines. You may need to install the gfortran runtime library if it is not already present on your nodes. MLlib will throw a linking error if it cannot detect these libraries automatically.

To use MLlib in Python, you will need NumPy version 1.4 or newer.
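
One way to check the NumPy requirement before submitting a job is to compare the installed version string against the minimum. The helper below is a sketch under that assumption; `meets_minimum` is a hypothetical name, and it only compares the leading numeric components of a dotted version string.

```python
def meets_minimum(version, minimum=(1, 4)):
    """Return True if a dotted version string is at least `minimum`."""
    parts = []
    for piece in version.split(".")[:len(minimum)]:
        # Keep only the leading digits, so "0rc1" is read as 0.
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    # Pad short versions such as "2" so the tuple comparison is well defined.
    while len(parts) < len(minimum):
        parts.append(0)
    return tuple(parts) >= minimum


checks = {v: meets_minimum(v) for v in ["1.3.0", "1.4", "1.10.2", "2.0.0rc1"]}
```

On a machine with NumPy installed, the same check would be `meets_minimum(numpy.__version__)` after `import numpy`.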
