2013-08-31 17:21:10 -04:00
---
layout: global
title: Machine Learning Library (MLlib)
---
2014-08-12 20:15:21 -04:00
MLlib is Spark's scalable machine learning library consisting of common learning algorithms and utilities,
2014-04-22 14:20:47 -04:00
including classification, regression, clustering, collaborative
2014-08-12 20:15:21 -04:00
filtering, dimensionality reduction, as well as underlying optimization primitives, as outlined below:
2014-01-03 19:38:33 -05:00
2014-08-12 20:15:21 -04:00
* [Data types ](mllib-basics.html )
* [Basic statistics ](mllib-stats.html )
2014-08-19 19:06:48 -04:00
* random data generation
2014-08-12 20:15:21 -04:00
* stratified sampling
2014-04-22 14:20:47 -04:00
* summary statistics
2014-08-12 20:15:21 -04:00
* hypothesis testing
* [Classification and regression ](mllib-classification-regression.html )
* [linear models (SVMs, logistic regression, linear regression) ](mllib-linear-methods.html )
* [decision trees ](mllib-decision-tree.html )
2014-04-22 14:20:47 -04:00
* [naive Bayes ](mllib-naive-bayes.html )
* [Collaborative filtering ](mllib-collaborative-filtering.html )
* alternating least squares (ALS)
* [Clustering ](mllib-clustering.html )
* k-means
* [Dimensionality reduction ](mllib-dimensionality-reduction.html )
* singular value decomposition (SVD)
* principal component analysis (PCA)
2014-08-12 20:15:21 -04:00
* [Feature extraction and transformation ](mllib-feature-extraction.html )
* [Optimization (developer) ](mllib-optimization.html )
2014-04-22 14:20:47 -04:00
* stochastic gradient descent
* limited-memory BFGS (L-BFGS)
2014-08-12 20:15:21 -04:00
MLlib is under active development.
2014-05-18 20:00:57 -04:00
The APIs marked `Experimental` /`DeveloperApi` may change in future releases,
2014-08-12 20:15:21 -04:00
and the migration guide below will explain all changes between releases.
2014-04-22 14:20:47 -04:00
[SPARK-1566] consolidate programming guide, and general doc updates
This is a fairly large PR to clean up and update the docs for 1.0. The major changes are:
* A unified programming guide for all languages replaces language-specific ones and shows language-specific info in tabs
* New programming guide sections on key-value pairs, unit testing, input formats beyond text, migrating from 0.9, and passing functions to Spark
* Spark-submit guide moved to a separate page and expanded slightly
* Various cleanups of the menu system, security docs, and others
* Updated look of title bar to differentiate the docs from previous Spark versions
You can find the updated docs at http://people.apache.org/~matei/1.0-docs/_site/ and in particular http://people.apache.org/~matei/1.0-docs/_site/programming-guide.html.
Author: Matei Zaharia <matei@databricks.com>
Closes #896 from mateiz/1.0-docs and squashes the following commits:
03e6853 [Matei Zaharia] Some tweaks to configuration and YARN docs
0779508 [Matei Zaharia] tweak
ef671d4 [Matei Zaharia] Keep frames in JavaDoc links, and other small tweaks
1bf4112 [Matei Zaharia] Review comments
4414f88 [Matei Zaharia] tweaks
d04e979 [Matei Zaharia] Fix some old links to Java guide
a34ed33 [Matei Zaharia] tweak
541bb3b [Matei Zaharia] miscellaneous changes
fcefdec [Matei Zaharia] Moved submitting apps to separate doc
61d72b4 [Matei Zaharia] stuff
181f217 [Matei Zaharia] migration guide, remove old language guides
e11a0da [Matei Zaharia] Add more API functions
6a030a9 [Matei Zaharia] tweaks
8db0ae3 [Matei Zaharia] Added key-value pairs section
318d2c9 [Matei Zaharia] tweaks
1c81477 [Matei Zaharia] New section on basics and function syntax
e38f559 [Matei Zaharia] Actually added programming guide to Git
a33d6fe [Matei Zaharia] First pass at updating programming guide to support all languages, plus other tweaks throughout
3b6a876 [Matei Zaharia] More CSS tweaks
01ec8bf [Matei Zaharia] More CSS tweaks
e6d252e [Matei Zaharia] Change color of doc title bar to differentiate from 0.9.0
2014-05-30 03:34:33 -04:00
# Dependencies
2014-04-22 14:20:47 -04:00
2014-08-12 20:15:21 -04:00
MLlib uses the linear algebra package [Breeze ](http://www.scalanlp.org/ ), which depends on
2014-04-22 14:20:47 -04:00
[netlib-java ](https://github.com/fommil/netlib-java ), and
[jblas ](https://github.com/mikiobraun/jblas ).
`netlib-java` and `jblas` depend on native Fortran routines.
You need to install the
[gfortran runtime library ](https://github.com/mikiobraun/jblas/wiki/Missing-Libraries ) if it is not
already present on your nodes. MLlib will throw a linking error if it cannot detect these libraries
automatically. Due to license issues, we do not include `netlib-java` 's native libraries in MLlib's
dependency set. If no native library is available at runtime, you will see a warning message. To
use native libraries from `netlib-java` , please include artifact
`com.github.fommil.netlib:all:1.1.2` as a dependency of your project or build your own (see
[instructions ](https://github.com/fommil/netlib-java/blob/master/README.md#machine-optimised-system-libraries )).
2013-09-10 00:45:04 -04:00
2014-04-15 03:19:43 -04:00
To use MLlib in Python, you will need [NumPy ](http://www.numpy.org ) version 1.4 or newer.
2014-04-22 14:20:47 -04:00
---
[SPARK-1566] consolidate programming guide, and general doc updates
This is a fairly large PR to clean up and update the docs for 1.0. The major changes are:
* A unified programming guide for all languages replaces language-specific ones and shows language-specific info in tabs
* New programming guide sections on key-value pairs, unit testing, input formats beyond text, migrating from 0.9, and passing functions to Spark
* Spark-submit guide moved to a separate page and expanded slightly
* Various cleanups of the menu system, security docs, and others
* Updated look of title bar to differentiate the docs from previous Spark versions
You can find the updated docs at http://people.apache.org/~matei/1.0-docs/_site/ and in particular http://people.apache.org/~matei/1.0-docs/_site/programming-guide.html.
Author: Matei Zaharia <matei@databricks.com>
Closes #896 from mateiz/1.0-docs and squashes the following commits:
03e6853 [Matei Zaharia] Some tweaks to configuration and YARN docs
0779508 [Matei Zaharia] tweak
ef671d4 [Matei Zaharia] Keep frames in JavaDoc links, and other small tweaks
1bf4112 [Matei Zaharia] Review comments
4414f88 [Matei Zaharia] tweaks
d04e979 [Matei Zaharia] Fix some old links to Java guide
a34ed33 [Matei Zaharia] tweak
541bb3b [Matei Zaharia] miscellaneous changes
fcefdec [Matei Zaharia] Moved submitting apps to separate doc
61d72b4 [Matei Zaharia] stuff
181f217 [Matei Zaharia] migration guide, remove old language guides
e11a0da [Matei Zaharia] Add more API functions
6a030a9 [Matei Zaharia] tweaks
8db0ae3 [Matei Zaharia] Added key-value pairs section
318d2c9 [Matei Zaharia] tweaks
1c81477 [Matei Zaharia] New section on basics and function syntax
e38f559 [Matei Zaharia] Actually added programming guide to Git
a33d6fe [Matei Zaharia] First pass at updating programming guide to support all languages, plus other tweaks throughout
3b6a876 [Matei Zaharia] More CSS tweaks
01ec8bf [Matei Zaharia] More CSS tweaks
e6d252e [Matei Zaharia] Change color of doc title bar to differentiate from 0.9.0
2014-05-30 03:34:33 -04:00
# Migration Guide
2014-04-22 14:20:47 -04:00
[SPARK-1566] consolidate programming guide, and general doc updates
This is a fairly large PR to clean up and update the docs for 1.0. The major changes are:
* A unified programming guide for all languages replaces language-specific ones and shows language-specific info in tabs
* New programming guide sections on key-value pairs, unit testing, input formats beyond text, migrating from 0.9, and passing functions to Spark
* Spark-submit guide moved to a separate page and expanded slightly
* Various cleanups of the menu system, security docs, and others
* Updated look of title bar to differentiate the docs from previous Spark versions
You can find the updated docs at http://people.apache.org/~matei/1.0-docs/_site/ and in particular http://people.apache.org/~matei/1.0-docs/_site/programming-guide.html.
Author: Matei Zaharia <matei@databricks.com>
Closes #896 from mateiz/1.0-docs and squashes the following commits:
03e6853 [Matei Zaharia] Some tweaks to configuration and YARN docs
0779508 [Matei Zaharia] tweak
ef671d4 [Matei Zaharia] Keep frames in JavaDoc links, and other small tweaks
1bf4112 [Matei Zaharia] Review comments
4414f88 [Matei Zaharia] tweaks
d04e979 [Matei Zaharia] Fix some old links to Java guide
a34ed33 [Matei Zaharia] tweak
541bb3b [Matei Zaharia] miscellaneous changes
fcefdec [Matei Zaharia] Moved submitting apps to separate doc
61d72b4 [Matei Zaharia] stuff
181f217 [Matei Zaharia] migration guide, remove old language guides
e11a0da [Matei Zaharia] Add more API functions
6a030a9 [Matei Zaharia] tweaks
8db0ae3 [Matei Zaharia] Added key-value pairs section
318d2c9 [Matei Zaharia] tweaks
1c81477 [Matei Zaharia] New section on basics and function syntax
e38f559 [Matei Zaharia] Actually added programming guide to Git
a33d6fe [Matei Zaharia] First pass at updating programming guide to support all languages, plus other tweaks throughout
3b6a876 [Matei Zaharia] More CSS tweaks
01ec8bf [Matei Zaharia] More CSS tweaks
e6d252e [Matei Zaharia] Change color of doc title bar to differentiate from 0.9.0
2014-05-30 03:34:33 -04:00
## From 0.9 to 1.0
2014-04-22 14:20:47 -04:00
In MLlib v1.0, we support both dense and sparse input in a unified way, which introduces a few
breaking changes. If your data is sparse, please store it in a sparse format instead of dense to
2014-08-12 20:15:21 -04:00
take advantage of sparsity in both storage and computation. Details are described below.
2014-04-22 14:20:47 -04:00
< div class = "codetabs" >
< div data-lang = "scala" markdown = "1" >
We used to represent a feature vector by `Array[Double]` , which is replaced by
2014-05-18 20:00:57 -04:00
[`Vector` ](api/scala/index.html#org.apache.spark.mllib.linalg.Vector ) in v1.0. Algorithms that used
2014-04-22 14:20:47 -04:00
to accept `RDD[Array[Double]]` now take
2014-05-18 20:00:57 -04:00
`RDD[Vector]` . [`LabeledPoint` ](api/scala/index.html#org.apache.spark.mllib.regression.LabeledPoint )
2014-04-22 14:20:47 -04:00
is now a wrapper of `(Double, Vector)` instead of `(Double, Array[Double])` . Converting
`Array[Double]` to `Vector` is straightforward:
{% highlight scala %}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
val array: Array[Double] = ... // a double array
val vector: Vector = Vectors.dense(array) // a dense vector
{% endhighlight %}
2014-05-18 20:00:57 -04:00
[`Vectors` ](api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$ ) provides factory methods to create sparse vectors.
2014-04-22 14:20:47 -04:00
*Note*. Scala imports `scala.collection.immutable.Vector` by default, so you have to import `org.apache.spark.mllib.linalg.Vector` explicitly to use MLlib's `Vector` .
< / div >
< div data-lang = "java" markdown = "1" >
We used to represent a feature vector by `double[]` , which is replaced by
[SPARK-1566] consolidate programming guide, and general doc updates
This is a fairly large PR to clean up and update the docs for 1.0. The major changes are:
* A unified programming guide for all languages replaces language-specific ones and shows language-specific info in tabs
* New programming guide sections on key-value pairs, unit testing, input formats beyond text, migrating from 0.9, and passing functions to Spark
* Spark-submit guide moved to a separate page and expanded slightly
* Various cleanups of the menu system, security docs, and others
* Updated look of title bar to differentiate the docs from previous Spark versions
You can find the updated docs at http://people.apache.org/~matei/1.0-docs/_site/ and in particular http://people.apache.org/~matei/1.0-docs/_site/programming-guide.html.
Author: Matei Zaharia <matei@databricks.com>
Closes #896 from mateiz/1.0-docs and squashes the following commits:
03e6853 [Matei Zaharia] Some tweaks to configuration and YARN docs
0779508 [Matei Zaharia] tweak
ef671d4 [Matei Zaharia] Keep frames in JavaDoc links, and other small tweaks
1bf4112 [Matei Zaharia] Review comments
4414f88 [Matei Zaharia] tweaks
d04e979 [Matei Zaharia] Fix some old links to Java guide
a34ed33 [Matei Zaharia] tweak
541bb3b [Matei Zaharia] miscellaneous changes
fcefdec [Matei Zaharia] Moved submitting apps to separate doc
61d72b4 [Matei Zaharia] stuff
181f217 [Matei Zaharia] migration guide, remove old language guides
e11a0da [Matei Zaharia] Add more API functions
6a030a9 [Matei Zaharia] tweaks
8db0ae3 [Matei Zaharia] Added key-value pairs section
318d2c9 [Matei Zaharia] tweaks
1c81477 [Matei Zaharia] New section on basics and function syntax
e38f559 [Matei Zaharia] Actually added programming guide to Git
a33d6fe [Matei Zaharia] First pass at updating programming guide to support all languages, plus other tweaks throughout
3b6a876 [Matei Zaharia] More CSS tweaks
01ec8bf [Matei Zaharia] More CSS tweaks
e6d252e [Matei Zaharia] Change color of doc title bar to differentiate from 0.9.0
2014-05-30 03:34:33 -04:00
[`Vector` ](api/java/index.html?org/apache/spark/mllib/linalg/Vector.html ) in v1.0. Algorithms that used
2014-04-22 14:20:47 -04:00
to accept `RDD<double[]>` now take
[SPARK-1566] consolidate programming guide, and general doc updates
This is a fairly large PR to clean up and update the docs for 1.0. The major changes are:
* A unified programming guide for all languages replaces language-specific ones and shows language-specific info in tabs
* New programming guide sections on key-value pairs, unit testing, input formats beyond text, migrating from 0.9, and passing functions to Spark
* Spark-submit guide moved to a separate page and expanded slightly
* Various cleanups of the menu system, security docs, and others
* Updated look of title bar to differentiate the docs from previous Spark versions
You can find the updated docs at http://people.apache.org/~matei/1.0-docs/_site/ and in particular http://people.apache.org/~matei/1.0-docs/_site/programming-guide.html.
Author: Matei Zaharia <matei@databricks.com>
Closes #896 from mateiz/1.0-docs and squashes the following commits:
03e6853 [Matei Zaharia] Some tweaks to configuration and YARN docs
0779508 [Matei Zaharia] tweak
ef671d4 [Matei Zaharia] Keep frames in JavaDoc links, and other small tweaks
1bf4112 [Matei Zaharia] Review comments
4414f88 [Matei Zaharia] tweaks
d04e979 [Matei Zaharia] Fix some old links to Java guide
a34ed33 [Matei Zaharia] tweak
541bb3b [Matei Zaharia] miscellaneous changes
fcefdec [Matei Zaharia] Moved submitting apps to separate doc
61d72b4 [Matei Zaharia] stuff
181f217 [Matei Zaharia] migration guide, remove old language guides
e11a0da [Matei Zaharia] Add more API functions
6a030a9 [Matei Zaharia] tweaks
8db0ae3 [Matei Zaharia] Added key-value pairs section
318d2c9 [Matei Zaharia] tweaks
1c81477 [Matei Zaharia] New section on basics and function syntax
e38f559 [Matei Zaharia] Actually added programming guide to Git
a33d6fe [Matei Zaharia] First pass at updating programming guide to support all languages, plus other tweaks throughout
3b6a876 [Matei Zaharia] More CSS tweaks
01ec8bf [Matei Zaharia] More CSS tweaks
e6d252e [Matei Zaharia] Change color of doc title bar to differentiate from 0.9.0
2014-05-30 03:34:33 -04:00
`RDD<Vector>` . [`LabeledPoint` ](api/java/index.html?org/apache/spark/mllib/regression/LabeledPoint.html )
2014-04-22 14:20:47 -04:00
is now a wrapper of `(double, Vector)` instead of `(double, double[])` . Converting `double[]` to
`Vector` is straightforward:
{% highlight java %}
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
double[] array = ... // a double array
2014-05-06 23:07:22 -04:00
Vector vector = Vectors.dense(array); // a dense vector
2014-04-22 14:20:47 -04:00
{% endhighlight %}
2014-05-18 20:00:57 -04:00
[`Vectors` ](api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$ ) provides factory methods to
2014-04-22 14:20:47 -04:00
create sparse vectors.
< / div >
< div data-lang = "python" markdown = "1" >
We used to represent a labeled feature vector in a NumPy array, where the first entry corresponds to
the label and the rest are features. This representation is replaced by class
2014-05-18 20:00:57 -04:00
[`LabeledPoint` ](api/python/pyspark.mllib.regression.LabeledPoint-class.html ), which takes both
2014-04-22 14:20:47 -04:00
dense and sparse feature vectors.
{% highlight python %}
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint
# Create a labeled point with a positive label and a dense feature vector.
pos = LabeledPoint(1.0, [1.0, 0.0, 3.0])
# Create a labeled point with a negative label and a sparse feature vector.
neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0]))
{% endhighlight %}
< / div >
< / div >