---
layout: global
title: MLlib
displayTitle: Machine Learning Library (MLlib) Guide
description: MLlib machine learning library overview for Spark SPARK_VERSION_SHORT
---

MLlib is Spark's scalable machine learning library consisting of common learning algorithms and utilities,
including classification, regression, clustering, collaborative
filtering, and dimensionality reduction, as well as underlying optimization primitives.
Guides for individual algorithms are listed below.

The API is divided into two parts:

* [The original `spark.mllib` API](mllib-guide.html#mllib-types-algorithms-and-utilities) is the primary API.
* [The "Pipelines" `spark.ml` API](mllib-guide.html#sparkml-high-level-apis-for-ml-pipelines) is a higher-level API for constructing ML workflows.

We list major functionality from both below, with links to detailed guides.

# MLlib types, algorithms and utilities

This lists functionality included in `spark.mllib`, the main MLlib API.

* [Data types](mllib-data-types.html)
* [Basic statistics](mllib-statistics.html)
  * summary statistics
  * correlations
  * stratified sampling
  * hypothesis testing
  * random data generation
* [Classification and regression](mllib-classification-regression.html)
  * [linear models (SVMs, logistic regression, linear regression)](mllib-linear-methods.html)
  * [naive Bayes](mllib-naive-bayes.html)
  * [decision trees](mllib-decision-tree.html)
  * [ensembles of trees](mllib-ensembles.html) (Random Forests and Gradient-Boosted Trees)
  * [isotonic regression](mllib-isotonic-regression.html)
* [Collaborative filtering](mllib-collaborative-filtering.html)
  * alternating least squares (ALS)
* [Clustering](mllib-clustering.html)
  * [k-means](mllib-clustering.html#k-means)
  * [Gaussian mixture](mllib-clustering.html#gaussian-mixture)
  * [power iteration clustering (PIC)](mllib-clustering.html#power-iteration-clustering-pic)
  * [latent Dirichlet allocation (LDA)](mllib-clustering.html#latent-dirichlet-allocation-lda)
  * [streaming k-means](mllib-clustering.html#streaming-k-means)
* [Dimensionality reduction](mllib-dimensionality-reduction.html)
  * singular value decomposition (SVD)
  * principal component analysis (PCA)
* [Feature extraction and transformation](mllib-feature-extraction.html)
* [Frequent pattern mining](mllib-frequent-pattern-mining.html)
  * FP-growth
* [Optimization (developer)](mllib-optimization.html)
  * stochastic gradient descent
  * limited-memory BFGS (L-BFGS)
* [PMML model export](mllib-pmml-model-export.html)

MLlib is under active development.
The APIs marked `Experimental`/`DeveloperApi` may change in future releases,
and the migration guide below will explain all changes between releases.

# spark.ml: high-level APIs for ML pipelines

Spark 1.2 introduced a new package called `spark.ml`, which aims to provide a uniform set of
high-level APIs that help users create and tune practical machine learning pipelines.

*Graduated from Alpha!* The Pipelines API is no longer an alpha component, although many elements of it are still `Experimental` or `DeveloperApi`.

Note that we will keep supporting and adding features to `spark.mllib` along with the
development of `spark.ml`.
Users should be comfortable using `spark.mllib` features and can expect more features to come.
Developers should contribute new algorithms to `spark.mllib` and can optionally contribute
to `spark.ml`.

More detailed guides for `spark.ml` include:

* **[spark.ml programming guide](ml-guide.html)**: overview of the Pipelines API and major concepts
* [Feature transformers](ml-features.html): Details on transformers supported in the Pipelines API, including a few not in the lower-level `spark.mllib` API
* [Ensembles](ml-ensembles.html): Details on ensemble learning methods in the Pipelines API

# Dependencies

MLlib uses the linear algebra package
[Breeze](http://www.scalanlp.org/), which depends on
[netlib-java](https://github.com/fommil/netlib-java) for optimised
numerical processing. If natives are not available at runtime, you
will see a warning message and a pure JVM implementation will be used
instead.

To learn more about the benefits and background of system optimised
natives, you may wish to watch Sam Halliday's ScalaX talk on
[High Performance Linear Algebra in Scala](http://fommil.github.io/scalax14/#/).

Due to licensing issues with runtime proprietary binaries, we do not
include `netlib-java`'s native proxies by default. To configure
`netlib-java` / Breeze to use system optimised binaries, include
`com.github.fommil.netlib:all:1.1.2` (or build Spark with
`-Pnetlib-lgpl`) as a dependency of your project and read the
[netlib-java](https://github.com/fommil/netlib-java) documentation for
your platform's additional installation instructions.

To use MLlib in Python, you will need [NumPy](http://www.numpy.org)
version 1.4 or newer.
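
As a quick sanity check of the NumPy requirement, the following minimal sketch compares a version string against the 1.4 minimum stated above. The helper names `version_tuple` and `meets_minimum` are hypothetical and not part of MLlib or NumPy; the only NumPy attribute used is `numpy.__version__`.

```python
import re

MIN_NUMPY = (1, 4)  # minimum NumPy version stated in this guide


def version_tuple(version):
    """Return the leading numeric components of a version string as ints.

    Parsing stops at the first component that does not start with a digit,
    so "1.10.0rc1" becomes (1, 10, 0).
    """
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(version, minimum=MIN_NUMPY):
    """True if `version` is at least `minimum`, compared component-wise."""
    return version_tuple(version) >= minimum


if __name__ == "__main__":
    try:
        import numpy
    except ImportError:
        print("NumPy is not installed; MLlib in Python requires it.")
    else:
        status = "OK" if meets_minimum(numpy.__version__) else "too old"
        print("NumPy %s: %s" % (numpy.__version__, status))
```

Comparing tuples of integers rather than raw strings avoids the common mistake of treating `"1.10"` as older than `"1.4"`.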

---

# Migration Guide

For the `spark.ml` package, please see the [spark.ml Migration Guide](ml-guide.html#migration-guide).

## From 1.3 to 1.4

In the `spark.mllib` package, there were several breaking changes, but all in `DeveloperApi` or `Experimental` APIs:

* Gradient-Boosted Trees
  * *(Breaking change)* The signature of the [`Loss.gradient`](api/scala/index.html#org.apache.spark.mllib.tree.loss.Loss) method was changed. This is only an issue for users who wrote their own losses for GBTs.
  * *(Breaking change)* The `apply` and `copy` methods for the case class [`BoostingStrategy`](api/scala/index.html#org.apache.spark.mllib.tree.configuration.BoostingStrategy) have been changed because of a modification to the case class fields. This could be an issue for users who use `BoostingStrategy` to set GBT parameters.
* *(Breaking change)* The return value of [`LDA.run`](api/scala/index.html#org.apache.spark.mllib.clustering.LDA) has changed. It now returns an abstract class `LDAModel` instead of the concrete class `DistributedLDAModel`. The object of type `LDAModel` can still be cast to the appropriate concrete type, which depends on the optimization algorithm.

## Previous Spark Versions

Earlier migration guides are archived [on this page](mllib-migration-guides.html).