---
layout: global
title: Machine Learning Library (MLlib)
---
MLlib is a Spark implementation of some common machine learning (ML)
functionality, as well as associated tests and data generators. MLlib
currently supports four common types of machine learning problem settings,
namely binary classification, regression, clustering, and collaborative
filtering, as well as an underlying gradient descent optimization primitive.

# Available Methods

The following links provide a detailed explanation of the methods and usage examples for each of them; a short illustrative example follows the list:

* <a href="mllib-classification-regression.html">Classification and Regression</a>
  * Binary Classification
    * SVM (L1 and L2 regularized)
    * Logistic Regression (L1 and L2 regularized)
  * Linear Regression
    * Least Squares
    * Lasso
    * Ridge Regression
* <a href="mllib-clustering.html">Clustering</a>
  * k-Means
* <a href="mllib-collaborative-filtering.html">Collaborative Filtering</a>
  * Matrix Factorization using Alternating Least Squares
* <a href="mllib-optimization.html">Optimization</a>
  * Gradient Descent and Stochastic Gradient Descent
* <a href="mllib-linear-algebra.html">Linear Algebra</a>
  * Singular Value Decomposition
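
As a quick taste of these APIs, the sketch below trains a binary classifier in the Spark shell. It is illustrative only: it assumes the RDD-based Scala API of this release (a `LabeledPoint` carrying a label and a feature array, and the `LogisticRegressionWithSGD.train` helper), and the input path and iteration count are placeholders. See the classification and regression guide linked above for the authoritative examples.

{% highlight scala %}
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint

// Assumes an existing SparkContext `sc`, e.g. from the Spark shell.
// Each input line is expected to hold a label followed by the features,
// separated by spaces (a placeholder format for this sketch).
val data = sc.textFile("mllib/data/sample_svm_data.txt")
val parsedData = data.map { line =>
  val parts = line.split(' ')
  LabeledPoint(parts(0).toDouble, parts.tail.map(_.toDouble))
}

// Train a logistic regression model with stochastic gradient descent.
val numIterations = 20
val model = LogisticRegressionWithSGD.train(parsedData, numIterations)

// Measure the training error by comparing predictions to the known labels.
val labelAndPreds = parsedData.map { point =>
  (point.label, model.predict(point.features))
}
val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / parsedData.count
println("Training error = " + trainErr)
{% endhighlight %}
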
# Dependencies

MLlib uses the [jblas](https://github.com/mikiobraun/jblas) linear algebra library, which itself
depends on native Fortran routines. You may need to install the
[gfortran runtime library](https://github.com/mikiobraun/jblas/wiki/Missing-Libraries)
if it is not already present on your nodes. MLlib will throw a linking error if it cannot
detect these libraries automatically.

To use MLlib in Python, you will need [NumPy](http://www.numpy.org) version 1.7 or newer
and Python 2.7.