---
layout: global
title: "MLlib: Main Guide"
displayTitle: "Machine Learning Library (MLlib) Guide"
---
MLlib is Spark's machine learning (ML) library.
Its goal is to make practical machine learning scalable and easy.
At a high level, it provides tools such as:
* ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
* Featurization: feature extraction, transformation, dimensionality reduction, and selection
* Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
* Persistence: saving and loading algorithms, models, and Pipelines
* Utilities: linear algebra, statistics, data handling, etc.
# Announcement: DataFrame-based API is primary API
**The MLlib RDD-based API is now in maintenance mode.**
As of Spark 2.0, the [RDD](programming-guide.html#resilient-distributed-datasets-rdds)-based APIs in the `spark.mllib` package have entered maintenance mode.
The primary Machine Learning API for Spark is now the [DataFrame](sql-programming-guide.html)-based API in the `spark.ml` package.
*What are the implications?*
* MLlib will still support the RDD-based API in `spark.mllib` with bug fixes.
* MLlib will not add new features to the RDD-based API.
* In the Spark 2.x releases, MLlib will add features to the DataFrame-based API to reach feature parity with the RDD-based API.
* After reaching feature parity (roughly estimated for Spark 2.2), the RDD-based API will be deprecated.
* The RDD-based API is expected to be removed in Spark 3.0.
*Why is MLlib switching to the DataFrame-based API?*
* DataFrames provide a more user-friendly API than RDDs. The many benefits of DataFrames include Spark Datasources, SQL/DataFrame queries, Tungsten and Catalyst optimizations, and uniform APIs across languages.
* The DataFrame-based API for MLlib provides a uniform API across ML algorithms and across multiple languages.
* DataFrames facilitate practical ML Pipelines, particularly feature transformations. See the [Pipelines guide](ml-pipeline.html) for details.
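
To make the Pipeline point concrete, here is a minimal sketch of a DataFrame-based pipeline. The data, column names, and stage choices are illustrative only, and `spark` is assumed to be an existing `SparkSession`; see the Pipelines guide for the authoritative examples.

{% highlight scala %}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// A toy training DataFrame; ids, text, and labels are made up.
val training = spark.createDataFrame(Seq(
  (0L, "spark mllib pipelines", 1.0),
  (1L, "unrelated text", 0.0)
)).toDF("id", "text", "label")

// Feature transformers and an estimator compose into a single Pipeline.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
// fit() runs the transformers, fits the estimator, and returns a PipelineModel.
val model = pipeline.fit(training)
{% endhighlight %}

Because every stage reads and writes DataFrame columns, feature transformations chain naturally from one stage to the next.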
*What is "Spark ML"?*

* "Spark ML" is not an official name but is occasionally used to refer to the MLlib DataFrame-based API.
This is mostly due to the `org.apache.spark.ml` Scala package name used by the DataFrame-based API,
and the "Spark ML Pipelines" term we used initially to emphasize the pipeline concept.

*Is MLlib deprecated?*

* No. MLlib includes both the RDD-based API and the DataFrame-based API.
The RDD-based API is now in maintenance mode,
but neither API is deprecated, nor is MLlib as a whole.
# Dependencies
MLlib uses the linear algebra package [Breeze](http://www.scalanlp.org/), which depends on
[netlib-java](https://github.com/fommil/netlib-java) for optimised numerical processing.
If native libraries[^1] are not available at runtime, you will see a warning message and a pure JVM
implementation will be used instead.
Due to licensing issues with runtime proprietary binaries, we do not include `netlib-java`'s native
proxies by default.
To configure `netlib-java`/Breeze to use system optimised binaries, include
`com.github.fommil.netlib:all:1.1.2` (or build Spark with `-Pnetlib-lgpl`) as a dependency of your
project and read the [netlib-java](https://github.com/fommil/netlib-java) documentation for your
platform's additional installation instructions.
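
As a sketch, with sbt the dependency above could be declared roughly as follows. The `pomOnly()` qualifier follows the netlib-java README, since the `all` artifact is published as a POM-only module; consult the linked documentation for your build tool and platform.

{% highlight scala %}
// build.sbt fragment (illustrative): pulls in netlib-java's native
// proxies for all supported platforms.
libraryDependencies += "com.github.fommil.netlib" % "all" % "1.1.2" pomOnly()
{% endhighlight %}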
To use MLlib in Python, you will need [NumPy](http://www.numpy.org) version 1.4 or newer.
[^1]: To learn more about the benefits and background of system optimised natives, you may wish to
watch Sam Halliday's ScalaX talk on [High Performance Linear Algebra in Scala](http://fommil.github.io/scalax14/#/).
# Migration guide
MLlib is under active development.
The APIs marked `Experimental`/`DeveloperApi` may change in future releases,
and the migration guide below will explain all changes between releases.
## From 2.0 to 2.1
### Breaking changes
**Deprecated methods removed**
* `setLabelCol` in `feature.ChiSqSelectorModel`
* `numTrees` in `classification.RandomForestClassificationModel` (This now refers to the Param called `numTrees`)
* `numTrees` in `regression.RandomForestRegressionModel` (This now refers to the Param called `numTrees`)
* `model` in `regression.LinearRegressionSummary`
* `validateParams` in `PipelineStage`
* `validateParams` in `Evaluator`
### Deprecations and changes of behavior

**Deprecations**
* [SPARK-18592](https://issues.apache.org/jira/browse/SPARK-18592):
Deprecate all Param setter methods except for input/output column Params for `DecisionTreeClassificationModel`, `GBTClassificationModel`, `RandomForestClassificationModel`, `DecisionTreeRegressionModel`, `GBTRegressionModel` and `RandomForestRegressionModel`
**Changes of behavior**
* [SPARK-17870](https://issues.apache.org/jira/browse/SPARK-17870):
Fixed a bug in `ChiSqSelector` that will likely change its results. `ChiSqSelector` now uses pValue rather than the raw statistic to select a fixed number of top features.
* [SPARK-3261](https://issues.apache.org/jira/browse/SPARK-3261):
`KMeans` now returns potentially fewer than k cluster centers in cases where k distinct centroids aren't available or aren't selected.
* [SPARK-17389](https://issues.apache.org/jira/browse/SPARK-17389):
`KMeans` reduces the default number of steps from 5 to 2 for the k-means|| initialization mode.
## Previous Spark versions

Earlier migration guides are archived [on this page](ml-migration-guides.html).
---