---
layout: global
title: MLlib
displayTitle: Machine Learning Library (MLlib) Guide
description: MLlib machine learning library overview for Spark SPARK_VERSION_SHORT
---

MLlib is Spark's machine learning (ML) library.
Its goal is to make practical machine learning scalable and easy.
It consists of common learning algorithms and utilities, including classification, regression,
clustering, collaborative filtering, and dimensionality reduction, as well as lower-level optimization
primitives and higher-level pipeline APIs.

It is divided into two packages:

* [`spark.mllib`](mllib-guide.html#data-types-algorithms-and-utilities) contains the original API
  built on top of [RDDs](programming-guide.html#resilient-distributed-datasets-rdds).
* [`spark.ml`](ml-guide.html) provides a higher-level API
  built on top of [DataFrames](sql-programming-guide.html#dataframes) for constructing ML pipelines.

Using `spark.ml` is recommended because its DataFrame-based API is more versatile and flexible.
However, we will keep supporting `spark.mllib` alongside the development of `spark.ml`.
Users should feel comfortable using `spark.mllib` features and can expect more features to come.
Developers should contribute new algorithms to `spark.ml` if they fit the ML pipeline concept well,
e.g., feature extractors and transformers.

We list the major functionality of both packages below, with links to detailed guides.

# spark.mllib: data types, algorithms, and utilities

* [Data types](mllib-data-types.html)
* [Basic statistics](mllib-statistics.html)
  * [summary statistics](mllib-statistics.html#summary-statistics)
  * [correlations](mllib-statistics.html#correlations)
  * [stratified sampling](mllib-statistics.html#stratified-sampling)
  * [hypothesis testing](mllib-statistics.html#hypothesis-testing)
  * [streaming significance testing](mllib-statistics.html#streaming-significance-testing)
  * [random data generation](mllib-statistics.html#random-data-generation)
* [Classification and regression](mllib-classification-regression.html)
  * [linear models (SVMs, logistic regression, linear regression)](mllib-linear-methods.html)
  * [naive Bayes](mllib-naive-bayes.html)
  * [decision trees](mllib-decision-tree.html)
  * [ensembles of trees (Random Forests and Gradient-Boosted Trees)](mllib-ensembles.html)
  * [isotonic regression](mllib-isotonic-regression.html)
* [Collaborative filtering](mllib-collaborative-filtering.html)
  * [alternating least squares (ALS)](mllib-collaborative-filtering.html#collaborative-filtering)
* [Clustering](mllib-clustering.html)
  * [k-means](mllib-clustering.html#k-means)
  * [Gaussian mixture](mllib-clustering.html#gaussian-mixture)
  * [power iteration clustering (PIC)](mllib-clustering.html#power-iteration-clustering-pic)
  * [latent Dirichlet allocation (LDA)](mllib-clustering.html#latent-dirichlet-allocation-lda)
  * [bisecting k-means](mllib-clustering.html#bisecting-kmeans)
  * [streaming k-means](mllib-clustering.html#streaming-k-means)
* [Dimensionality reduction](mllib-dimensionality-reduction.html)
  * [singular value decomposition (SVD)](mllib-dimensionality-reduction.html#singular-value-decomposition-svd)
  * [principal component analysis (PCA)](mllib-dimensionality-reduction.html#principal-component-analysis-pca)
* [Feature extraction and transformation](mllib-feature-extraction.html)
* [Frequent pattern mining](mllib-frequent-pattern-mining.html)
  * [FP-growth](mllib-frequent-pattern-mining.html#fp-growth)
  * [association rules](mllib-frequent-pattern-mining.html#association-rules)
  * [PrefixSpan](mllib-frequent-pattern-mining.html#prefix-span)
* [Evaluation metrics](mllib-evaluation-metrics.html)
* [PMML model export](mllib-pmml-model-export.html)
* [Optimization (developer)](mllib-optimization.html)
  * [stochastic gradient descent](mllib-optimization.html#stochastic-gradient-descent-sgd)
  * [limited-memory BFGS (L-BFGS)](mllib-optimization.html#limited-memory-bfgs-l-bfgs)

# spark.ml: high-level APIs for ML pipelines

* [Overview: estimators, transformers and pipelines](ml-guide.html)
* [Extracting, transforming and selecting features](ml-features.html)
* [Classification and regression](ml-classification-regression.html)
* [Clustering](ml-clustering.html)
* [Collaborative filtering](ml-collaborative-filtering.html)
* [Advanced topics](ml-advanced.html)

Some techniques are not yet available in `spark.ml`, most notably dimensionality reduction.
Users can seamlessly combine the implementations of these techniques found in `spark.mllib` with the rest of the algorithms found in `spark.ml`.

# Dependencies

MLlib uses the linear algebra package [Breeze](http://www.scalanlp.org/), which depends on
[netlib-java](https://github.com/fommil/netlib-java) for optimised numerical processing.
If native libraries[^1] are not available at runtime, you will see a warning message and a pure JVM
implementation will be used instead.

Due to licensing issues with runtime proprietary binaries, we do not include `netlib-java`'s native
proxies by default.
To configure `netlib-java` / Breeze to use system optimised binaries, include
`com.github.fommil.netlib:all:1.1.2` (or build Spark with `-Pnetlib-lgpl`) as a dependency of your
project and read the [netlib-java](https://github.com/fommil/netlib-java) documentation for your
platform's additional installation instructions.

To use MLlib in Python, you will need [NumPy](http://www.numpy.org) version 1.4 or newer.
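
The NumPy requirement above can be checked before starting a job. The following is a minimal sketch that uses only the Python standard library; the helper names `meets_minimum` and `numpy_status` are hypothetical and are not part of Spark or NumPy.

```python
import importlib


def meets_minimum(version_string, minimum=(1, 4)):
    """Return True if a dotted version string is at least `minimum`."""
    parts = []
    for piece in version_string.split(".")[:len(minimum)]:
        # Keep only the leading digits so suffixes such as "0rc1" still parse.
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    # Pad short versions such as "2" so the tuple comparison is well defined.
    while len(parts) < len(minimum):
        parts.append(0)
    return tuple(parts) >= tuple(minimum)


def numpy_status():
    """Describe whether an importable NumPy satisfies the 1.4 minimum."""
    try:
        numpy = importlib.import_module("numpy")
    except ImportError:
        return "NumPy is not installed"
    if meets_minimum(numpy.__version__):
        return "NumPy %s is new enough" % numpy.__version__
    return "NumPy %s is older than 1.4" % numpy.__version__


print(numpy_status())
```

Comparing integer tuples rather than raw strings matters here: as strings, `"1.10"` would sort before `"1.4"`.
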

[^1]: To learn more about the benefits and background of system optimised natives, you may wish to
    watch Sam Halliday's ScalaX talk on [High Performance Linear Algebra in Scala](http://fommil.github.io/scalax14/#/).

# Migration guide

MLlib is under active development.
The APIs marked `Experimental`/`DeveloperApi` may change in future releases,
and the migration guide below will explain all changes between releases.

## From 1.5 to 1.6

There are no breaking API changes in the `spark.mllib` or `spark.ml` packages, but there are
deprecations and changes of behavior.

Deprecations:

* [SPARK-11358](https://issues.apache.org/jira/browse/SPARK-11358):
  In `spark.mllib.clustering.KMeans`, the `runs` parameter has been deprecated.
* [SPARK-10592](https://issues.apache.org/jira/browse/SPARK-10592):
  In `spark.ml.classification.LogisticRegressionModel` and
  `spark.ml.regression.LinearRegressionModel`, the `weights` field has been deprecated in favor of
  the new name `coefficients`. This helps disambiguate from instance (row) "weights" given to
  algorithms.
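
A rename such as `weights` to `coefficients` is commonly handled by keeping the old name as a thin alias that emits a warning. The sketch below illustrates that general pattern in plain Python; `LinearModelSketch` is a hypothetical class, and this is not how Spark itself implements the deprecation.

```python
import warnings


class LinearModelSketch(object):
    """Hypothetical model class used only for illustration; not part of Spark."""

    def __init__(self, coefficients):
        # The new, preferred name holds the actual data.
        self.coefficients = list(coefficients)

    @property
    def weights(self):
        # The old name still works but tells callers to migrate.
        warnings.warn("`weights` is deprecated; use `coefficients` instead.",
                      DeprecationWarning, stacklevel=2)
        return self.coefficients


model = LinearModelSketch([0.5, -1.25])

# Record warnings so the deprecation can be observed rather than just printed.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_value = model.weights

print(old_value == model.coefficients)
print(caught[0].category.__name__)
```

Code that reads the old name keeps working during the deprecation period, while the warning points callers at the new name.
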

Changes of behavior:

* [SPARK-7770](https://issues.apache.org/jira/browse/SPARK-7770):
  `spark.mllib.tree.GradientBoostedTrees`: `validationTol` has changed semantics in 1.6.
  Previously, it was a threshold for absolute change in error. Now, it resembles the behavior of
  `GradientDescent`'s `convergenceTol`: For large errors, it uses relative error (relative to the
  previous error); for small errors (`< 0.01`), it uses absolute error.
* [SPARK-11069](https://issues.apache.org/jira/browse/SPARK-11069):
  `spark.ml.feature.RegexTokenizer`: Previously, it did not convert strings to lowercase before
  tokenizing. Now, it converts to lowercase by default, with an option not to. This matches the
  behavior of the simpler `Tokenizer` transformer.
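
The relative-versus-absolute rule described for `validationTol` can be sketched in plain Python. This is one possible reading of the description above, not Spark's source code: the function name `should_stop` and its parameters are hypothetical, and only the `0.01` cutoff comes from the text. The sketch scales the tolerance by the previous error but never by less than `0.01`, so the test is relative for large errors and becomes a fixed absolute threshold for small ones.

```python
def should_stop(previous_error, current_error, validation_tol, small_error=0.01):
    """Return True when the improvement in validation error is below tolerance.

    Assumption: the tolerance is scaled by the previous error, with a floor of
    `small_error`, so small errors are compared against an absolute threshold.
    """
    improvement = previous_error - current_error
    scale = max(abs(previous_error), small_error)
    return improvement < validation_tol * scale


# Large error: an improvement of 0.001 is only 0.1% of 1.0, so training stops.
print(should_stop(1.0, 0.999, validation_tol=0.01))

# Large error: an improvement of 0.1 is 10% of 1.0, so training continues.
print(should_stop(1.0, 0.9, validation_tol=0.01))

# Small error: the threshold is the absolute value 0.01 * 0.01 = 0.0001.
print(should_stop(0.005, 0.004, validation_tol=0.01))
```

Under the pre-1.6 semantics, the same tolerance would have been compared directly against the absolute change in error, regardless of how large the error was.
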

## Previous Spark versions

Earlier migration guides are archived [on this page](mllib-migration-guides.html).