5ffd5d3838
## What changes were proposed in this pull request? Made DataFrame-based API primary * Spark doc menu bar and other places now link to ml-guide.html, not mllib-guide.html * mllib-guide.html keeps RDD-specific list of features, with a link at the top redirecting people to ml-guide.html * ml-guide.html includes a "maintenance mode" announcement about the RDD-based API * **Reviewers: please check this carefully** * (minor) Titles for DF API no longer include "- spark.ml" suffix. Titles for RDD API have "- RDD-based API" suffix * Moved migration guide to ml-guide from mllib-guide * Also moved past guides from mllib-migration-guides to ml-migration-guides, with a redirect link on mllib-migration-guides * **Reviewers**: I did not change any of the content of the migration guides. Reorganized DataFrame-based guide: * ml-guide.html mimics the old mllib-guide.html page in terms of content: overview, migration guide, etc. * Moved Pipeline description into ml-pipeline.html and moved tuning into ml-tuning.html * **Reviewers**: I did not change the content of these guides, except some intro text. * Sidebar remains the same, but with pipeline and tuning sections added Other: * ml-classification-regression.html: Moved text about linear methods to new section in page ## How was this patch tested? Generated docs locally Author: Joseph K. Bradley <joseph@databricks.com> Closes #14213 from jkbradley/ml-guide-2.0.
89 lines
5.1 KiB
Markdown
89 lines
5.1 KiB
Markdown
---
|
|
layout: global
|
|
title: Advanced topics
|
|
displayTitle: Advanced topics
|
|
---
|
|
|
|
* Table of contents
|
|
{:toc}
|
|
|
|
`\[
|
|
\newcommand{\R}{\mathbb{R}}
|
|
\newcommand{\E}{\mathbb{E}}
|
|
\newcommand{\x}{\mathbf{x}}
|
|
\newcommand{\y}{\mathbf{y}}
|
|
\newcommand{\wv}{\mathbf{w}}
|
|
\newcommand{\av}{\mathbf{\alpha}}
|
|
\newcommand{\bv}{\mathbf{b}}
|
|
\newcommand{\N}{\mathbb{N}}
|
|
\newcommand{\id}{\mathbf{I}}
|
|
\newcommand{\ind}{\mathbf{1}}
|
|
\newcommand{\0}{\mathbf{0}}
|
|
\newcommand{\unit}{\mathbf{e}}
|
|
\newcommand{\one}{\mathbf{1}}
|
|
\newcommand{\zero}{\mathbf{0}}
|
|
\]`
|
|
|
|
# Optimization of linear methods (developer)
|
|
|
|
## Limited-memory BFGS (L-BFGS)
|
|
[L-BFGS](http://en.wikipedia.org/wiki/Limited-memory_BFGS) is an optimization
|
|
algorithm in the family of quasi-Newton methods to solve the optimization problems of the form
|
|
`$\min_{\wv \in\R^d} \; f(\wv)$`. The L-BFGS method approximates the objective function locally as a
|
|
quadratic without evaluating the second partial derivatives of the objective function to construct the
|
|
Hessian matrix. The Hessian matrix is approximated by previous gradient evaluations, so there is no
|
|
vertical scalability issue (the number of training features) unlike computing the Hessian matrix
|
|
explicitly in Newton's method. As a result, L-BFGS often achieves faster convergence compared with
|
|
other first-order optimizations.
|
|
|
|
[Orthant-Wise Limited-memory
|
|
Quasi-Newton](http://research-srv.microsoft.com/en-us/um/people/jfgao/paper/icml07scalable.pdf)
|
|
(OWL-QN) is an extension of L-BFGS that can effectively handle L1 and elastic net regularization.
|
|
|
|
L-BFGS is used as a solver for [LinearRegression](api/scala/index.html#org.apache.spark.ml.regression.LinearRegression),
|
|
[LogisticRegression](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegression),
|
|
[AFTSurvivalRegression](api/scala/index.html#org.apache.spark.ml.regression.AFTSurvivalRegression)
|
|
and [MultilayerPerceptronClassifier](api/scala/index.html#org.apache.spark.ml.classification.MultilayerPerceptronClassifier).
|
|
|
|
MLlib L-BFGS solver calls the corresponding implementation in [breeze](https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/LBFGS.scala).
|
|
|
|
## Normal equation solver for weighted least squares
|
|
|
|
MLlib implements normal equation solver for [weighted least squares](https://en.wikipedia.org/wiki/Least_squares#Weighted_least_squares) by [WeightedLeastSquares](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala).
|
|
|
|
Given $n$ weighted observations $(w_i, a_i, b_i)$:
|
|
|
|
* $w_i$ the weight of i-th observation
|
|
* $a_i$ the features vector of i-th observation
|
|
* $b_i$ the label of i-th observation
|
|
|
|
The number of features for each observation is $m$. We use the following weighted least squares formulation:
|
|
`\[
|
|
minimize_{x}\frac{1}{2} \sum_{i=1}^n \frac{w_i(a_i^T x -b_i)^2}{\sum_{k=1}^n w_k} + \frac{1}{2}\frac{\lambda}{\delta}\sum_{j=1}^m(\sigma_{j} x_{j})^2
|
|
\]`
|
|
where $\lambda$ is the regularization parameter, $\delta$ is the population standard deviation of the label
|
|
and $\sigma_j$ is the population standard deviation of the j-th feature column.
|
|
|
|
This objective function has an analytic solution and it requires only one pass over the data to collect necessary statistics to solve.
|
|
Unlike the original dataset which can only be stored in a distributed system,
|
|
these statistics can be loaded into memory on a single machine if the number of features is relatively small, and then we can solve the objective function through Cholesky factorization on the driver.
|
|
|
|
WeightedLeastSquares only supports L2 regularization and provides options to enable or disable regularization and standardization.
|
|
In order to make the normal equation approach efficient, WeightedLeastSquares requires that the number of features be no more than 4096. For larger problems, use L-BFGS instead.
|
|
|
|
## Iteratively reweighted least squares (IRLS)
|
|
|
|
MLlib implements [iteratively reweighted least squares (IRLS)](https://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares) by [IterativelyReweightedLeastSquares](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/optim/IterativelyReweightedLeastSquares.scala).
|
|
It can be used to find the maximum likelihood estimates of a generalized linear model (GLM), find M-estimator in robust regression and other optimization problems.
|
|
Refer to [Iteratively Reweighted Least Squares for Maximum Likelihood Estimation, and some Robust and Resistant Alternatives](http://www.jstor.org/stable/2345503) for more information.
|
|
|
|
It solves certain optimization problems iteratively through the following procedure:
|
|
|
|
* linearize the objective at current solution and update corresponding weight.
|
|
* solve a weighted least squares (WLS) problem by WeightedLeastSquares.
|
|
* repeat above steps until convergence.
|
|
|
|
Since it involves solving a weighted least squares (WLS) problem by WeightedLeastSquares in each iteration,
|
|
it also requires the number of features to be no more than 4096.
|
|
Currently IRLS is used as the default solver of [GeneralizedLinearRegression](api/scala/index.html#org.apache.spark.ml.regression.GeneralizedLinearRegression).
|