[SPARK-7555] [DOCS] Add doc for elastic net in ml-guide and mllib-guide
jkbradley I put the elastic net under the **Algorithm guide** section. Also add the formula of elastic net in mllib-linear `mllib-linear-methods#regularizers`.

dbtsai I left the code tab for you to add example code. Do you think it is the right place?

Author: Shuo Xiang <shuoxiangpub@gmail.com>

Closes #6504 from coderxiang/elasticnet and squashes the following commits:

f6061ee [Shuo Xiang] typo
90a7c88 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into elasticnet
0610a36 [Shuo Xiang] move out the elastic net to ml-linear-methods
8747190 [Shuo Xiang] merge master
706d3f7 [Shuo Xiang] add python code
9bc2b4c [Shuo Xiang] typo
db32a60 [Shuo Xiang] java code sample
aab3b3a [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into elasticnet
a0dae07 [Shuo Xiang] simplify code
d8616fd [Shuo Xiang] Update the definition of elastic net. Add scala code; Mention Lasso and Ridge
df5bd14 [Shuo Xiang] use wikipeida page in ml-linear-methods.md
78d9366 [Shuo Xiang] address comments
8ce37c2 [Shuo Xiang] Merge branch 'elasticnet' of github.com:coderxiang/spark into elasticnet
8f24848 [Shuo Xiang] Merge branch 'elastic-net-doc' of github.com:coderxiang/spark into elastic-net-doc
998d766 [Shuo Xiang] Merge branch 'elastic-net-doc' of github.com:coderxiang/spark into elastic-net-doc
89f10e4 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into elastic-net-doc
9262a72 [Shuo Xiang] update
7e07d12 [Shuo Xiang] update
b32f21a [Shuo Xiang] add doc for elastic net in sparkml
937eef1 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into elastic-net-doc
180b496 [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
aa0717d [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
5f109b4 [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
c5c5bfe [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
98804c9 [Shuo Xiang] fix bug in topBykey and update test
This commit is contained in:
parent
9716a727fb
commit
303c1201c4
|
@ -3,6 +3,24 @@ layout: global
|
|||
title: Spark ML Programming Guide
|
||||
---
|
||||
|
||||
`\[
|
||||
\newcommand{\R}{\mathbb{R}}
|
||||
\newcommand{\E}{\mathbb{E}}
|
||||
\newcommand{\x}{\mathbf{x}}
|
||||
\newcommand{\y}{\mathbf{y}}
|
||||
\newcommand{\wv}{\mathbf{w}}
|
||||
\newcommand{\av}{\mathbf{\alpha}}
|
||||
\newcommand{\bv}{\mathbf{b}}
|
||||
\newcommand{\N}{\mathbb{N}}
|
||||
\newcommand{\id}{\mathbf{I}}
|
||||
\newcommand{\ind}{\mathbf{1}}
|
||||
\newcommand{\0}{\mathbf{0}}
|
||||
\newcommand{\unit}{\mathbf{e}}
|
||||
\newcommand{\one}{\mathbf{1}}
|
||||
\newcommand{\zero}{\mathbf{0}}
|
||||
\]`
|
||||
|
||||
|
||||
Spark 1.2 introduced a new package called `spark.ml`, which aims to provide a uniform set of
|
||||
high-level APIs that help users create and tune practical machine learning pipelines.
|
||||
|
||||
|
@ -154,6 +172,19 @@ Parameters belong to specific instances of `Estimator`s and `Transformer`s.
|
|||
For example, if we have two `LogisticRegression` instances `lr1` and `lr2`, then we can build a `ParamMap` with both `maxIter` parameters specified: `ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20)`.
|
||||
This is useful if there are two algorithms with the `maxIter` parameter in a `Pipeline`.
|
||||
|
||||
# Algorithm Guides
|
||||
|
||||
There are now several algorithms in the Pipelines API which are not in the lower-level MLlib API, so we link to documentation for them here. These algorithms are mostly feature transformers, which fit naturally into the `Transformer` abstraction in Pipelines, and ensembles, which fit naturally into the `Estimator` abstraction in Pipelines.
|
||||
|
||||
**Pipelines API Algorithm Guides**
|
||||
|
||||
* [Feature Extraction, Transformation, and Selection](ml-features.html)
|
||||
* [Ensembles](ml-ensembles.html)
|
||||
|
||||
**Algorithms in `spark.ml`**
|
||||
|
||||
* [Linear methods with elastic net regularization](ml-linear-methods.html)
|
||||
|
||||
# Code Examples
|
||||
|
||||
This section gives code examples illustrating the functionality discussed above.
|
||||
|
|
129
docs/ml-linear-methods.md
Normal file
|
@ -0,0 +1,129 @@
|
|||
---
|
||||
layout: global
|
||||
title: Linear Methods - ML
|
||||
displayTitle: <a href="ml-guide.html">ML</a> - Linear Methods
|
||||
---
|
||||
|
||||
|
||||
`\[
|
||||
\newcommand{\R}{\mathbb{R}}
|
||||
\newcommand{\E}{\mathbb{E}}
|
||||
\newcommand{\x}{\mathbf{x}}
|
||||
\newcommand{\y}{\mathbf{y}}
|
||||
\newcommand{\wv}{\mathbf{w}}
|
||||
\newcommand{\av}{\mathbf{\alpha}}
|
||||
\newcommand{\bv}{\mathbf{b}}
|
||||
\newcommand{\N}{\mathbb{N}}
|
||||
\newcommand{\id}{\mathbf{I}}
|
||||
\newcommand{\ind}{\mathbf{1}}
|
||||
\newcommand{\0}{\mathbf{0}}
|
||||
\newcommand{\unit}{\mathbf{e}}
|
||||
\newcommand{\one}{\mathbf{1}}
|
||||
\newcommand{\zero}{\mathbf{0}}
|
||||
\]`
|
||||
|
||||
|
||||
In MLlib, we implement popular linear methods such as logistic regression and linear least squares with L1 or L2 regularization. Refer to [the linear methods in mllib](mllib-linear-methods.html) for details. In `spark.ml`, we also provide a Pipelines API for [elastic net](http://en.wikipedia.org/wiki/Elastic_net_regularization), a hybrid of L1 and L2 regularization proposed in [this paper](http://users.stat.umn.edu/~zouxx019/Papers/elasticnet.pdf). Mathematically, it is defined as a convex combination of the L1 norm and the squared L2 norm:
|
||||
`\[
|
||||
\alpha \|\wv\|_1 + (1-\alpha) \frac{1}{2}\|\wv\|_2^2, \alpha \in [0, 1].
|
||||
\]`
|
||||
By choosing $\alpha$ appropriately, elastic net recovers both L1 and L2 regularization as special cases. For example, if a [linear regression](https://en.wikipedia.org/wiki/Linear_regression) model is trained with the elastic net parameter $\alpha$ set to $1$, it is equivalent to a [Lasso](http://en.wikipedia.org/wiki/Least_squares#Lasso_method) model. On the other hand, if $\alpha$ is set to $0$, the trained model reduces to a [ridge regression](http://en.wikipedia.org/wiki/Tikhonov_regularization) model. We implement the Pipelines API for both linear regression and logistic regression with elastic net regularization.
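To make the two special cases concrete, here is a minimal plain-Python sketch of the penalty term defined above. The function name `elastic_net_penalty` and the sample weight vector are illustrative assumptions and are not part of `spark.ml`; the sketch only evaluates the formula and does not train a model.

```python
def elastic_net_penalty(w, alpha):
    """Evaluate alpha * ||w||_1 + (1 - alpha) * 0.5 * ||w||_2^2.

    Hypothetical helper for illustration only; not a spark.ml API.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    l1_norm = sum(abs(x) for x in w)
    squared_l2_norm = sum(x * x for x in w)
    return alpha * l1_norm + (1.0 - alpha) * 0.5 * squared_l2_norm


w = [3.0, -4.0, 0.0]  # made-up weight vector
print(elastic_net_penalty(w, 1.0))  # pure L1 (Lasso-style) penalty: 7.0
print(elastic_net_penalty(w, 0.0))  # pure L2 (ridge-style) penalty: 12.5
print(elastic_net_penalty(w, 0.5))  # equal mix of the two: 9.75
```

In the examples below, `elasticNetParam` appears to play the role of $\alpha$, while `regParam` scales the overall strength of the penalty.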
|
||||
|
||||
**Examples**
|
||||
|
||||
<div class="codetabs">
|
||||
|
||||
<div data-lang="scala" markdown="1">
|
||||
|
||||
{% highlight scala %}
|
||||
|
||||
import org.apache.spark.ml.classification.LogisticRegression
|
||||
import org.apache.spark.mllib.util.MLUtils
|
||||
|
||||
// Load training data
|
||||
val training = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF()
|
||||
|
||||
val lr = new LogisticRegression()
|
||||
.setMaxIter(10)
|
||||
.setRegParam(0.3)
|
||||
.setElasticNetParam(0.8)
|
||||
|
||||
// Fit the model
|
||||
val lrModel = lr.fit(training)
|
||||
|
||||
// Print the weights and intercept for logistic regression
|
||||
println(s"Weights: ${lrModel.weights} Intercept: ${lrModel.intercept}")
|
||||
|
||||
{% endhighlight %}
|
||||
|
||||
</div>
|
||||
|
||||
<div data-lang="java" markdown="1">
|
||||
|
||||
{% highlight java %}
|
||||
|
||||
import org.apache.spark.ml.classification.LogisticRegression;
|
||||
import org.apache.spark.ml.classification.LogisticRegressionModel;
|
||||
import org.apache.spark.mllib.regression.LabeledPoint;
|
||||
import org.apache.spark.mllib.util.MLUtils;
|
||||
import org.apache.spark.SparkConf;
|
||||
import org.apache.spark.SparkContext;
|
||||
import org.apache.spark.sql.DataFrame;
|
||||
import org.apache.spark.sql.SQLContext;
|
||||
|
||||
public class LogisticRegressionWithElasticNetExample {
|
||||
public static void main(String[] args) {
|
||||
SparkConf conf = new SparkConf()
|
||||
.setAppName("Logistic Regression with Elastic Net Example");
|
||||
|
||||
SparkContext sc = new SparkContext(conf);
|
||||
SQLContext sql = new SQLContext(sc);
|
||||
String path = "sample_libsvm_data.txt";
|
||||
|
||||
// Load training data
|
||||
DataFrame training = sql.createDataFrame(MLUtils.loadLibSVMFile(sc, path).toJavaRDD(), LabeledPoint.class);
|
||||
|
||||
LogisticRegression lr = new LogisticRegression()
|
||||
.setMaxIter(10)
|
||||
.setRegParam(0.3)
|
||||
.setElasticNetParam(0.8);
|
||||
|
||||
// Fit the model
|
||||
LogisticRegressionModel lrModel = lr.fit(training);
|
||||
|
||||
// Print the weights and intercept for logistic regression
|
||||
System.out.println("Weights: " + lrModel.weights() + " Intercept: " + lrModel.intercept());
|
||||
}
|
||||
}
|
||||
{% endhighlight %}
|
||||
</div>
|
||||
|
||||
<div data-lang="python" markdown="1">
|
||||
|
||||
{% highlight python %}
|
||||
|
||||
from pyspark.ml.classification import LogisticRegression
|
||||
from pyspark.mllib.regression import LabeledPoint
|
||||
from pyspark.mllib.util import MLUtils
|
||||
|
||||
# Load training data
|
||||
training = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF()
|
||||
|
||||
lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
|
||||
|
||||
# Fit the model
|
||||
lrModel = lr.fit(training)
|
||||
|
||||
# Print the weights and intercept for logistic regression
|
||||
print("Weights: " + str(lrModel.weights))
|
||||
print("Intercept: " + str(lrModel.intercept))
|
||||
{% endhighlight %}
|
||||
|
||||
</div>
|
||||
|
||||
</div>
|
||||
|
||||
### Optimization
|
||||
|
||||
The optimization algorithm underlying the implementation is called [Orthant-Wise Limited-memory Quasi-Newton](http://research-srv.microsoft.com/en-us/um/people/jfgao/paper/icml07scalable.pdf)
|
||||
(OWL-QN). It is an extension of L-BFGS that can effectively handle L1 regularization and elastic net.
|
|
@ -10,7 +10,7 @@ displayTitle: <a href="mllib-guide.html">MLlib</a> - Linear Methods
|
|||
|
||||
`\[
|
||||
\newcommand{\R}{\mathbb{R}}
|
||||
\newcommand{\E}{\mathbb{E}}
|
||||
\newcommand{\E}{\mathbb{E}}
|
||||
\newcommand{\x}{\mathbf{x}}
|
||||
\newcommand{\y}{\mathbf{y}}
|
||||
\newcommand{\wv}{\mathbf{w}}
|
||||
|
@ -18,10 +18,10 @@ displayTitle: <a href="mllib-guide.html">MLlib</a> - Linear Methods
|
|||
\newcommand{\bv}{\mathbf{b}}
|
||||
\newcommand{\N}{\mathbb{N}}
|
||||
\newcommand{\id}{\mathbf{I}}
|
||||
\newcommand{\ind}{\mathbf{1}}
|
||||
\newcommand{\0}{\mathbf{0}}
|
||||
\newcommand{\unit}{\mathbf{e}}
|
||||
\newcommand{\one}{\mathbf{1}}
|
||||
\newcommand{\ind}{\mathbf{1}}
|
||||
\newcommand{\0}{\mathbf{0}}
|
||||
\newcommand{\unit}{\mathbf{e}}
|
||||
\newcommand{\one}{\mathbf{1}}
|
||||
\newcommand{\zero}{\mathbf{0}}
|
||||
\]`
|
||||
|
||||
|
@ -29,7 +29,7 @@ displayTitle: <a href="mllib-guide.html">MLlib</a> - Linear Methods
|
|||
|
||||
Many standard *machine learning* methods can be formulated as a convex optimization problem, i.e.
|
||||
the task of finding a minimizer of a convex function `$f$` that depends on a variable vector
|
||||
`$\wv$` (called `weights` in the code), which has `$d$` entries.
|
||||
`$\wv$` (called `weights` in the code), which has `$d$` entries.
|
||||
Formally, we can write this as the optimization problem `$\min_{\wv \in\R^d} \; f(\wv)$`, where
|
||||
the objective function is of the form
|
||||
`\begin{equation}
|
||||
|
@ -39,7 +39,7 @@ the objective function is of the form
|
|||
\ .
|
||||
\end{equation}`
|
||||
Here the vectors `$\x_i\in\R^d$` are the training data examples, for `$1\le i\le n$`, and
|
||||
`$y_i\in\R$` are their corresponding labels, which we want to predict.
|
||||
`$y_i\in\R$` are their corresponding labels, which we want to predict.
|
||||
We call the method *linear* if $L(\wv; \x, y)$ can be expressed as a function of $\wv^T x$ and $y$.
|
||||
Several of MLlib's classification and regression algorithms fall into this category,
|
||||
and are discussed here.
|
||||
|
@ -99,6 +99,9 @@ regularizers in MLlib:
|
|||
<tr>
|
||||
<td>L1</td><td>$\|\wv\|_1$</td><td>$\mathrm{sign}(\wv)$</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>elastic net</td><td>$\alpha \|\wv\|_1 + (1-\alpha)\frac{1}{2}\|\wv\|_2^2$</td><td>$\alpha \mathrm{sign}(\wv) + (1-\alpha) \wv$</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
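As a rough illustration of the elastic net row in the table, the following plain-Python sketch evaluates the listed (sub)gradient coordinate by coordinate. The helper names are assumptions made for this example and do not correspond to MLlib classes; at a zero coordinate the L1 term is not differentiable, and the sketch picks $0$ as one valid subgradient there.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def elastic_net_subgradient(w, alpha):
    """Evaluate alpha * sign(w) + (1 - alpha) * w element-wise.

    Hypothetical helper for illustration only; not an MLlib API.
    """
    return [alpha * sign(x) + (1.0 - alpha) * x for x in w]


w = [3.0, -4.0, 0.0]  # made-up weight vector
print(elastic_net_subgradient(w, 0.5))  # [2.0, -2.5, 0.0]
```

Setting `alpha` to `1.0` gives the L1 row of the table, `sign(w)`, and setting it to `0.0` gives `w` itself.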
|
||||
|
||||
|
@ -107,7 +110,7 @@ of `$\wv$`.
|
|||
|
||||
L2-regularized problems are generally easier to solve than L1-regularized due to smoothness.
|
||||
However, L1 regularization can help promote sparsity in weights leading to smaller and more interpretable models, the latter of which can be useful for feature selection.
|
||||
It is not recommended to train models without any regularization,
|
||||
[Elastic net](http://en.wikipedia.org/wiki/Elastic_net_regularization) is a combination of L1 and L2 regularization. It is not recommended to train models without any regularization,
|
||||
especially when the number of training examples is small.
|
||||
|
||||
### Optimization
|
||||
|
@ -531,7 +534,7 @@ sameModel = LogisticRegressionModel.load(sc, "myModelPath")
|
|||
### Linear least squares, Lasso, and ridge regression
|
||||
|
||||
|
||||
Linear least squares is the most common formulation for regression problems.
|
||||
Linear least squares is the most common formulation for regression problems.
|
||||
It is a linear method as described above in equation `$\eqref{eq:regPrimal}$`, with the loss
|
||||
function in the formulation given by the squared loss:
|
||||
`\[
|
||||
|
@ -539,8 +542,8 @@ L(\wv;\x,y) := \frac{1}{2} (\wv^T \x - y)^2.
|
|||
\]`
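The squared loss above, and the mean squared error used later to evaluate the fit, can be sketched in a few lines of plain Python. The function names and the toy data are assumptions for illustration; note that the per-example loss carries the factor $\frac{1}{2}$, while the mean squared error as written in this guide does not.

```python
def dot(w, x):
    """Inner product w^T x of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def squared_loss(w, x, y):
    """Per-example loss L(w; x, y) = 0.5 * (w^T x - y)^2."""
    return 0.5 * (dot(w, x) - y) ** 2


def mean_squared_error(w, xs, ys):
    """Average of (w^T x_i - y_i)^2 over all examples."""
    return sum((dot(w, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data, made up for this example.
w = [1.0, 2.0]
xs = [[1.0, 1.0], [2.0, 0.0]]
ys = [2.0, 4.0]

print(squared_loss(w, xs[0], ys[0]))   # 0.5 * (3.0 - 2.0)^2 = 0.5
print(mean_squared_error(w, xs, ys))   # (1.0 + 4.0) / 2 = 2.5
```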
|
||||
|
||||
Various related regression methods are derived by using different types of regularization:
|
||||
[*ordinary least squares*](http://en.wikipedia.org/wiki/Ordinary_least_squares) or
|
||||
[*linear least squares*](http://en.wikipedia.org/wiki/Linear_least_squares_(mathematics)) uses
|
||||
[*ordinary least squares*](http://en.wikipedia.org/wiki/Ordinary_least_squares) or
|
||||
[*linear least squares*](http://en.wikipedia.org/wiki/Linear_least_squares_(mathematics)) uses
|
||||
no regularization; [*ridge regression*](http://en.wikipedia.org/wiki/Ridge_regression) uses L2
|
||||
regularization; and [*Lasso*](http://en.wikipedia.org/wiki/Lasso_(statistics)) uses L1
|
||||
regularization. For all of these models, the average loss or training error, $\frac{1}{n} \sum_{i=1}^n (\wv^T x_i - y_i)^2$, is
|
||||
|
@ -552,7 +555,7 @@ known as the [mean squared error](http://en.wikipedia.org/wiki/Mean_squared_erro
|
|||
|
||||
<div data-lang="scala" markdown="1">
|
||||
The following example demonstrates how to load training data and parse it as an RDD of LabeledPoint.
|
||||
The example then uses LinearRegressionWithSGD to build a simple linear model to predict label
|
||||
The example then uses LinearRegressionWithSGD to build a simple linear model to predict label
|
||||
values. We compute the mean squared error at the end to evaluate
|
||||
[goodness of fit](http://en.wikipedia.org/wiki/Goodness_of_fit).
|
||||
|
||||
|
@ -614,7 +617,7 @@ public class LinearRegression {
|
|||
public static void main(String[] args) {
|
||||
SparkConf conf = new SparkConf().setAppName("Linear Regression Example");
|
||||
JavaSparkContext sc = new JavaSparkContext(conf);
|
||||
|
||||
|
||||
// Load and parse the data
|
||||
String path = "data/mllib/ridge-data/lpsa.data";
|
||||
JavaRDD<String> data = sc.textFile(path);
|
||||
|
@ -634,7 +637,7 @@ public class LinearRegression {
|
|||
|
||||
// Building the model
|
||||
int numIterations = 100;
|
||||
final LinearRegressionModel model =
|
||||
final LinearRegressionModel model =
|
||||
LinearRegressionWithSGD.train(JavaRDD.toRDD(parsedData), numIterations);
|
||||
|
||||
// Evaluate model on training examples and compute training error
|
||||
|
@ -665,7 +668,7 @@ public class LinearRegression {
|
|||
|
||||
<div data-lang="python" markdown="1">
|
||||
The following example demonstrates how to load training data and parse it as an RDD of LabeledPoint.
|
||||
The example then uses LinearRegressionWithSGD to build a simple linear model to predict label
|
||||
The example then uses LinearRegressionWithSGD to build a simple linear model to predict label
|
||||
values. We compute the mean squared error at the end to evaluate
|
||||
[goodness of fit](http://en.wikipedia.org/wiki/Goodness_of_fit).
|
||||
|
||||
|
@ -706,8 +709,8 @@ a dependency.
|
|||
|
||||
### Streaming linear regression
|
||||
|
||||
When data arrive in a streaming fashion, it is useful to fit regression models online,
|
||||
updating the parameters of the model as new data arrives. MLlib currently supports
|
||||
When data arrive in a streaming fashion, it is useful to fit regression models online,
|
||||
updating the parameters of the model as new data arrives. MLlib currently supports
|
||||
streaming linear regression using ordinary least squares. The fitting is similar
|
||||
to that performed offline, except fitting occurs on each batch of data, so that
|
||||
the model continually updates to reflect the data from the stream.
|
||||
|
@ -722,7 +725,7 @@ online to the first stream, and make predictions on the second stream.
|
|||
|
||||
<div data-lang="scala" markdown="1">
|
||||
|
||||
First, we import the necessary classes for parsing our input data and creating the model.
|
||||
First, we import the necessary classes for parsing our input data and creating the model.
|
||||
|
||||
{% highlight scala %}
|
||||
|
||||
|
@ -734,7 +737,7 @@ import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD
|
|||
|
||||
Then we make input streams for training and testing data. We assume a StreamingContext `ssc`
|
||||
has already been created, see [Spark Streaming Programming Guide](streaming-programming-guide.html#initializing)
|
||||
for more info. For this example, we use labeled points in training and testing streams,
|
||||
for more info. For this example, we use labeled points in training and testing streams,
|
||||
but in practice you will likely want to use unlabeled vectors for test data.
|
||||
|
||||
{% highlight scala %}
|
||||
|
@ -754,7 +757,7 @@ val model = new StreamingLinearRegressionWithSGD()
|
|||
|
||||
{% endhighlight %}
|
||||
|
||||
Now we register the streams for training and testing and start the job.
|
||||
Now we register the streams for training and testing and start the job.
|
||||
Printing predictions alongside true labels lets us easily see the result.
|
||||
|
||||
{% highlight scala %}
|
||||
|
@ -764,14 +767,14 @@ model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
|
|||
|
||||
ssc.start()
|
||||
ssc.awaitTermination()
|
||||
|
||||
|
||||
{% endhighlight %}
|
||||
|
||||
We can now save text files with data to the training or testing folders.
|
||||
Each line should be a data point formatted as `(y,[x1,x2,x3])` where `y` is the label
|
||||
and `x1,x2,x3` are the features. Anytime a text file is placed in `/training/data/dir`
|
||||
the model will update. Anytime a text file is placed in `/testing/data/dir` you will see predictions.
|
||||
As you feed more data to the training directory, the predictions
|
||||
Each line should be a data point formatted as `(y,[x1,x2,x3])` where `y` is the label
|
||||
and `x1,x2,x3` are the features. Anytime a text file is placed in `/training/data/dir`
|
||||
the model will update. Anytime a text file is placed in `/testing/data/dir` you will see predictions.
|
||||
As you feed more data to the training directory, the predictions
|
||||
will get better!
|
||||
|
||||
</div>
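To check a data file before dropping it into the training or testing directory, a line in the `(y,[x1,x2,x3])` format described above can be validated with a small standalone script. This is a sketch under the assumption that values are plain decimal numbers; `parse_labeled_point` is a hypothetical helper written for this example and is not MLlib's own parser.

```python
def parse_labeled_point(line):
    """Parse a line such as "(1.0,[2.0,3.5,-1.0])" into (label, features).

    Hypothetical helper for illustration only; not MLlib's parser.
    """
    text = line.strip()
    if not (text.startswith("(") and text.endswith(")")):
        raise ValueError("expected a line of the form (y,[x1,x2,x3])")
    # Split once: everything before the first comma is the label.
    label_text, features_text = text[1:-1].split(",", 1)
    features_text = features_text.strip()
    if not (features_text.startswith("[") and features_text.endswith("]")):
        raise ValueError("expected features enclosed in square brackets")
    inner = features_text[1:-1].strip()
    features = [float(v) for v in inner.split(",")] if inner else []
    return float(label_text), features


label, features = parse_labeled_point("(1.0,[2.0,3.5,-1.0])")
print(label)     # 1.0
print(features)  # [2.0, 3.5, -1.0]
```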
|
||||
|
|