Shuo Xiang 303c1201c4 [SPARK-7555] [DOCS] Add doc for elastic net in ml-guide and mllib-guide

jkbradley I put the elastic net under the **Algorithm guide** section. Also add the formula of elastic net in mllib-linear `mllib-linear-methods#regularizers`.

dbtsai I left the code tab for you to add example code. Do you think it is the right place?

Author: Shuo Xiang <shuoxiangpub@gmail.com>

Closes #6504 from coderxiang/elasticnet and squashes the following commits:

f6061ee [Shuo Xiang] typo
90a7c88 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into elasticnet
0610a36 [Shuo Xiang] move out the elastic net to ml-linear-methods
8747190 [Shuo Xiang] merge master
706d3f7 [Shuo Xiang] add python code
9bc2b4c [Shuo Xiang] typo
db32a60 [Shuo Xiang] java code sample
aab3b3a [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into elasticnet
a0dae07 [Shuo Xiang] simplify code
d8616fd [Shuo Xiang] Update the definition of elastic net. Add scala code; Mention Lasso and Ridge
df5bd14 [Shuo Xiang] use wikipeida page in ml-linear-methods.md
78d9366 [Shuo Xiang] address comments
8ce37c2 [Shuo Xiang] Merge branch 'elasticnet' of github.com:coderxiang/spark into elasticnet
8f24848 [Shuo Xiang] Merge branch 'elastic-net-doc' of github.com:coderxiang/spark into elastic-net-doc
998d766 [Shuo Xiang] Merge branch 'elastic-net-doc' of github.com:coderxiang/spark into elastic-net-doc
89f10e4 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into elastic-net-doc
9262a72 [Shuo Xiang] update
7e07d12 [Shuo Xiang] update
b32f21a [Shuo Xiang] add doc for elastic net in sparkml
937eef1 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into elastic-net-doc
180b496 [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
aa0717d [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
5f109b4 [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
c5c5bfe [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
98804c9 [Shuo Xiang] fix bug in topBykey and update test

2015-07-15 12:10:53 -07:00


layout	title	displayTitle
global	Linear Methods - ML	<a href="ml-guide.html">ML</a> - Linear Methods

\[ \newcommand{\R}{\mathbb{R}} \newcommand{\E}{\mathbb{E}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\wv}{\mathbf{w}} \newcommand{\av}{\mathbf{\alpha}} \newcommand{\bv}{\mathbf{b}} \newcommand{\N}{\mathbb{N}} \newcommand{\id}{\mathbf{I}} \newcommand{\ind}{\mathbf{1}} \newcommand{\0}{\mathbf{0}} \newcommand{\unit}{\mathbf{e}} \newcommand{\one}{\mathbf{1}} \newcommand{\zero}{\mathbf{0}} \]

In MLlib, we implement popular linear methods such as logistic regression and linear least squares with L1 or L2 regularization. Refer to the linear methods in mllib for details.

In spark.ml, we also include a Pipelines API for elastic net, a hybrid of L1 and L2 regularization proposed in this paper. Mathematically, it is defined as a convex combination of the L1 and L2 regularization terms:

\[ \alpha \|\wv\|_1 + (1-\alpha) \frac{1}{2}\|\wv\|_2^2, \quad \alpha \in [0, 1]. \]

By setting $\alpha$ appropriately, elastic net contains both L1 and L2 regularization as special cases. For example, if a linear regression model is trained with the elastic net parameter $\alpha$ set to 1, it is equivalent to a Lasso model. If $\alpha$ is set to 0, the trained model reduces to a ridge regression model. We implement the Pipelines API for both linear regression and logistic regression with elastic net regularization.
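To make the two special cases concrete, the penalty can be evaluated by hand on a small weight vector. The following is a minimal sketch in plain Python, not part of the Spark API; the function name `elastic_net_penalty` and the sample vector are illustrative assumptions.

```python
def elastic_net_penalty(w, alpha):
    """Return alpha * ||w||_1 + (1 - alpha) * 0.5 * ||w||_2^2 for alpha in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    l1 = sum(abs(x) for x in w)         # L1-norm of the weights
    l2_squared = sum(x * x for x in w)  # squared L2-norm of the weights
    return alpha * l1 + (1.0 - alpha) * 0.5 * l2_squared


# Hypothetical weight vector, chosen only for illustration.
w = [3.0, -4.0]

lasso_penalty = elastic_net_penalty(w, 1.0)  # alpha = 1: pure L1 term, 7.0
ridge_penalty = elastic_net_penalty(w, 0.0)  # alpha = 0: pure L2 term, 12.5
mixed_penalty = elastic_net_penalty(w, 0.8)  # a blend of the two terms
```

With `alpha = 1` only the L1 term survives (the Lasso penalty), and with `alpha = 0` only the halved squared L2 term survives (the ridge penalty); any value in between mixes the two.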

Examples

{% highlight scala %}

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.mllib.util.MLUtils

// Load training data
val training = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF()

val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

// Fit the model
val lrModel = lr.fit(training)

// Print the weights and intercept for logistic regression
println(s"Weights: ${lrModel.weights} Intercept: ${lrModel.intercept}")

{% endhighlight %}

{% highlight java %}

import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.classification.LogisticRegressionModel;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class LogisticRegressionWithElasticNetExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
      .setAppName("Logistic Regression with Elastic Net Example");

    SparkContext sc = new SparkContext(conf);
    SQLContext sql = new SQLContext(sc);
    String path = "sample_libsvm_data.txt";

    // Load training data
    DataFrame training = sql.createDataFrame(MLUtils.loadLibSVMFile(sc, path).toJavaRDD(), LabeledPoint.class);

    LogisticRegression lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.3)
      .setElasticNetParam(0.8);

    // Fit the model
    LogisticRegressionModel lrModel = lr.fit(training);

    // Print the weights and intercept for logistic regression
    System.out.println("Weights: " + lrModel.weights() + " Intercept: " + lrModel.intercept());
  }
}
{% endhighlight %}

{% highlight python %}

from pyspark.ml.classification import LogisticRegression
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.util import MLUtils

# Load training data
training = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF()

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)

# Fit the model
lrModel = lr.fit(training)

# Print the weights and intercept for logistic regression
print("Weights: " + str(lrModel.weights))
print("Intercept: " + str(lrModel.intercept))
{% endhighlight %}

Optimization

The optimization algorithm underlying the implementation is Orthant-Wise Limited-memory Quasi-Newton (OWL-QN). It is an extension of L-BFGS that can effectively handle L1 regularization and elastic net.
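The key difference from plain L-BFGS is that the L1 term is not differentiable at zero, so OWL-QN works with a pseudo-gradient instead of the ordinary gradient. The sketch below shows that one ingredient in plain Python. It is an illustration of the standard OWL-QN pseudo-gradient rule, not the code Spark runs; the function name `owlqn_pseudo_gradient` and its arguments are assumptions. For elastic net, the smooth L2 part is assumed to be already included in `grad`, and `l1_strength` is the coefficient on the L1 term.

```python
def owlqn_pseudo_gradient(w, grad, l1_strength):
    """Pseudo-gradient of loss(w) + l1_strength * ||w||_1.

    `grad` is the gradient of the smooth part (loss plus any L2 term).
    """
    pseudo = []
    for w_i, g_i in zip(w, grad):
        if w_i > 0:
            # L1 term is differentiable here with slope +l1_strength.
            pseudo.append(g_i + l1_strength)
        elif w_i < 0:
            # L1 term is differentiable here with slope -l1_strength.
            pseudo.append(g_i - l1_strength)
        elif g_i + l1_strength < 0:
            # At zero: moving in the positive direction decreases the objective.
            pseudo.append(g_i + l1_strength)
        elif g_i - l1_strength > 0:
            # At zero: moving in the negative direction decreases the objective.
            pseudo.append(g_i - l1_strength)
        else:
            # At zero and no descent direction: the coordinate stays at zero.
            pseudo.append(0.0)
    return pseudo


# Hypothetical weights and smooth-part gradient, for illustration only.
w = [1.0, -2.0, 0.0, 0.0, 0.0]
grad = [0.5, 0.5, -3.0, 3.0, 0.2]
pseudo = owlqn_pseudo_gradient(w, grad, l1_strength=1.0)
# pseudo == [1.5, -0.5, -2.0, 2.0, 0.0]
```

The last coordinate shows why L1 regularization produces sparse models: when the smooth gradient at a zero weight is smaller in magnitude than the L1 strength, the pseudo-gradient is zero and the weight is left at exactly zero.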
