---
layout: global
title: Linear Methods - ML
displayTitle: <a href="ml-guide.html">ML</a> - Linear Methods
---


`\[
\newcommand{\R}{\mathbb{R}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\x}{\mathbf{x}}
\newcommand{\y}{\mathbf{y}}
\newcommand{\wv}{\mathbf{w}}
\newcommand{\av}{\mathbf{\alpha}}
\newcommand{\bv}{\mathbf{b}}
\newcommand{\N}{\mathbb{N}}
\newcommand{\id}{\mathbf{I}}
\newcommand{\ind}{\mathbf{1}}
\newcommand{\0}{\mathbf{0}}
\newcommand{\unit}{\mathbf{e}}
\newcommand{\one}{\mathbf{1}}
\newcommand{\zero}{\mathbf{0}}
\]`


In MLlib, we implement popular linear methods such as logistic regression and linear least squares with L1 or L2 regularization. Refer to [the linear methods in mllib](mllib-linear-methods.html) for details. In `spark.ml`, we also provide a Pipelines API for [elastic net](http://en.wikipedia.org/wiki/Elastic_net_regularization), a hybrid of L1 and L2 regularization proposed in [this paper](http://users.stat.umn.edu/~zouxx019/Papers/elasticnet.pdf). Mathematically, it is defined as a convex combination of the L1 and the L2 regularization terms:
`\[
\alpha \|\wv\|_1 + (1-\alpha) \frac{1}{2}\|\wv\|_2^2, \alpha \in [0, 1].
\]`
By setting $\alpha$ properly, elastic net contains both L1 and L2 regularization as special cases. For example, if a [linear regression](https://en.wikipedia.org/wiki/Linear_regression) model is trained with the elastic net parameter $\alpha$ set to $1$, it is equivalent to a [Lasso](http://en.wikipedia.org/wiki/Least_squares#Lasso_method) model. On the other hand, if $\alpha$ is set to $0$, the trained model reduces to a [ridge regression](http://en.wikipedia.org/wiki/Tikhonov_regularization) model. We implement the Pipelines API for both linear regression and logistic regression with elastic net regularization.
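
The two special cases can be checked numerically. The following is a minimal plain-Python sketch of the penalty formula above; the function name `elastic_net_penalty` and the sample weight vector are made up for illustration and are not part of the `spark.ml` API.

```python
def elastic_net_penalty(weights, alpha):
    """Return alpha * ||w||_1 + (1 - alpha) * 0.5 * ||w||_2^2.

    Hypothetical helper for illustration only; not a Spark function.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    l1_norm = sum(abs(w) for w in weights)
    l2_norm_squared = sum(w * w for w in weights)
    return alpha * l1_norm + (1.0 - alpha) * 0.5 * l2_norm_squared


w = [1.0, -2.0, 0.0]

# alpha = 1 keeps only the L1 term (the Lasso penalty)
lasso_penalty = elastic_net_penalty(w, 1.0)

# alpha = 0 keeps only the halved squared L2 term (the ridge penalty)
ridge_penalty = elastic_net_penalty(w, 0.0)

# any alpha in between mixes the two terms
mixed_penalty = elastic_net_penalty(w, 0.5)

print(lasso_penalty, ridge_penalty, mixed_penalty)
```

For this weight vector the L1 norm is 3 and the squared L2 norm is 5, so the mixed value at $\alpha = 0.5$ lies between the two pure penalties.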

**Examples**

<div class="codetabs">

<div data-lang="scala" markdown="1">

{% highlight scala %}

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.mllib.util.MLUtils

// Load training data
val training = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF()

val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

// Fit the model
val lrModel = lr.fit(training)

// Print the weights and intercept for logistic regression
println(s"Weights: ${lrModel.weights} Intercept: ${lrModel.intercept}")

{% endhighlight %}

</div>

<div data-lang="java" markdown="1">

{% highlight java %}

import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.classification.LogisticRegressionModel;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class LogisticRegressionWithElasticNetExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
      .setAppName("Logistic Regression with Elastic Net Example");

    SparkContext sc = new SparkContext(conf);
    SQLContext sql = new SQLContext(sc);
    String path = "data/mllib/sample_libsvm_data.txt";

    // Load training data
    DataFrame training = sql.createDataFrame(MLUtils.loadLibSVMFile(sc, path).toJavaRDD(), LabeledPoint.class);

    LogisticRegression lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.3)
      .setElasticNetParam(0.8);

    // Fit the model
    LogisticRegressionModel lrModel = lr.fit(training);

    // Print the weights and intercept for logistic regression
    System.out.println("Weights: " + lrModel.weights() + " Intercept: " + lrModel.intercept());
  }
}
{% endhighlight %}
</div>

<div data-lang="python" markdown="1">

{% highlight python %}

from pyspark.ml.classification import LogisticRegression
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.util import MLUtils

# Load training data
training = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF()

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)

# Fit the model
lrModel = lr.fit(training)

# Print the weights and intercept for logistic regression
print("Weights: " + str(lrModel.weights))
print("Intercept: " + str(lrModel.intercept))
{% endhighlight %}

</div>

</div>
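
As a worked instance of the elastic net formula, take the values used in the examples, `regParam = 0.3` and `elasticNetParam = 0.8`. Assuming the regularization parameter multiplies the whole elastic net term (an assumption about how the two parameters combine, not something stated in this guide), the penalty added to the loss would be:
`\[
0.3 \left( 0.8 \|\wv\|_1 + (1 - 0.8) \frac{1}{2}\|\wv\|_2^2 \right) = 0.24 \|\wv\|_1 + 0.06 \cdot \frac{1}{2}\|\wv\|_2^2 .
\]`
Under that reading, most of the regularization strength in the examples goes to the L1 term, which tends to drive some weights to exactly zero.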

### Optimization

The optimization algorithm underlying the implementation is called [Orthant-Wise Limited-memory Quasi-Newton](http://research-srv.microsoft.com/en-us/um/people/jfgao/paper/icml07scalable.pdf)
(OWL-QN). It is an extension of L-BFGS that can effectively handle L1 regularization and elastic net.
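
To give a feel for why an extension of L-BFGS is needed, the sketch below shows one ingredient of OWL-QN as described in the linked paper: the pseudo-gradient, which replaces the ordinary gradient because the L1 term is not differentiable at zero. This is an illustrative plain-Python sketch, not the code used by `spark.ml`; the function name `pseudo_gradient` and its arguments are hypothetical. It assumes the smooth part of the objective (the loss plus any L2 term) has already been differentiated into `smooth_grad`.

```python
def pseudo_gradient(weights, smooth_grad, l1_strength):
    """Orthant-wise pseudo-gradient of f(w) + l1_strength * ||w||_1.

    weights      -- current weight vector
    smooth_grad  -- gradient of the smooth part f at weights
    l1_strength  -- non-negative coefficient of the L1 term
    Hypothetical helper for illustration only.
    """
    result = []
    for w, g in zip(weights, smooth_grad):
        if w > 0.0:
            # |w| has derivative +1 on the positive side
            result.append(g + l1_strength)
        elif w < 0.0:
            # |w| has derivative -1 on the negative side
            result.append(g - l1_strength)
        elif g + l1_strength < 0.0:
            # at zero, the right-hand derivative is negative:
            # moving into the positive orthant lowers the objective
            result.append(g + l1_strength)
        elif g - l1_strength > 0.0:
            # at zero, the left-hand derivative is positive:
            # moving into the negative orthant lowers the objective
            result.append(g - l1_strength)
        else:
            # zero is locally optimal for this coordinate
            result.append(0.0)
    return result


example = pseudo_gradient(
    weights=[1.0, -1.0, 0.0, 0.0, 0.0],
    smooth_grad=[0.5, 0.5, -2.0, 2.0, 0.5],
    l1_strength=1.0,
)
print(example)
```

The last coordinate shows the sparsity effect: when a weight is zero and the smooth gradient is smaller in magnitude than the L1 strength, the pseudo-gradient is zero and the weight stays at zero.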