[SPARK-4711] [mllib] [docs] Programming guide advice on choosing optimizer
I have heard requests for the docs to include advice about choosing an optimization method. The programming guide could include a brief statement about this (so the user does not have to read the whole optimization section).

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #3569 from jkbradley/lr-doc and squashes the following commits:

654aeb5 [Joseph K. Bradley] updated section header for mllib-optimization
5035ad0 [Joseph K. Bradley] updated based on review
94f6dec [Joseph K. Bradley] Updated linear methods and optimization docs with quick advice on choosing an optimization method
Commit 27ab0b8a03 (parent 1826372d0a)

@@ -110,12 +110,16 @@ However, L1 regularization can help promote sparsity in weights leading to small

It is not recommended to train models without any regularization,
especially when the number of training examples is small.

+### Optimization
+
+Under the hood, linear methods use convex optimization methods to minimize the objective functions. MLlib uses two methods, SGD and L-BFGS, described in the [optimization section](mllib-optimization.html). Currently, most algorithm APIs support Stochastic Gradient Descent (SGD), and a few support L-BFGS. Refer to [this optimization section](mllib-optimization.html#Choosing-an-Optimization-Method) for guidelines on choosing between optimization methods.
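
For example, a minimal sketch of that choice (assuming an existing `SparkContext` named `sc` and the sample LIBSVM data file that ships with Spark; both are illustrative, not part of this change):

{% highlight scala %}
import org.apache.spark.mllib.classification.{LogisticRegressionWithLBFGS, LogisticRegressionWithSGD}
import org.apache.spark.mllib.util.MLUtils

val training = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

// Most algorithms expose an SGD-based trainer (here: 100 iterations)...
val sgdModel = LogisticRegressionWithSGD.train(training, 100)

// ...while logistic regression also offers an L-BFGS-based trainer,
// which typically converges in fewer iterations.
val lbfgsModel = new LogisticRegressionWithLBFGS().run(training)
{% endhighlight %}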

## Binary classification

[Binary classification](http://en.wikipedia.org/wiki/Binary_classification)
aims to divide items into two categories: positive and negative. MLlib
-supports two linear methods for binary classification: linear support vector
-machines (SVMs) and logistic regression. For both methods, MLlib supports
+supports two linear methods for binary classification: linear Support Vector
+Machines (SVMs) and logistic regression. For both methods, MLlib supports
L1 and L2 regularized variants. The training data set is represented by an RDD
of [LabeledPoint](mllib-data-types.html) in MLlib. Note that, in the
mathematical formulation in this guide, a training label $y$ is denoted as

@@ -123,7 +127,7 @@ either $+1$ (positive) or $-1$ (negative), which is convenient for the

formulation. *However*, the negative label is represented by $0$ in MLlib
instead of $-1$, to be consistent with multiclass labeling.
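
To make that labeling convention concrete, a small sketch (the feature values are arbitrary):

{% highlight scala %}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Positive example, labeled 1.0.
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))

// Negative example: the label is 0.0 in MLlib, not the -1 used in the math above.
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(-1.0, 2.0)))
{% endhighlight %}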

-### Linear support vector machines (SVMs)
+### Linear Support Vector Machines (SVMs)

The [linear SVM](http://en.wikipedia.org/wiki/Support_vector_machine#Linear_SVM)
is a standard method for large-scale classification tasks. It is a linear method as described above in equation `$\eqref{eq:regPrimal}$`, with the loss function in the formulation given by the hinge loss:
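
(The definition itself falls just outside this hunk's context; for reference, the standard hinge loss in the guide's notation is

`\[
L(\wv;\x,y) := \max \{0, 1-y \wv^T \x \},
\]`

where `$\wv$` is the weight vector and `$\x$` the feature vector of an example with label `$y \in \{+1, -1\}$`.)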

@@ -138,6 +138,12 @@ vertical scalability issue (the number of training features) when computing the

explicitly in Newton's method. As a result, L-BFGS often achieves faster convergence than
other first-order optimization methods.

+### Choosing an Optimization Method
+
+[Linear methods](mllib-linear-methods.html) use optimization internally, and some linear methods in MLlib support both SGD and L-BFGS.
+Different optimization methods can have different convergence guarantees depending on the properties of the objective function, and we cannot cover the literature here.
+In general, when L-BFGS is available, we recommend using it instead of SGD since L-BFGS tends to converge faster (in fewer iterations).
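
As a hedged illustration of the practical difference (the class and setter names below are from the MLlib API; the values are arbitrary), an L-BFGS-based trainer is tuned mainly through a convergence tolerance, with no SGD step size to choose:

{% highlight scala %}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

val lbfgsAlg = new LogisticRegressionWithLBFGS()
lbfgsAlg.optimizer
  .setConvergenceTol(1e-4)  // stop once the relative improvement in loss is small
  .setNumCorrections(10)    // history size for the Hessian approximation
// val model = lbfgsAlg.run(training)  // `training` is an RDD[LabeledPoint]
{% endhighlight %}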

## Implementation in MLlib

### Gradient descent and stochastic gradient descent

@@ -168,10 +174,7 @@ descent. All updaters in MLlib use a step size at the t-th step equal to

* `regParam` is the regularization parameter when using L1 or L2 regularization.
* `miniBatchFraction` is the fraction of the total data that is sampled in
  each iteration, to compute the gradient direction.
-
-Available algorithms for gradient descent:
-
-* [GradientDescent](api/scala/index.html#org.apache.spark.mllib.optimization.GradientDescent)
+  * Sampling still requires a pass over the entire RDD, so decreasing `miniBatchFraction` may not speed up optimization much. Users will see the greatest speedup when the gradient is expensive to compute, since only the chosen samples are used to compute the gradient.
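
As a minimal sketch of how these parameters are typically set through an algorithm's `optimizer` field (the values are arbitrary):

{% highlight scala %}
import org.apache.spark.mllib.classification.SVMWithSGD

val svmAlg = new SVMWithSGD()
svmAlg.optimizer
  .setNumIterations(200)      // number of SGD iterations
  .setStepSize(1.0)           // initial step size
  .setRegParam(0.1)           // regularization parameter
  .setMiniBatchFraction(0.5)  // sample half of the data in each iteration
// val model = svmAlg.run(training)  // `training` is an RDD[LabeledPoint]
{% endhighlight %}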

### L-BFGS

L-BFGS is currently only a low-level optimization primitive in `MLlib`. If you want to use L-BFGS in various

@@ -359,13 +362,15 @@ public class LBFGSExample {

{% endhighlight %}
</div>
</div>

-#### Developer's note
+## Developer's notes

Since the Hessian is constructed approximately from previous gradient evaluations,
the objective function cannot be changed during the optimization process.
As a result, Stochastic L-BFGS will not work naively by just using miniBatch;
therefore, we don't provide this until we have a better understanding.

-* `Updater` is a class originally designed for gradient descent which computes
+`Updater` is a class originally designed for gradient descent which computes
the actual gradient descent step. However, we're able to take the gradient and
loss of the regularized objective function for L-BFGS by ignoring the part of the logic
used only for gradient descent, such as the adaptive step size. We will refactor
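
A hedged sketch of that idea (the `compute` signature is from the `Updater` API; calling it with a zero gradient and zero step size is one way to read off just the regularization part):

{% highlight scala %}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.SquaredL2Updater

val updater = new SquaredL2Updater()
val weights = Vectors.dense(0.5, -0.3)

// With a zero gradient and zero step size the weights come back unchanged, and
// the second return value is just the L2 term 0.5 * regParam * ||w||^2 --
// the piece L-BFGS needs, with the step-size logic effectively bypassed.
val (sameWeights, regVal) =
  updater.compute(weights, Vectors.zeros(2), 0.0, 1, 0.1)
{% endhighlight %}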