We implement the Pipelines API for both linear regression and logistic
regression with elastic net regularization.
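
To make the regularization term concrete, here is a minimal sketch of an elastic net penalty in plain Python. It is illustrative only: `elastic_net_penalty`, `reg_param`, and `alpha` are hypothetical names chosen for this example rather than Spark API calls, and the form assumed is the common convention `reg_param * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2)`.

```python
def elastic_net_penalty(weights, reg_param, alpha):
    """Elastic net penalty: a convex mix of an L1 term and an L2 term.

    Assumed convention (not taken from the text above):
    reg_param * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2)
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    l1_norm = sum(abs(w) for w in weights)
    l2_norm_squared = sum(w * w for w in weights)
    return reg_param * (alpha * l1_norm + (1.0 - alpha) * 0.5 * l2_norm_squared)
```

Under this convention, `alpha = 1.0` reduces to a pure L1 (lasso) penalty and `alpha = 0.0` to a pure L2 (ridge) penalty.
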
# Classification
## Logistic regression
Logistic regression is a popular method to predict a binary response. It is a special case of [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) that predicts the probability of the outcome.
For more background and more details about the implementation, refer to the documentation of the [logistic regression in `spark.mllib`](mllib-linear-methods.html#logistic-regression).
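
As an illustration of how a fitted binary logistic regression model turns features into a probability, the sketch below applies the logistic function to a linear margin. The function names and the `threshold` default of 0.5 are assumptions made for this example; this is not the `spark.ml` implementation.

```python
import math


def predict_probability(features, coefficients, intercept=0.0):
    """Return P(label = 1 | features) under a logistic model."""
    margin = intercept + sum(w * x for w, x in zip(coefficients, features))
    # Branch on the sign of the margin so math.exp never overflows.
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    exp_margin = math.exp(margin)
    return exp_margin / (1.0 + exp_margin)


def predict_label(features, coefficients, intercept=0.0, threshold=0.5):
    """Predict 1.0 when the probability exceeds the (assumed) threshold."""
    probability = predict_probability(features, coefficients, intercept)
    return 1.0 if probability > threshold else 0.0
```
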
> The current implementation of logistic regression in `spark.ml` supports only binary classes. Support for multiclass logistic regression will be added in the future.
> When fitting a LogisticRegressionModel without an intercept on a dataset with a constant nonzero column, Spark MLlib outputs zero coefficients for the constant nonzero columns. This behavior is the same as R glmnet but different from LIBSVM.
<div class="codetabs">
<!--- TODO: Add python model summaries once implemented -->
<div data-lang="python" markdown="1">
Logistic regression model summary is not yet supported in Python.
</div>
</div>
## Decision tree classifier
Decision trees are a popular family of classification and regression methods.
More information about the `spark.ml` implementation can be found further in the [section on decision trees](#decision-trees).
**Example**
The following examples load a dataset in LibSVM format, split it into training and test sets, train on the first dataset, and then evaluate on the held-out test set.
We use two feature transformers to prepare the data; these help index categories for the label and categorical features, adding metadata to the `DataFrame` which the Decision Tree algorithm can recognize.
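
The label-indexing step described above can be sketched in plain Python. This assumes indices are assigned by descending label frequency, so the most frequent label gets index 0 (the default ordering documented for Spark's `StringIndexer`); the helper name `fit_label_index` and the alphabetical tie-break are hypothetical choices for this example.

```python
from collections import Counter


def fit_label_index(labels):
    """Map each distinct label to a float index, most frequent label first.

    Ties are broken alphabetically; that tie-break is an assumption of
    this sketch.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], str(label)))
    return {label: float(index) for index, label in enumerate(ordered)}


# Usage: index the raw labels before handing them to a tree learner.
raw_labels = ["b", "a", "b", "c", "a", "b"]
label_index = fit_label_index(raw_labels)
indexed_labels = [label_index[label] for label in raw_labels]
```
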
<div class="codetabs">
<div data-lang="scala" markdown="1">

More details on parameters can be found in the [Scala API documentation](api/scala/index.html#org.apache.spark.ml.classification.DecisionTreeClassifier).

</div>

<div data-lang="python" markdown="1">

More details on parameters can be found in the [Python API documentation](api/python/pyspark.ml.html#pyspark.ml.classification.DecisionTreeClassifier).

</div>
</div>

## Random forest classifier

Random forests are a popular family of classification and regression methods.
More information about the `spark.ml` implementation can be found further in the [section on random forests](#random-forests).
**Example**
The following examples load a dataset in LibSVM format, split it into training and test sets, train on the first dataset, and then evaluate on the held-out test set.
We use two feature transformers to prepare the data; these help index categories for the label and categorical features, adding metadata to the `DataFrame` which the tree-based algorithms can recognize.
<div class="codetabs">
<div data-lang="scala" markdown="1">

Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.classification.RandomForestClassifier) for more details.

</div>
</div>

## Gradient-boosted tree classifier

Gradient-boosted trees (GBTs) are a popular classification and regression method using ensembles of decision trees.
More information about the `spark.ml` implementation can be found further in the [section on GBTs](#gradient-boosted-trees-gbts).
**Example**
The following examples load a dataset in LibSVM format, split it into training and test sets, train on the first dataset, and then evaluate on the held-out test set.
We use two feature transformers to prepare the data; these help index categories for the label and categorical features, adding metadata to the `DataFrame` which the tree-based algorithms can recognize.
<div class="codetabs">
<div data-lang="scala" markdown="1">

Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.classification.GBTClassifier) for more details.

</div>
</div>

## Multilayer perceptron classifier

Multilayer perceptron classifier (MLPC) is a classifier based on the [feedforward artificial neural network](https://en.wikipedia.org/wiki/Feedforward_neural_network).
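Each layer is fully connected to the next layer in the network. Nodes in the input layer represent the input data. All other nodes map inputs to outputs
by taking a linear combination of the inputs with the node's weights `$\wv$` and bias `$\bv$` and then applying an activation function.
This can be written in matrix form for MLPC with `$K+1$` layers as follows:
$$
\mathrm{y}(\mathbf{x}) = \mathrm{f}_K(\ldots \mathrm{f}_2(\wv_2^T \mathrm{f}_1(\wv_1^T \mathbf{x} + \bv_1) + \bv_2) \ldots + \bv_K)
$$
where `$\mathbf{x}$` is the input vector, `$\mathrm{f}_k$` is the activation function of layer `$k$`, and `$\wv_k$` and `$\bv_k$` are that layer's weights and bias.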
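
The layer-by-layer computation described above can be sketched in plain Python. The choice of sigmoid activations for hidden layers and softmax for the output layer is an assumption of this sketch, as are the helper names; each layer is a hypothetical `(weights, bias)` pair with one weight row per output node.

```python
import math


def sigmoid(z):
    """Logistic activation, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def softmax(values):
    """Normalize scores into probabilities that sum to 1."""
    shift = max(values)  # subtracting the max keeps math.exp in range
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def affine(weights, bias, inputs):
    """Linear combination of the inputs plus bias, one value per node."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def forward(layers, inputs):
    """Apply each (weights, bias) layer in turn to the input vector."""
    activations = inputs
    for k, (weights, bias) in enumerate(layers):
        pre_activation = affine(weights, bias, activations)
        if k == len(layers) - 1:
            activations = softmax(pre_activation)  # assumed output activation
        else:
            activations = [sigmoid(z) for z in pre_activation]  # assumed hidden activation
    return activations
```
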

## One-vs-Rest classifier (a.k.a. One-vs-All)

[OneVsRest](http://en.wikipedia.org/wiki/Multiclass_classification#One-vs.-rest) is an example of a machine learning reduction for performing multiclass classification given a base classifier that can perform binary classification efficiently. It is also known as "One-vs-All."
`OneVsRest` is implemented as an `Estimator`. For the base classifier it takes instances of `Classifier` and creates a binary classification problem for each of the k classes. The classifier for class i is trained to predict whether the label is i or not, distinguishing class i from all other classes.
Predictions are made by evaluating each binary classifier; the index of the most confident classifier is output as the label.
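
The reduction described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the `OneVsRest` estimator itself: `binary_relabel` and `one_vs_rest_predict` are hypothetical helpers, and each entry of `scorers` stands in for an already-trained binary classifier that returns a confidence score.

```python
def binary_relabel(labels, positive_class):
    """Build the binary problem for one class: 1.0 if the label is that class, else 0.0."""
    return [1.0 if label == positive_class else 0.0 for label in labels]


def one_vs_rest_predict(scorers, features):
    """Return the index of the most confident binary classifier.

    scorers[i] is assumed to be a callable giving the confidence that
    the label of `features` is class i.
    """
    scores = [score(features) for score in scorers]
    return max(range(len(scores)), key=scores.__getitem__)


# Usage with three stand-in scorers (one per class).
scorers = [lambda x: x[0], lambda x: x[1], lambda x: -1.0]
predicted = one_vs_rest_predict(scorers, [0.2, 0.9])
```
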
**Example**
The example below demonstrates how to load the
[Iris dataset](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/iris.scale), parse it as a DataFrame, and perform multiclass classification using `OneVsRest`. The test error is calculated to measure the algorithm's accuracy.

# Regression

## Linear regression

> When fitting a LinearRegressionModel without an intercept on a dataset with a constant nonzero column using the "l-bfgs" solver, Spark MLlib outputs zero coefficients for the constant nonzero columns. This behavior is the same as R glmnet but different from LIBSVM.

## Generalized linear regression

In contrast to linear regression, where the output is assumed to follow a Gaussian
distribution, [generalized linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLMs) are specifications of linear models where the response variable $Y_i$ follows some
distribution from the [exponential family of distributions](https://en.wikipedia.org/wiki/Exponential_family).
Spark's `GeneralizedLinearRegression` interface
allows for flexible specification of GLMs which can be used for various types of
prediction problems including linear regression, Poisson regression, logistic regression, and others.
Currently in `spark.ml`, only a subset of the exponential family distributions are supported and they are listed
[below](#available-families).
**NOTE**: Spark currently only supports up to 4096 features through its `GeneralizedLinearRegression`
interface, and will throw an exception if this constraint is exceeded. See the [advanced section](ml-advanced) for more details.
Still, for linear and logistic regression, models with an increased number of features can be trained
using the `LinearRegression` and `LogisticRegression` estimators.
GLMs require exponential family distributions that can be written in their "canonical" or "natural" form, also known as
[natural exponential family distributions](https://en.wikipedia.org/wiki/Natural_exponential_family). The form of a natural exponential family distribution is given as:
$$
f\left(y|\theta, \tau \right) = h(y, \tau)\exp{\left( \frac{\theta \cdot y - A(\theta)}{d(\tau)} \right)}
$$
where $\theta$ is the parameter of interest, $\tau$ is a dispersion parameter, and $h$, $A$, and $d$ are known functions determined by the chosen distribution. In a GLM the response variable $Y_i$ is assumed to be drawn from a natural exponential family distribution:
$$
Y_i \sim f\left(\cdot|\theta_i, \tau \right)
$$
where the parameter of interest $\theta_i$ is related to the expected value of the response variable $\mu_i$ by
$$
\mu_i = A'(\theta_i)
$$
Here, $A'(\theta_i)$ is defined by the form of the distribution selected. GLMs also allow specification
of a link function, which defines the relationship between the expected value of the response variable $\mu_i$
and the so-called _linear predictor_ $\eta_i$:
$$
g(\mu_i) = \eta_i = \vec{x_i}^T \cdot \vec{\beta}
$$
Often, the link function is chosen such that $A' = g^{-1}$, which yields a simplified relationship
between the parameter of interest $\theta$ and the linear predictor $\eta$. In this case, the link
function $g(\mu)$ is said to be the "canonical" link function.
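
As a worked check of the canonical-link relationship, the sketch below uses the Poisson family, for which $A(\theta) = e^{\theta}$ and the canonical link is $g(\mu) = \log \mu$. It approximates $A'$ numerically and confirms that applying the link to $\mu = A'(\theta)$ recovers $\theta$, so that $\theta = \eta$ in this case. The function names and the finite-difference step are choices made for this example.

```python
import math


def log_partition(theta):
    """A(theta) for the Poisson family: exp(theta)."""
    return math.exp(theta)


def mean_from_theta(theta, step=1e-6):
    """mu = A'(theta), approximated by a central finite difference."""
    return (log_partition(theta + step) - log_partition(theta - step)) / (2.0 * step)


def log_link(mu):
    """Canonical link for the Poisson family: g(mu) = log(mu)."""
    return math.log(mu)


def linear_predictor(features, beta):
    """eta = x^T beta."""
    return sum(x * b for x, b in zip(features, beta))


# With the canonical link, g(A'(theta)) returns theta (up to numerical error),
# so the natural parameter equals the linear predictor.
theta = linear_predictor([1.0, 2.0], [0.3, 0.2])
eta = log_link(mean_from_theta(theta))
```
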

## Decision tree regression

Decision trees are a popular family of classification and regression methods.
More information about the `spark.ml` implementation can be found further in the [section on decision trees](#decision-trees).
**Example**
The following examples load a dataset in LibSVM format, split it into training and test sets, train on the first dataset, and then evaluate on the held-out test set.
We use a feature transformer to index categorical features, adding metadata to the `DataFrame` which the Decision Tree algorithm can recognize.
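
One way to picture the categorical-feature indexing step is the sketch below. The passage does not name the transformer or its rule, so this assumes a rule like the `maxCategories` parameter of Spark's `VectorIndexer`: a feature with at most `max_categories` distinct values is treated as categorical and its values are mapped to indices. The helper name and the ascending index order are hypothetical choices for this example.

```python
def categorical_feature_maps(rows, max_categories):
    """Return {feature_index: {value: category_index}} for categorical features.

    A feature counts as categorical when it has at most `max_categories`
    distinct values (an assumed rule); all other features are left as
    continuous and do not appear in the result.
    """
    if not rows:
        return {}
    maps = {}
    for j in range(len(rows[0])):
        distinct = sorted({row[j] for row in rows})
        if len(distinct) <= max_categories:
            maps[j] = {value: index for index, value in enumerate(distinct)}
    return maps


# Usage: feature 0 has two distinct values, feature 1 has four.
rows = [[0.0, 1.5], [1.0, 2.5], [0.0, 3.5], [1.0, 4.5]]
feature_maps = categorical_feature_maps(rows, max_categories=2)
```
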
<div class="codetabs">
<div data-lang="scala" markdown="1">

More details on parameters can be found in the [Scala API documentation](api/scala/index.html#org.apache.spark.ml.regression.DecisionTreeRegressor).

</div>
</div>

## Random forest regression

Random forests are a popular family of classification and regression methods.
More information about the `spark.ml` implementation can be found further in the [section on random forests](#random-forests).
**Example**
The following examples load a dataset in LibSVM format, split it into training and test sets, train on the first dataset, and then evaluate on the held-out test set.
We use a feature transformer to index categorical features, adding metadata to the `DataFrame` which the tree-based algorithms can recognize.
<div class="codetabs">
<div data-lang="scala" markdown="1">

Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.regression.RandomForestRegressor) for more details.

</div>
</div>

## Survival regression

> When fitting an AFTSurvivalRegressionModel without an intercept on a dataset with a constant nonzero column, Spark MLlib outputs zero coefficients for the constant nonzero columns. This behavior is different from R survival::survreg.

# Tree Ensembles

The DataFrame API supports two major tree ensemble algorithms: [Random Forests](http://en.wikipedia.org/wiki/Random_forest) and [Gradient-Boosted Trees (GBTs)](http://en.wikipedia.org/wiki/Gradient_boosting).
Both use [`spark.ml` decision trees](ml-classification-regression.html#decision-trees) as their base models.
* more functionality for random forests: estimates of feature importance, as well as the predicted probability of each class (a.k.a. class conditional probabilities) for classification.