[SPARK-11684][R][ML][DOC] Update SparkR glm API doc, user guide and example codes
This PR includes:
* Update SparkR:::glm, SparkR:::summary API docs.
* Update SparkR machine learning user guide and example codes to show:
  * supporting feature interaction in R formula.
  * summary for gaussian GLM model.
  * coefficients for binomial GLM model.

cc mengxr

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9727 from yanboliang/spark-11684.
Parent: e391abdf2c
Commit: e222d75849
@@ -32,6 +32,12 @@ setClass("PipelineModel", representation(model = "jobj"))
 #' @param family Error distribution. "gaussian" -> linear regression, "binomial" -> logistic reg.
 #' @param lambda Regularization parameter
 #' @param alpha Elastic-net mixing parameter (see glmnet's documentation for details)
+#' @param standardize Whether to standardize features before training
+#' @param solver The solver algorithm used for optimization, this can be "l-bfgs", "normal" and
+#'               "auto". "l-bfgs" denotes Limited-memory BFGS which is a limited-memory
+#'               quasi-Newton optimization method. "normal" denotes using Normal Equation as an
+#'               analytical solution to the linear regression problem. The default value is "auto"
+#'               which means that the solver algorithm is selected automatically.
 #' @return a fitted MLlib model
 #' @rdname glm
 #' @export
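As a quick illustration of the `solver` parameter documented above (a sketch, not part of this patch; it assumes a running SparkR shell where `sqlContext` is available):

```r
# Fit with the Normal Equation solver; summary() can then report standard
# errors, t values and p-values, which are not available with "l-bfgs".
df <- createDataFrame(sqlContext, iris)
model <- glm(Sepal_Length ~ Sepal_Width, data = df, family = "gaussian",
             solver = "normal")
summary(model)
```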
@@ -79,9 +85,15 @@ setMethod("predict", signature(object = "PipelineModel"),
 #'
 #' Returns the summary of a model produced by glm(), similarly to R's summary().
 #'
-#' @param x A fitted MLlib model
+#' @param object A fitted MLlib model
-#' @return a list with a 'coefficient' component, which is the matrix of coefficients. See
-#' summary.glm for more information.
+#' @return a list with 'devianceResiduals' and 'coefficients' components for gaussian family
+#'         or a list with 'coefficients' component for binomial family. \cr
+#'         For gaussian family: the 'devianceResiduals' gives the min/max deviance residuals
+#'         of the estimation, the 'coefficients' gives the estimated coefficients and their
+#'         estimated standard errors, t values and p-values. (It is only available when the
+#'         model is fitted by the normal solver.) \cr
+#'         For binomial family: the 'coefficients' gives the estimated coefficients.
+#'         See summary.glm for more information. \cr
 #' @rdname summary
 #' @export
 #' @examples
@@ -286,24 +286,37 @@ head(teenagers)
 
 # Machine Learning
 
-SparkR allows the fitting of generalized linear models over DataFrames using the [glm()](api/R/glm.html) function. Under the hood, SparkR uses MLlib to train a model of the specified family. Currently the gaussian and binomial families are supported. We support a subset of the available R formula operators for model fitting, including '~', '.', '+', and '-'. The example below shows the use of building a gaussian GLM model using SparkR.
+SparkR allows the fitting of generalized linear models over DataFrames using the [glm()](api/R/glm.html) function. Under the hood, SparkR uses MLlib to train a model of the specified family. Currently the gaussian and binomial families are supported. We support a subset of the available R formula operators for model fitting, including '~', '.', ':', '+', and '-'.
+
+The [summary()](api/R/summary.html) function gives the summary of a model produced by [glm()](api/R/glm.html).
+
+* For a gaussian GLM model, it returns a list with 'devianceResiduals' and 'coefficients' components. The 'devianceResiduals' gives the min/max deviance residuals of the estimation; the 'coefficients' gives the estimated coefficients and their estimated standard errors, t values and p-values. (It is only available when the model is fitted by the normal solver.)
+* For a binomial GLM model, it returns a list with a 'coefficients' component which gives the estimated coefficients.
+
+The examples below show how to build gaussian and binomial GLM models using SparkR.
+
+## Gaussian GLM model
 
 <div data-lang="r" markdown="1">
 {% highlight r %}
 # Create the DataFrame
 df <- createDataFrame(sqlContext, iris)
 
-# Fit a linear model over the dataset.
+# Fit a gaussian GLM model over the dataset.
 model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")
 
-# Model coefficients are returned in a similar format to R's native glm().
+# Model summary is returned in a similar format to R's native glm().
 summary(model)
+##$devianceResiduals
+## Min        Max
+## -1.307112  1.412532
+##
 ##$coefficients
-##                    Estimate
+##                    Estimate   Std. Error  t value   Pr(>|t|)
-##(Intercept)         2.2513930
+##(Intercept)         2.251393   0.3697543   6.08889   9.568102e-09
-##Sepal_Width         0.8035609
+##Sepal_Width         0.8035609  0.106339    7.556598  4.187317e-12
-##Species_versicolor  1.4587432
+##Species_versicolor  1.458743   0.1121079   13.01195  0
-##Species_virginica   1.9468169
+##Species_virginica   1.946817   0.100015    19.46525  0
 
 # Make predictions based on the model.
 predictions <- predict(model, newData = df)
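The newly documented ':' formula operator allows a feature interaction to be included in a model. A minimal sketch (not part of this patch; it assumes the same SparkR session and `iris` DataFrame as above):

```r
# ':' adds the interaction term Sepal_Width x Species to the model,
# alongside the main effects supplied with '+'.
model <- glm(Sepal_Length ~ Sepal_Width + Species + Sepal_Width:Species,
             data = df, family = "gaussian")
summary(model)
```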
@@ -317,3 +330,24 @@ head(select(predictions, "Sepal_Length", "prediction"))
 ##6          5.4  5.385281
 {% endhighlight %}
 </div>
+
+## Binomial GLM model
+
+<div data-lang="r" markdown="1">
+{% highlight r %}
+# Create the DataFrame
+df <- createDataFrame(sqlContext, iris)
+training <- filter(df, df$Species != "setosa")
+
+# Fit a binomial GLM model over the dataset.
+model <- glm(Species ~ Sepal_Length + Sepal_Width, data = training, family = "binomial")
+
+# Model coefficients are returned in a similar format to R's native glm().
+summary(model)
+##$coefficients
+##              Estimate
+##(Intercept)   -13.046005
+##Sepal_Length  1.902373
+##Sepal_Width   0.404655
+{% endhighlight %}
+</div>
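The binomial model above can also be used for prediction, mirroring the gaussian example earlier in the guide. A hedged sketch (the column selection is illustrative; exact prediction output depends on the MLlib version):

```r
# Score the training data with the fitted binomial GLM model.
predictions <- predict(model, newData = training)
head(select(predictions, "Species", "prediction"))
```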
@@ -145,6 +145,9 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String
 /**
  * Set the solver algorithm used for optimization.
  * In case of linear regression, this can be "l-bfgs", "normal" and "auto".
+ * "l-bfgs" denotes Limited-memory BFGS which is a limited-memory quasi-Newton
+ * optimization method. "normal" denotes using Normal Equation as an analytical
+ * solution to the linear regression problem.
  * The default value is "auto" which means that the solver algorithm is
  * selected automatically.
  * @group setParam