[SPARK-19386][SPARKR][DOC] Bisecting k-means in SparkR documentation
## What changes were proposed in this pull request?

Update programming guide, example and vignette with Bisecting k-means.

Author: krishnakalyan3 <krishnakalyan3@gmail.com>

Closes #16767 from krishnakalyan3/bisecting-kmeans.
parent 2f523fa0c9
commit 48aafeda7d
@@ -488,6 +488,8 @@ SparkR supports the following machine learning models and algorithms.

#### Clustering

* Bisecting $k$-means

* Gaussian Mixture Model (GMM)

* $k$-means Clustering
@@ -738,6 +740,18 @@ summary(rfModel)
predictions <- predict(rfModel, df)
```

#### Bisecting k-Means

`spark.bisectingKmeans` is a kind of [hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering) using a divisive (or "top-down") approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

```{r}
df <- createDataFrame(iris)
model <- spark.bisectingKmeans(df, Sepal_Length ~ Sepal_Width, k = 4)
summary(model)
fitted <- predict(model, df)
head(select(fitted, "Sepal_Length", "prediction"))
```

#### Gaussian Mixture Model

`spark.gaussianMixture` fits a multivariate [Gaussian Mixture Model](https://en.wikipedia.org/wiki/Mixture_model#Multivariate_Gaussian_mixture_model) (GMM) against a `SparkDataFrame`. [Expectation-Maximization](https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm) (EM) is used to approximate the maximum likelihood estimator (MLE) of the model.
@@ -167,6 +167,13 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.clustering.

{% include_example python/ml/bisecting_k_means_example.py %}
</div>

<div data-lang="r" markdown="1">

Refer to the [R API docs](api/R/spark.bisectingKmeans.html) for more details.

{% include_example r/ml/bisectingKmeans.R %}
</div>
</div>

## Gaussian Mixture Model (GMM)
42
examples/src/main/r/ml/bisectingKmeans.R
Normal file
@@ -0,0 +1,42 @@
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# To run this example use
# ./bin/spark-submit examples/src/main/r/ml/bisectingKmeans.R

# Load SparkR library into your R session
library(SparkR)

# Initialize SparkSession
sparkR.session(appName = "SparkR-ML-bisectingKmeans-example")

# $example on$
irisDF <- createDataFrame(iris)

# Fit bisecting k-means model with four centers
model <- spark.bisectingKmeans(irisDF, Sepal_Length ~ Sepal_Width, k = 4)

# Get fitted result (cluster centers) from a bisecting k-means model
fitted.model <- fitted(model, "centers")

# Summary of the fitted result
summary(fitted.model)

# Fitted values on training data
fitted <- predict(model, irisDF)
head(select(fitted, "Sepal_Length", "prediction"))
# $example off$