---
layout: global
title: Clustering - spark.ml
displayTitle: Clustering - spark.ml
---

This section introduces the Pipeline API (`spark.ml`) for the algorithms covered in [clustering in `spark.mllib`](mllib-clustering.html).

**Table of Contents**

* This will become a table of contents (this text will be scraped).
{:toc}

## K-means

[k-means](http://en.wikipedia.org/wiki/K-means_clustering) is one of the
most commonly used clustering algorithms. It partitions the data points into a
predefined number of clusters. The MLlib implementation includes a parallelized
variant of the [k-means++](http://en.wikipedia.org/wiki/K-means%2B%2B) method
called [kmeans||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).

`KMeans` is implemented as an `Estimator` and generates a `KMeansModel` as the base model.
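
To make the algorithm concrete, here is a minimal single-machine sketch in plain Python. It is not Spark's implementation: it uses the sequential k-means++ seeding rather than the parallel kmeans|| variant, and the function names (`kmeans_pp_init`, `kmeans`), the defaults, and the toy data are assumptions made for illustration only.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_init(points, k, rng):
    """Sequential k-means++ seeding; assumes at least k distinct points."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Weight every point by its squared distance to the nearest center so far.
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


def kmeans(points, k, max_iter=20, seed=1):
    """Lloyd's iterations; returns (centers, assignments)."""
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    assignments = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for each point.
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster
        assignments = new_assignments
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centers, assignments


# Made-up toy data: two well-separated groups of 2-D points.
points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.2), (9.0, 9.0), (9.1, 9.1), (9.2, 9.2)]
centers, assignments = kmeans(points, k=2)
print(centers, assignments)
```

The list of cluster indices returned here plays the same role as the `prediction` output column described below: one integer per input point.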

### Input Columns

<table class="table">
  <thead>
    <tr>
      <th align="left">Param name</th>
      <th align="left">Type(s)</th>
      <th align="left">Default</th>
      <th align="left">Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>featuresCol</td>
      <td>Vector</td>
      <td>"features"</td>
      <td>Feature vector</td>
    </tr>
  </tbody>
</table>

### Output Columns

<table class="table">
  <thead>
    <tr>
      <th align="left">Param name</th>
      <th align="left">Type(s)</th>
      <th align="left">Default</th>
      <th align="left">Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>predictionCol</td>
      <td>Int</td>
      <td>"prediction"</td>
      <td>Index of the predicted cluster</td>
    </tr>
  </tbody>
</table>

### Example

<div class="codetabs">

<div data-lang="scala" markdown="1">
Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.KMeans) for more details.

{% include_example scala/org/apache/spark/examples/ml/KMeansExample.scala %}
</div>

<div data-lang="java" markdown="1">
Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/KMeans.html) for more details.

{% include_example java/org/apache/spark/examples/ml/JavaKMeansExample.java %}
</div>

<div data-lang="python" markdown="1">
Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.clustering.KMeans) for more details.

{% include_example python/ml/kmeans_example.py %}
</div>
</div>

## Latent Dirichlet allocation (LDA)

`LDA` is implemented as an `Estimator` that supports both `EMLDAOptimizer` and `OnlineLDAOptimizer`,
and generates an `LDAModel` as the base model. Expert users may cast an `LDAModel` generated by
`EMLDAOptimizer` to a `DistributedLDAModel` if needed.

<div class="codetabs">

<div data-lang="scala" markdown="1">

Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.LDA) for more details.

{% include_example scala/org/apache/spark/examples/ml/LDAExample.scala %}
</div>

<div data-lang="java" markdown="1">

Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/LDA.html) for more details.

{% include_example java/org/apache/spark/examples/ml/JavaLDAExample.java %}
</div>

<div data-lang="python" markdown="1">

Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.clustering.LDA) for more details.

{% include_example python/ml/lda_example.py %}
</div>
</div>

## Bisecting k-means

Bisecting k-means is a kind of [hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering) using a
divisive (or "top-down") approach: all observations start in one cluster, and splits are performed recursively as one
moves down the hierarchy.

Bisecting k-means can often be much faster than regular k-means, but it will generally produce a different clustering.

`BisectingKMeans` is implemented as an `Estimator` and generates a `BisectingKMeansModel` as the base model.
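
The divisive procedure described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not Spark's implementation: the rule used here to pick the next cluster to split (largest sum of squared errors) and the helper names (`split_in_two`, `bisecting_kmeans`) are choices made for this sketch, and the points are assumed to be distinct.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(xs) / len(points) for xs in zip(*points))


def sse(points):
    """Sum of squared distances from each point to the cluster mean."""
    center = mean(points)
    return sum(squared_distance(p, center) for p in points)


def split_in_two(points, rng, max_iter=20):
    """Split one cluster with a plain 2-means loop; assumes distinct points."""
    centers = rng.sample(points, 2)
    left, right = [], []
    for _ in range(max_iter):
        new_left, new_right = [], []
        for p in points:
            if squared_distance(p, centers[0]) <= squared_distance(p, centers[1]):
                new_left.append(p)
            else:
                new_right.append(p)
        # Stop on an empty side or when the split no longer changes.
        if not new_left or not new_right or (new_left, new_right) == (left, right):
            break
        left, right = new_left, new_right
        centers = [mean(left), mean(right)]
    return left, right


def bisecting_kmeans(points, k, seed=1):
    """Top-down clustering: start with one cluster, split until there are k."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        # Only clusters with at least two points can be split.
        candidates = [c for c in clusters if len(c) > 1]
        if not candidates:
            break
        # Assumption of this sketch: split the cluster with the largest SSE.
        target = max(candidates, key=sse)
        left, right = split_in_two(target, rng)
        if not left or not right:
            break
        clusters.remove(target)
        clusters.extend([left, right])
    return clusters


# Made-up toy data: two well-separated groups of 2-D points.
points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.2), (9.0, 9.0), (9.1, 9.1), (9.2, 9.2)]
for cluster in bisecting_kmeans(points, k=2):
    print(cluster)
```

Because every split only looks at the points of a single cluster, the result depends on the order of earlier splits, which is one reason the final clustering generally differs from what regular k-means would produce.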

### Example

<div class="codetabs">

<div data-lang="scala" markdown="1">
Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.BisectingKMeans) for more details.

{% include_example scala/org/apache/spark/examples/ml/BisectingKMeansExample.scala %}
</div>

<div data-lang="java" markdown="1">
Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/BisectingKMeans.html) for more details.

{% include_example java/org/apache/spark/examples/ml/JavaBisectingKMeansExample.java %}
</div>

<div data-lang="python" markdown="1">
Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.clustering.BisectingKMeans) for more details.

{% include_example python/ml/bisecting_k_means_example.py %}
</div>
</div>

## Gaussian Mixture Model (GMM)

A [Gaussian Mixture Model](http://en.wikipedia.org/wiki/Mixture_model#Multivariate_Gaussian_mixture_model)
represents a composite distribution in which each point is drawn from one of *k* Gaussian sub-distributions,
each with its own probability. The `spark.ml` implementation uses the
[expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
algorithm to estimate the maximum-likelihood model from a set of samples.

`GaussianMixture` is implemented as an `Estimator` and generates a `GaussianMixtureModel` as the base
model.
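
As a rough illustration of expectation-maximization, the sketch below fits a mixture of one-dimensional Gaussians in plain Python. It is a simplification and not Spark's implementation: `spark.ml` works on multivariate feature vectors, while this example uses scalars, and the initialization, the variance floor `min_var`, the stopping rule, and all function names are assumptions of the sketch.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def responsibilities(x, weights, means, variances):
    """Posterior probability of each component for one point (the E-step)."""
    joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]


def fit_gmm_1d(data, k, max_iter=100, tol=1e-6, min_var=1e-6):
    """EM for a 1-D mixture of k >= 2 Gaussians; returns (weights, means, variances)."""
    n = len(data)
    lo, hi = min(data), max(data)
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    # Deterministic start (a choice of this sketch): evenly spaced means,
    # equal weights, and the overall variance for every component.
    means = [lo + (hi - lo) * j / (k - 1) for j in range(k)]
    variances = [max(overall_var, min_var)] * k
    weights = [1.0 / k] * k
    previous = None
    for _ in range(max_iter):
        # E-step: soft-assign every point to every component.
        resp = [responsibilities(x, weights, means, variances) for x in data]
        # M-step: re-estimate each component from its weighted points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj == 0.0:
                continue  # component received no mass; leave it unchanged
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, min_var)  # floor avoids a collapsed component
        log_likelihood = sum(
            math.log(sum(w * gaussian_pdf(x, m, v)
                         for w, m, v in zip(weights, means, variances)))
            for x in data
        )
        if previous is not None and abs(log_likelihood - previous) < tol:
            break  # the log-likelihood has stopped improving
        previous = log_likelihood
    return weights, means, variances


def predict(x, weights, means, variances):
    """Return (prediction, probability) for one point."""
    probability = responsibilities(x, weights, means, variances)
    prediction = max(range(len(probability)), key=probability.__getitem__)
    return prediction, probability


# Made-up toy data: two well-separated groups of scalars.
data = [0.0, 0.1, 0.2, 9.0, 9.1, 9.2]
weights, means, variances = fit_gmm_1d(data, k=2)
print(means, predict(0.05, weights, means, variances))
```

The pair returned by `predict` is meant to mirror the two output columns listed below: a per-component probability vector, and the index of the most probable component as the prediction.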

### Input Columns

<table class="table">
  <thead>
    <tr>
      <th align="left">Param name</th>
      <th align="left">Type(s)</th>
      <th align="left">Default</th>
      <th align="left">Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>featuresCol</td>
      <td>Vector</td>
      <td>"features"</td>
      <td>Feature vector</td>
    </tr>
  </tbody>
</table>

### Output Columns

<table class="table">
  <thead>
    <tr>
      <th align="left">Param name</th>
      <th align="left">Type(s)</th>
      <th align="left">Default</th>
      <th align="left">Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>predictionCol</td>
      <td>Int</td>
      <td>"prediction"</td>
      <td>Index of the predicted cluster</td>
    </tr>
    <tr>
      <td>probabilityCol</td>
      <td>Vector</td>
      <td>"probability"</td>
      <td>Probability that the point belongs to each cluster</td>
    </tr>
  </tbody>
</table>

### Example

<div class="codetabs">

<div data-lang="scala" markdown="1">
Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.GaussianMixture) for more details.

{% include_example scala/org/apache/spark/examples/ml/GaussianMixtureExample.scala %}
</div>

<div data-lang="java" markdown="1">
Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/GaussianMixture.html) for more details.

{% include_example java/org/apache/spark/examples/ml/JavaGaussianMixtureExample.java %}
</div>

<div data-lang="python" markdown="1">
Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.clustering.GaussianMixture) for more details.

{% include_example python/ml/gaussian_mixture_example.py %}
</div>
</div>