spark-instrumented-optimizer/docs/ml-clustering.md

---
layout: global
title: Clustering
displayTitle: Clustering
---

This page describes clustering algorithms in MLlib.
The [guide for clustering in the RDD-based API](mllib-clustering.html) also has relevant information
about these algorithms.

**Table of Contents**

* This will become a table of contents (this text will be scraped).
{:toc}

## K-means

[k-means](http://en.wikipedia.org/wiki/K-means_clustering) is one of the
most commonly used clustering algorithms that clusters the data points into a
predefined number of clusters. The MLlib implementation includes a parallelized
variant of the [k-means++](http://en.wikipedia.org/wiki/K-means%2B%2B) method
called [kmeans||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).

`KMeans` is implemented as an `Estimator` and generates a `KMeansModel` as the base model.

### Input Columns

<table class="table">
  <thead>
    <tr>
      <th align="left">Param name</th>
      <th align="left">Type(s)</th>
      <th align="left">Default</th>
      <th align="left">Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>featuresCol</td>
      <td>Vector</td>
      <td>"features"</td>
      <td>Feature vector</td>
    </tr>
  </tbody>
</table>

### Output Columns

<table class="table">
  <thead>
    <tr>
      <th align="left">Param name</th>
      <th align="left">Type(s)</th>
      <th align="left">Default</th>
      <th align="left">Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>predictionCol</td>
      <td>Int</td>
      <td>"prediction"</td>
      <td>Predicted cluster center</td>
    </tr>
  </tbody>
</table>

**Examples**

<div class="codetabs">

<div data-lang="scala" markdown="1">
Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.KMeans) for more details.

{% include_example scala/org/apache/spark/examples/ml/KMeansExample.scala %}
</div>

<div data-lang="java" markdown="1">
Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/KMeans.html) for more details.

{% include_example java/org/apache/spark/examples/ml/JavaKMeansExample.java %}
</div>

<div data-lang="python" markdown="1">
Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.clustering.KMeans) for more details.

{% include_example python/ml/kmeans_example.py %}
</div>

<div data-lang="r" markdown="1">

Refer to the [R API docs](api/R/spark.kmeans.html) for more details.

{% include_example r/ml/kmeans.R %}
</div>

</div>

## Latent Dirichlet allocation (LDA)

`LDA` is implemented as an `Estimator` that supports both `EMLDAOptimizer` and `OnlineLDAOptimizer`,
and generates a `LDAModel` as the base model. Expert users may cast a `LDAModel` generated by
`EMLDAOptimizer` to a `DistributedLDAModel` if needed.

**Examples**

<div class="codetabs">

<div data-lang="scala" markdown="1">

Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.LDA) for more details.

{% include_example scala/org/apache/spark/examples/ml/LDAExample.scala %}
</div>

<div data-lang="java" markdown="1">

Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/LDA.html) for more details.

{% include_example java/org/apache/spark/examples/ml/JavaLDAExample.java %}
</div>

<div data-lang="python" markdown="1">

Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.clustering.LDA) for more details.

{% include_example python/ml/lda_example.py %}
</div>

<div data-lang="r" markdown="1">

Refer to the [R API docs](api/R/spark.lda.html) for more details.

{% include_example r/ml/lda.R %}
</div>

</div>

## Bisecting k-means

Bisecting k-means is a kind of [hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering) using a
divisive (or "top-down") approach: all observations start in one cluster, and splits are performed recursively as one
moves down the hierarchy.

Bisecting K-means can often be much faster than regular K-means, but it will generally produce a different clustering.

`BisectingKMeans` is implemented as an `Estimator` and generates a `BisectingKMeansModel` as the base model.

**Examples**

<div class="codetabs">

<div data-lang="scala" markdown="1">
Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.BisectingKMeans) for more details.

{% include_example scala/org/apache/spark/examples/ml/BisectingKMeansExample.scala %}
</div>

<div data-lang="java" markdown="1">
Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/BisectingKMeans.html) for more details.

{% include_example java/org/apache/spark/examples/ml/JavaBisectingKMeansExample.java %}
</div>

<div data-lang="python" markdown="1">
Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.clustering.BisectingKMeans) for more details.

{% include_example python/ml/bisecting_k_means_example.py %}
</div>

<div data-lang="r" markdown="1">

Refer to the [R API docs](api/R/spark.bisectingKmeans.html) for more details. 

{% include_example r/ml/bisectingKmeans.R %}
</div>
</div>

## Gaussian Mixture Model (GMM)

A [Gaussian Mixture Model](http://en.wikipedia.org/wiki/Mixture_model#Multivariate_Gaussian_mixture_model)
represents a composite distribution whereby points are drawn from one of *k* Gaussian sub-distributions,
each with its own probability. The `spark.ml` implementation uses the
[expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
algorithm to induce the maximum-likelihood model given a set of samples.

`GaussianMixture` is implemented as an `Estimator` and generates a `GaussianMixtureModel` as the base
model.

### Input Columns

<table class="table">
  <thead>
    <tr>
      <th align="left">Param name</th>
      <th align="left">Type(s)</th>
      <th align="left">Default</th>
      <th align="left">Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>featuresCol</td>
      <td>Vector</td>
      <td>"features"</td>
      <td>Feature vector</td>
    </tr>
  </tbody>
</table>

### Output Columns

<table class="table">
  <thead>
    <tr>
      <th align="left">Param name</th>
      <th align="left">Type(s)</th>
      <th align="left">Default</th>
      <th align="left">Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>predictionCol</td>
      <td>Int</td>
      <td>"prediction"</td>
      <td>Predicted cluster center</td>
    </tr>
    <tr>
      <td>probabilityCol</td>
      <td>Vector</td>
      <td>"probability"</td>
      <td>Probability of each cluster</td>
    </tr>
  </tbody>
</table>

**Examples**

<div class="codetabs">

<div data-lang="scala" markdown="1">
Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.GaussianMixture) for more details.

{% include_example scala/org/apache/spark/examples/ml/GaussianMixtureExample.scala %}
</div>

<div data-lang="java" markdown="1">
Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/GaussianMixture.html) for more details.

{% include_example java/org/apache/spark/examples/ml/JavaGaussianMixtureExample.java %}
</div>

<div data-lang="python" markdown="1">
Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.clustering.GaussianMixture) for more details.

{% include_example python/ml/gaussian_mixture_example.py %}
</div>

<div data-lang="r" markdown="1">

Refer to the [R API docs](api/R/spark.gaussianMixture.html) for more details.

{% include_example r/ml/gaussianMixture.R %}
</div>

</div>

## Power Iteration Clustering (PIC)

Power Iteration Clustering (PIC) is  a scalable graph clustering algorithm
developed by [Lin and Cohen](http://www.cs.cmu.edu/~frank/papers/icml2010-pic-final.pdf).
From the abstract: PIC finds a very low-dimensional embedding of a dataset
using truncated power iteration on a normalized pair-wise similarity matrix of the data.

`spark.ml`'s PowerIterationClustering implementation takes the following parameters:

* `k`: the number of clusters to create
* `initMode`: param for the initialization algorithm
* `maxIter`: param for maximum number of iterations
* `srcCol`: param for the name of the input column for source vertex IDs
* `dstCol`: name of the input column for destination vertex IDs
* `weightCol`: Param for weight column name

**Examples**

<div class="codetabs">

<div data-lang="scala" markdown="1">
Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.PowerIterationClustering) for more details.

{% include_example scala/org/apache/spark/examples/ml/PowerIterationClusteringExample.scala %}
</div>

<div data-lang="java" markdown="1">
Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/PowerIterationClustering.html) for more details.

{% include_example java/org/apache/spark/examples/ml/JavaPowerIterationClusteringExample.java %}
</div>

<div data-lang="python" markdown="1">
Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.clustering.PowerIterationClustering) for more details.

{% include_example python/ml/power_iteration_clustering_example.py %}
</div>

<div data-lang="r" markdown="1">

Refer to the [R API docs](api/R/spark.powerIterationClustering.html) for more details.

{% include_example r/ml/powerIterationClustering.R %}
</div>

</div>