---
layout: global
title: Clustering - spark.ml
displayTitle: Clustering - spark.ml
---
In this section, we introduce the pipeline API for [clustering in mllib](mllib-clustering.html).
**Table of Contents**
* This will become a table of contents (this text will be scraped).
{:toc}
## K-means
[k-means](http://en.wikipedia.org/wiki/K-means_clustering) is one of the
most commonly used clustering algorithms. It partitions the data points into a
predefined number of clusters. The MLlib implementation includes a parallelized
variant of the [k-means++](http://en.wikipedia.org/wiki/K-means%2B%2B) method
called [k-means||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).
`KMeans` is implemented as an `Estimator` and generates a `KMeansModel` as the base model.
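
The assignment and update loop behind k-means can be illustrated with a minimal, single-machine sketch in plain Python. This is not the MLlib implementation: it assumes plain random initialization rather than k-means||, and the helper names (`squared_distance`, `closest_center`, `kmeans`) are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def closest_center(point, centers):
    """Index of the center nearest to `point` (ties go to the lowest index)."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))


def kmeans(points, k, max_iter=20, seed=1):
    """Lloyd-style k-means; returns the list of fitted cluster centers."""
    rng = random.Random(seed)
    # Assumption: pick k distinct data points as initial centers (not k-means||).
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[closest_center(p, centers)].append(p)
        # Update step: move each center to the mean of its points.
        # An empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # no center moved, so the loop has converged
            break
        centers = new_centers
    return centers


points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.2), (9.0, 9.0), (9.1, 9.1), (9.2, 9.2)]
centers = kmeans(points, k=2)
# One integer cluster index per input point.
predictions = [closest_center(p, centers) for p in points]
```

In this sketch, `points` plays the role of the feature vectors and `predictions` holds one integer cluster index per point, which mirrors how the fitted model maps a feature column to an integer prediction column.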
### Input Columns
<table class="table">
<thead>
<tr>
<th align="left">Param name</th>
<th align="left">Type(s)</th>
<th align="left">Default</th>
<th align="left">Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>featuresCol</td>
<td>Vector</td>
<td>"features"</td>
<td>Feature vector</td>
</tr>
</tbody>
</table>
### Output Columns
<table class="table">
<thead>
<tr>
<th align="left">Param name</th>
<th align="left">Type(s)</th>
<th align="left">Default</th>
<th align="left">Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>predictionCol</td>
<td>Int</td>
<td>"prediction"</td>
<td>Index of the predicted cluster center</td>
</tr>
</tbody>
</table>
### Example
<div class="codetabs">
<div data-lang="scala" markdown="1">
Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.KMeans) for more details.
{% include_example scala/org/apache/spark/examples/ml/KMeansExample.scala %}
</div>
<div data-lang="java" markdown="1">
Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/KMeans.html) for more details.
{% include_example java/org/apache/spark/examples/ml/JavaKMeansExample.java %}
</div>
<div data-lang="python" markdown="1">
Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.clustering.KMeans) for more details.
{% include_example python/ml/kmeans_example.py %}
</div>
</div>
## Latent Dirichlet allocation (LDA)
`LDA` is implemented as an `Estimator` that supports both `EMLDAOptimizer` and `OnlineLDAOptimizer`,
and generates an `LDAModel` as the base model. Expert users may cast an `LDAModel` generated by
`EMLDAOptimizer` to a `DistributedLDAModel` if needed.
<div class="codetabs">
<div data-lang="scala" markdown="1">
Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.LDA) for more details.
{% include_example scala/org/apache/spark/examples/ml/LDAExample.scala %}
</div>
<div data-lang="java" markdown="1">
Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/LDA.html) for more details.
{% include_example java/org/apache/spark/examples/ml/JavaLDAExample.java %}
</div>
<div data-lang="python" markdown="1">
Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.clustering.LDA) for more details.
{% include_example python/ml/lda_example.py %}
</div>
</div>
## Bisecting k-means
Bisecting k-means is a kind of [hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering) using a
divisive (or "top-down") approach: all observations start in one cluster, and splits are performed recursively as one
moves down the hierarchy.
Bisecting k-means can often be much faster than regular k-means, but it will generally produce a different clustering.
`BisectingKMeans` is implemented as an `Estimator` and generates a `BisectingKMeansModel` as the base model.
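
The divisive, top-down idea can be illustrated with a small single-machine sketch in plain Python. It is a simplification under stated assumptions, not the MLlib implementation: the rule for choosing which cluster to split (largest sum of squared errors) is an assumption made for illustration, and the helper names (`two_means`, `bisecting_kmeans`, and so on) are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of tuples."""
    return tuple(sum(dim) / len(points) for dim in zip(*points))


def sse(points):
    """Sum of squared distances from each point to the cluster mean."""
    center = mean(points)
    return sum(squared_distance(p, center) for p in points)


def two_means(points, rng, max_iter=20):
    """Split `points` into two groups with Lloyd-style iterations (k = 2)."""
    centers = rng.sample(points, 2)
    halves = [list(points), []]
    for _ in range(max_iter):
        halves = [[], []]
        for p in points:
            d0 = squared_distance(p, centers[0])
            d1 = squared_distance(p, centers[1])
            halves[0 if d0 <= d1 else 1].append(p)
        new_centers = [mean(h) if h else centers[i] for i, h in enumerate(halves)]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return halves


def bisecting_kmeans(points, k, seed=1):
    """Start with one cluster and bisect until there are k clusters."""
    rng = random.Random(seed)
    clusters = [list(points)]  # all observations start in one cluster
    while len(clusters) < k:
        # Assumption: split the divisible cluster with the largest SSE.
        idx = max(
            range(len(clusters)),
            key=lambda i: sse(clusters[i]) if len(clusters[i]) > 1 else -1.0,
        )
        if len(clusters[idx]) < 2:
            break  # nothing left that can be split
        left, right = two_means(clusters[idx], rng)
        if not left or not right:
            break  # the split failed to separate the points
        clusters[idx:idx + 1] = [left, right]  # replace the parent with its children
    return clusters


points = [
    (0.0, 0.0), (0.1, 0.1), (0.2, 0.2),
    (5.0, 5.0), (5.1, 5.1), (5.2, 5.2),
    (10.0, 10.0), (10.1, 10.1), (10.2, 10.2),
]
clusters = bisecting_kmeans(points, k=3)
```

Each pass replaces one parent cluster with its two children, so the result is the set of leaves of a binary tree rather than the outcome of a single global k-means fit. This is one reason the clustering generally differs from regular k-means.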
### Example
<div class="codetabs">
<div data-lang="scala" markdown="1">
Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.BisectingKMeans) for more details.
{% include_example scala/org/apache/spark/examples/ml/BisectingKMeansExample.scala %}
</div>
<div data-lang="java" markdown="1">
Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/BisectingKMeans.html) for more details.
{% include_example java/org/apache/spark/examples/ml/JavaBisectingKMeansExample.java %}
</div>
<div data-lang="python" markdown="1">
Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.clustering.BisectingKMeans) for more details.
{% include_example python/ml/bisecting_k_means_example.py %}
</div>
</div>