---
layout: global
title: Clustering
displayTitle: Clustering
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License. You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

This page describes clustering algorithms in MLlib.
The [guide for clustering in the RDD-based API](mllib-clustering.html) also has relevant information
about these algorithms.

**Table of Contents**

* This will become a table of contents (this text will be scraped).
{:toc}

## K-means

[k-means](http://en.wikipedia.org/wiki/K-means_clustering) is one of the
most commonly used clustering algorithms. It groups the data points into a
predefined number of clusters. The MLlib implementation includes a parallelized
variant of the [k-means++](http://en.wikipedia.org/wiki/K-means%2B%2B) method
called [kmeans||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).

`KMeans` is implemented as an `Estimator` and generates a `KMeansModel` as the base model.
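
To make the algorithm concrete, the following is a minimal single-machine sketch in plain Python of k-means++ seeding followed by Lloyd iterations. It is an illustration only: it is not MLlib's distributed kmeans|| code, and the function names (`kmeans_pp_init`, `lloyd`) and the toy data are assumptions made for this example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: each new center is sampled with probability
    proportional to its squared distance from the nearest chosen center.
    Assumes the data contains at least k distinct points."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]


def lloyd(points, k, max_iter=20, seed=1):
    """Alternate assignment and mean-update steps until the centers stop moving."""
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)


# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.2), (9.0, 9.0), (9.1, 9.1), (9.2, 9.2)]
centers, labels = lloyd(points, k=2)
print(centers, labels)
```

The returned `labels` play the role of the integer cluster index that `KMeansModel` writes to its prediction column.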

### Input Columns

<table class="table">
  <thead>
    <tr>
      <th align="left">Param name</th>
      <th align="left">Type(s)</th>
      <th align="left">Default</th>
      <th align="left">Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>featuresCol</td>
      <td>Vector</td>
      <td>"features"</td>
      <td>Feature vector</td>
    </tr>
  </tbody>
</table>

### Output Columns

<table class="table">
  <thead>
    <tr>
      <th align="left">Param name</th>
      <th align="left">Type(s)</th>
      <th align="left">Default</th>
      <th align="left">Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>predictionCol</td>
      <td>Int</td>
      <td>"prediction"</td>
      <td>Predicted cluster index</td>
    </tr>
  </tbody>
</table>

**Examples**

<div class="codetabs">

<div data-lang="scala" markdown="1">
Refer to the [Scala API docs](api/scala/org/apache/spark/ml/clustering/KMeans.html) for more details.

{% include_example scala/org/apache/spark/examples/ml/KMeansExample.scala %}
</div>

<div data-lang="java" markdown="1">
Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/KMeans.html) for more details.

{% include_example java/org/apache/spark/examples/ml/JavaKMeansExample.java %}
</div>

<div data-lang="python" markdown="1">
Refer to the [Python API docs](api/python/reference/api/pyspark.ml.clustering.KMeans.html) for more details.

{% include_example python/ml/kmeans_example.py %}
</div>

<div data-lang="r" markdown="1">

Refer to the [R API docs](api/R/spark.kmeans.html) for more details.

{% include_example r/ml/kmeans.R %}
</div>

</div>

## Latent Dirichlet allocation (LDA)

`LDA` is implemented as an `Estimator` that supports both `EMLDAOptimizer` and `OnlineLDAOptimizer`,
and generates an `LDAModel` as the base model. Expert users may cast an `LDAModel` generated by
`EMLDAOptimizer` to a `DistributedLDAModel` if needed.

**Examples**

<div class="codetabs">

<div data-lang="scala" markdown="1">

Refer to the [Scala API docs](api/scala/org/apache/spark/ml/clustering/LDA.html) for more details.

{% include_example scala/org/apache/spark/examples/ml/LDAExample.scala %}
</div>

<div data-lang="java" markdown="1">

Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/LDA.html) for more details.

{% include_example java/org/apache/spark/examples/ml/JavaLDAExample.java %}
</div>

<div data-lang="python" markdown="1">

Refer to the [Python API docs](api/python/reference/api/pyspark.ml.clustering.LDA.html) for more details.

{% include_example python/ml/lda_example.py %}
</div>

<div data-lang="r" markdown="1">

Refer to the [R API docs](api/R/spark.lda.html) for more details.

{% include_example r/ml/lda.R %}
</div>

</div>

## Bisecting k-means

Bisecting k-means is a kind of [hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering) using a
divisive (or "top-down") approach: all observations start in one cluster, and splits are performed recursively as one
moves down the hierarchy.

Bisecting k-means can often be much faster than regular k-means, but it will generally produce a different clustering.

`BisectingKMeans` is implemented as an `Estimator` and generates a `BisectingKMeansModel` as the base model.
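
The divisive approach described above can be sketched in a few lines of plain Python. This is a toy illustration, not the MLlib implementation. The rule for choosing which cluster to split next (largest sum of squared errors), the deterministic seeding inside `bisect`, and all function names are assumptions of this sketch.

```python
def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sse(cluster):
    """Sum of squared distances from each point to the cluster mean."""
    m = mean(cluster)
    return sum(sq_dist(p, m) for p in cluster)


def bisect(cluster, max_iter=20):
    """Split one cluster in two with a small 2-means loop.
    Assumes the cluster contains at least two distinct points."""
    # Deterministic seeding (sketch's choice): the point farthest from the
    # mean, and the point farthest from that first seed.
    m = mean(cluster)
    c0 = max(cluster, key=lambda p: sq_dist(p, m))
    c1 = max(cluster, key=lambda p: sq_dist(p, c0))
    left, right = [c0], [c1]
    for _ in range(max_iter):
        new_left = [p for p in cluster if sq_dist(p, c0) <= sq_dist(p, c1)]
        new_right = [p for p in cluster if sq_dist(p, c0) > sq_dist(p, c1)]
        if not new_left or not new_right:
            break  # keep the last valid split
        left, right = new_left, new_right
        new_c0, new_c1 = mean(left), mean(right)
        if (new_c0, new_c1) == (c0, c1):
            break  # centers stopped moving
        c0, c1 = new_c0, new_c1
    return left, right


def bisecting_kmeans(points, k):
    clusters = [list(points)]  # all observations start in one cluster
    while len(clusters) < k:
        # Split the cluster with the largest SSE next (sketch's choice).
        target = max(clusters, key=sse)
        if sse(target) == 0.0:
            break  # nothing left that can be split
        clusters.remove(target)
        clusters.extend(bisect(target))
    return clusters


# Hypothetical toy data: three groups of 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (5.0, 5.0), (5.0, 6.0), (6.0, 5.0),
          (20.0, 20.0), (20.0, 21.0), (21.0, 20.0)]
clusters = bisecting_kmeans(points, k=3)
print(clusters)
```

Each split only touches the points of one cluster, which is one intuition for why the bisecting variant can be cheaper than running k-means over all points for every center.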

**Examples**

<div class="codetabs">

<div data-lang="scala" markdown="1">
Refer to the [Scala API docs](api/scala/org/apache/spark/ml/clustering/BisectingKMeans.html) for more details.

{% include_example scala/org/apache/spark/examples/ml/BisectingKMeansExample.scala %}
</div>

<div data-lang="java" markdown="1">
Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/BisectingKMeans.html) for more details.

{% include_example java/org/apache/spark/examples/ml/JavaBisectingKMeansExample.java %}
</div>

<div data-lang="python" markdown="1">
Refer to the [Python API docs](api/python/reference/api/pyspark.ml.clustering.BisectingKMeans.html) for more details.

{% include_example python/ml/bisecting_k_means_example.py %}
</div>

<div data-lang="r" markdown="1">

Refer to the [R API docs](api/R/spark.bisectingKmeans.html) for more details.

{% include_example r/ml/bisectingKmeans.R %}
</div>
</div>

## Gaussian Mixture Model (GMM)

A [Gaussian Mixture Model](http://en.wikipedia.org/wiki/Mixture_model#Multivariate_Gaussian_mixture_model)
represents a composite distribution whereby points are drawn from one of *k* Gaussian sub-distributions,
each with its own probability. The `spark.ml` implementation uses the
[expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
algorithm to induce the maximum-likelihood model given a set of samples.

`GaussianMixture` is implemented as an `Estimator` and generates a `GaussianMixtureModel` as the base
model.
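
As a rough illustration of expectation-maximization, the sketch below fits a two-component mixture to one-dimensional data in plain Python. It is a simplification, not the `spark.ml` code: the real estimator works on multivariate feature vectors, and the function name `em_gmm_1d`, the starting values, and the variance floor are assumptions of this example.

```python
import math


def normal_pdf(x, mu, var):
    """Density of a univariate Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(xs, mus, variances, weights, n_iter=50):
    """Run n_iter EM steps and return the fitted parameters together with
    the responsibilities computed in the last E-step."""
    mus, variances, weights = list(mus), list(variances), list(weights)
    k = len(mus)
    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in xs:
            dens = [weights[j] * normal_pdf(x, mus[j], variances[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a degenerate component
    return weights, mus, variances, resp


# Hypothetical toy data: samples around 1.0 and around 5.0.
xs = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
weights, mus, variances, resp = em_gmm_1d(
    xs, mus=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5])

# Hard assignment: the component with the highest responsibility.
prediction = [r.index(max(r)) for r in resp]
print(mus, weights, prediction)
```

In this sketch, `resp` corresponds loosely to the per-cluster probability vector and `prediction` to the hard cluster assignment that a fitted model produces for each row.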

### Input Columns

<table class="table">
  <thead>
    <tr>
      <th align="left">Param name</th>
      <th align="left">Type(s)</th>
      <th align="left">Default</th>
      <th align="left">Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>featuresCol</td>
      <td>Vector</td>
      <td>"features"</td>
      <td>Feature vector</td>
    </tr>
  </tbody>
</table>

### Output Columns

<table class="table">
  <thead>
    <tr>
      <th align="left">Param name</th>
      <th align="left">Type(s)</th>
      <th align="left">Default</th>
      <th align="left">Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>predictionCol</td>
      <td>Int</td>
      <td>"prediction"</td>
      <td>Predicted cluster index</td>
    </tr>
    <tr>
      <td>probabilityCol</td>
      <td>Vector</td>
      <td>"probability"</td>
      <td>Probability of each cluster</td>
    </tr>
  </tbody>
</table>

**Examples**

<div class="codetabs">

<div data-lang="scala" markdown="1">
Refer to the [Scala API docs](api/scala/org/apache/spark/ml/clustering/GaussianMixture.html) for more details.

{% include_example scala/org/apache/spark/examples/ml/GaussianMixtureExample.scala %}
</div>

<div data-lang="java" markdown="1">
Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/GaussianMixture.html) for more details.

{% include_example java/org/apache/spark/examples/ml/JavaGaussianMixtureExample.java %}
</div>

<div data-lang="python" markdown="1">
Refer to the [Python API docs](api/python/reference/api/pyspark.ml.clustering.GaussianMixture.html) for more details.

{% include_example python/ml/gaussian_mixture_example.py %}
</div>

<div data-lang="r" markdown="1">

Refer to the [R API docs](api/R/spark.gaussianMixture.html) for more details.

{% include_example r/ml/gaussianMixture.R %}
</div>

</div>

## Power Iteration Clustering (PIC)

Power Iteration Clustering (PIC) is a scalable graph clustering algorithm
developed by [Lin and Cohen](http://www.cs.cmu.edu/~frank/papers/icml2010-pic-final.pdf).
From the abstract: PIC finds a very low-dimensional embedding of a dataset
using truncated power iteration on a normalized pair-wise similarity matrix of the data.

`spark.ml`'s `PowerIterationClustering` implementation takes the following parameters:

* `k`: the number of clusters to create
* `initMode`: the initialization algorithm
* `maxIter`: the maximum number of iterations
* `srcCol`: the name of the input column for source vertex IDs
* `dstCol`: the name of the input column for destination vertex IDs
* `weightCol`: the name of the weight column
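
The idea of truncated power iteration on a normalized similarity matrix can be sketched in plain Python as follows. This is a toy, single-machine illustration under several assumptions, not the `spark.ml` implementation: edges are given as `(src, dst, weight)` triples, the starting vector is proportional to vertex degree, and the final grouping cuts the one-dimensional embedding at its largest gaps as a simple stand-in for the k-means step used in the paper. All function names are hypothetical.

```python
def cluster_1d(values, k):
    """Group 1-D values into k clusters by cutting at the k-1 largest gaps.
    A simple stand-in for running k-means on the embedding."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    cuts = set(sorted(range(1, len(order)),
                      key=lambda j: values[order[j]] - values[order[j - 1]],
                      reverse=True)[:k - 1])
    labels, current = [0] * len(values), 0
    for pos, i in enumerate(order):
        if pos in cuts:
            current += 1
        labels[i] = current
    return labels


def power_iteration_clustering(edges, n, k, max_iter=10):
    """Cluster n vertices (IDs 0..n-1) from weighted, undirected edges.
    Assumes every vertex has at least one edge with positive weight."""
    # Symmetric affinity matrix built from (src, dst, weight) triples.
    A = [[0.0] * n for _ in range(n)]
    for src, dst, w in edges:
        A[src][dst] = w
        A[dst][src] = w
    degree = [sum(row) for row in A]
    # Row-normalized affinity matrix W = D^-1 A.
    W = [[a / degree[i] for a in A[i]] for i in range(n)]
    # Degree-based starting vector, scaled to sum to 1.
    total = sum(degree)
    v = [d / total for d in degree]
    # Truncated power iteration: stop after max_iter steps, well before
    # the vector converges to a constant.
    for _ in range(max_iter):
        v = [sum(W[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(abs(x) for x in v)
        v = [x / norm for x in v]
    return cluster_1d(v, k)


# Hypothetical toy graph: a triangle (0, 1, 2) and a 4-clique (3, 4, 5, 6)
# joined by one weak edge between vertices 2 and 3.
edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
         (3, 4, 1.0), (3, 5, 1.0), (3, 6, 1.0),
         (4, 5, 1.0), (4, 6, 1.0), (5, 6, 1.0),
         (2, 3, 0.1)]
labels = power_iteration_clustering(edges, n=7, k=2)
print(labels)
```

The triples mirror the role of the `srcCol`, `dstCol` and `weightCol` inputs, and `k` and `max_iter` mirror `k` and `maxIter`. Stopping early matters: run to convergence, the iteration would flatten the embedding and erase the cluster structure.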

**Examples**

<div class="codetabs">

<div data-lang="scala" markdown="1">
Refer to the [Scala API docs](api/scala/org/apache/spark/ml/clustering/PowerIterationClustering.html) for more details.

{% include_example scala/org/apache/spark/examples/ml/PowerIterationClusteringExample.scala %}
</div>

<div data-lang="java" markdown="1">
Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/PowerIterationClustering.html) for more details.

{% include_example java/org/apache/spark/examples/ml/JavaPowerIterationClusteringExample.java %}
</div>

<div data-lang="python" markdown="1">
Refer to the [Python API docs](api/python/reference/api/pyspark.ml.clustering.PowerIterationClustering.html) for more details.

{% include_example python/ml/power_iteration_clustering_example.py %}
</div>

<div data-lang="r" markdown="1">

Refer to the [R API docs](api/R/spark.powerIterationClustering.html) for more details.

{% include_example r/ml/powerIterationClustering.R %}
</div>

</div>