## What changes were proposed in this pull request?

Add python example for Power Iteration Clustering in spark.ml

## How was this patch tested?

Manually tested

Closes #22996 from huaxingao/spark-25997.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
layout | title | displayTitle |
---|---|---|
global | Clustering | Clustering |
This page describes clustering algorithms in MLlib. The guide for clustering in the RDD-based API also has relevant information about these algorithms.
Table of Contents
- This will become a table of contents (this text will be scraped).
{:toc}
K-means
k-means is one of the most commonly used clustering algorithms; it partitions the data points into a predefined number of clusters. The MLlib implementation includes a parallelized variant of the k-means++ method called kmeans||.
`KMeans` is implemented as an `Estimator` and generates a `KMeansModel` as the base model.
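To make the idea concrete, here is a minimal single-machine sketch of k-means with k-means++ seeding, written in plain Python. It is not MLlib's implementation and does not reproduce the parallel kmeans|| variant; the function names (`kmeans`, `kmeans_pp_init`, `nearest`) and the `seed` and `max_iter` defaults are assumptions made for illustration only.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the center closest to `point`."""
    return min(range(len(centers)), key=lambda j: squared_distance(point, centers[j]))


def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: each new center is sampled with probability
    proportional to its squared distance from the nearest chosen center."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


def kmeans(points, k, max_iter=20, seed=1):
    """Lloyd's iterations after k-means++ seeding.

    Illustrative only; assumes `points` is a list of tuples that contains
    at least k distinct points.
    """
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    for _ in range(max_iter):
        # Assignment step: label every point with its nearest center.
        labels = [nearest(p, centers) for p in points]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[j])  # leave an empty cluster's center in place
        if new_centers == centers:
            break
        centers = new_centers
    # Recompute labels so they always match the returned centers.
    return centers, [nearest(p, centers) for p in points]
```

Calling `kmeans(points, k=2)` on a list of 2-D tuples returns the final centers and one integer label per point, which loosely mirrors the integer cluster index that the `prediction` output column holds.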
Input Columns
Param name | Type(s) | Default | Description |
---|---|---|---|
featuresCol | Vector | "features" | Feature vector |
Output Columns
Param name | Type(s) | Default | Description |
---|---|---|---|
predictionCol | Int | "prediction" | Predicted cluster center |
Examples
{% include_example scala/org/apache/spark/examples/ml/KMeansExample.scala %}
{% include_example java/org/apache/spark/examples/ml/JavaKMeansExample.java %}
{% include_example python/ml/kmeans_example.py %}
Refer to the R API docs for more details.
{% include_example r/ml/kmeans.R %}
Latent Dirichlet allocation (LDA)
`LDA` is implemented as an `Estimator` that supports both `EMLDAOptimizer` and `OnlineLDAOptimizer`, and generates an `LDAModel` as the base model. Expert users may cast an `LDAModel` generated by `EMLDAOptimizer` to a `DistributedLDAModel` if needed.
Examples
Refer to the Scala API docs for more details.
{% include_example scala/org/apache/spark/examples/ml/LDAExample.scala %}
Refer to the Java API docs for more details.
{% include_example java/org/apache/spark/examples/ml/JavaLDAExample.java %}
Refer to the Python API docs for more details.
{% include_example python/ml/lda_example.py %}
Refer to the R API docs for more details.
{% include_example r/ml/lda.R %}
Bisecting k-means
Bisecting k-means is a kind of hierarchical clustering using a divisive (or "top-down") approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
Bisecting k-means can often be much faster than regular k-means, but it will generally produce a different clustering.
`BisectingKMeans` is implemented as an `Estimator` and generates a `BisectingKMeansModel` as the base model.
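The divisive, top-down strategy described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the MLlib implementation: the rule for choosing which cluster to split next (largest within-cluster squared error) and the deterministic seeding of each 2-way split are assumptions chosen to keep the example short, and all function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def sse(points):
    """Sum of squared distances from each point to the cluster mean."""
    center = mean(points)
    return sum(sq_dist(p, center) for p in points)


def bisect(points, max_iter=20):
    """Split one cluster in two with 2-means.

    Assumption: the two seeds are the point farthest from the mean and
    the point farthest from that first seed.
    """
    center = mean(points)
    c0 = max(points, key=lambda p: sq_dist(p, center))
    c1 = max(points, key=lambda p: sq_dist(p, c0))
    left, right = list(points), []
    for _ in range(max_iter):
        left = [p for p in points if sq_dist(p, c0) <= sq_dist(p, c1)]
        right = [p for p in points if sq_dist(p, c0) > sq_dist(p, c1)]
        if not left or not right:
            break
        new_c0, new_c1 = mean(left), mean(right)
        if (new_c0, new_c1) == (c0, c1):
            break
        c0, c1 = new_c0, new_c1
    return left, right


def bisecting_kmeans(points, k):
    """Top-down clustering: start with one cluster and split until k remain."""
    clusters = [list(points)]  # all observations start in one cluster
    while len(clusters) < k:
        # Assumption: always split the cluster with the largest error.
        index = max(range(len(clusters)), key=lambda i: sse(clusters[i]))
        target = clusters.pop(index)
        left, right = bisect(target)
        if not left or not right:
            clusters.append(target)  # cannot be split further
            break
        clusters.extend([left, right])
    return clusters
```

Because each split only looks at the points inside one cluster, the result can differ from what a flat k-means run with the same `k` would produce.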
Examples
{% include_example scala/org/apache/spark/examples/ml/BisectingKMeansExample.scala %}
{% include_example java/org/apache/spark/examples/ml/JavaBisectingKMeansExample.java %}
{% include_example python/ml/bisecting_k_means_example.py %}
Refer to the R API docs for more details.
{% include_example r/ml/bisectingKmeans.R %}
Gaussian Mixture Model (GMM)
A Gaussian Mixture Model represents a composite distribution whereby points are drawn from one of k Gaussian sub-distributions, each with its own probability. The `spark.ml` implementation uses the expectation-maximization algorithm to induce the maximum-likelihood model given a set of samples.
`GaussianMixture` is implemented as an `Estimator` and generates a `GaussianMixtureModel` as the base model.
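As a rough illustration of expectation-maximization for a mixture of Gaussians, the following plain-Python sketch fits a one-dimensional mixture. It is a deliberate simplification and not the `spark.ml` implementation, which handles multivariate data. The quantile-based seeding, the variance floor `min_var`, and all function names are assumptions made for this example, and there are no safeguards against numerical underflow on poorly separated or degenerate data.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def responsibilities(x, weights, means, variances):
    """Posterior probability of each mixture component for one sample."""
    joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]


def predict(x, weights, means, variances):
    """Index of the most probable component for one sample."""
    probs = responsibilities(x, weights, means, variances)
    return max(range(len(probs)), key=lambda j: probs[j])


def fit_gmm_1d(samples, k, max_iter=100, min_var=1e-6):
    """Fit a k-component 1-D mixture with EM (illustrative sketch)."""
    n = len(samples)
    ordered = sorted(samples)
    # Assumption: seed the means at evenly spaced quantiles of the data.
    means = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    overall_mean = sum(samples) / n
    overall_var = sum((x - overall_mean) ** 2 for x in samples) / n
    variances = [max(overall_var, min_var)] * k
    weights = [1.0 / k] * k
    for _ in range(max_iter):
        # E-step: soft-assign every sample to every component.
        resp = [responsibilities(x, weights, means, variances) for x in samples]
        # M-step: re-estimate each component from its weighted samples.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, samples)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, samples)) / nj
            variances[j] = max(var, min_var)  # keep the variance strictly positive
    return weights, means, variances
```

In this sketch, `responsibilities` plays a role similar to the `probability` output column (one probability per cluster), and `predict` plays a role similar to `prediction` (the index of the most probable cluster).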
Input Columns
Param name | Type(s) | Default | Description |
---|---|---|---|
featuresCol | Vector | "features" | Feature vector |
Output Columns
Param name | Type(s) | Default | Description |
---|---|---|---|
predictionCol | Int | "prediction" | Predicted cluster center |
probabilityCol | Vector | "probability" | Probability of each cluster |
Examples
{% include_example scala/org/apache/spark/examples/ml/GaussianMixtureExample.scala %}
{% include_example java/org/apache/spark/examples/ml/JavaGaussianMixtureExample.java %}
{% include_example python/ml/gaussian_mixture_example.py %}
Refer to the R API docs for more details.
{% include_example r/ml/gaussianMixture.R %}
Power Iteration Clustering (PIC)
Power Iteration Clustering (PIC) is a scalable graph clustering algorithm developed by Lin and Cohen. From the abstract: PIC finds a very low-dimensional embedding of a dataset using truncated power iteration on a normalized pair-wise similarity matrix of the data.
`spark.ml`'s PowerIterationClustering implementation takes the following parameters:

- `k`: the number of clusters to create
- `initMode`: the initialization algorithm
- `maxIter`: the maximum number of iterations
- `srcCol`: the name of the input column for source vertex IDs
- `dstCol`: the name of the input column for destination vertex IDs
- `weightCol`: the name of the weight column
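The following plain-Python sketch shows the general shape of the algorithm on a small graph given as `(src, dst, weight)` triples, echoing the source, destination, and weight columns listed above. It is a single-machine illustration under stated assumptions, not the `spark.ml` implementation: vertex IDs are assumed to be the integers `0..n-1`, every vertex is assumed to have at least one positively weighted edge, the starting vector is degree-based, and the final grouping uses a simple 1-D k-means with evenly spaced starting centers. All function names are hypothetical.

```python
def power_iteration_embedding(edges, n, max_iter=15):
    """Run truncated power iteration on the row-normalized affinity matrix."""
    # Build a symmetric affinity matrix A from (src, dst, weight) triples.
    affinity = [[0.0] * n for _ in range(n)]
    for src, dst, weight in edges:
        affinity[src][dst] = weight
        affinity[dst][src] = weight
    degrees = [sum(row) for row in affinity]
    # Row-normalize: W = D^-1 A (assumes every degree is positive).
    w = [[a / d for a in row] for row, d in zip(affinity, degrees)]
    # Assumption: degree-based starting vector, scaled to sum to 1.
    total = sum(degrees)
    v = [d / total for d in degrees]
    for _ in range(max_iter):
        wv = [sum(wij * vj for wij, vj in zip(row, v)) for row in w]
        norm = sum(abs(x) for x in wv)
        v = [x / norm for x in wv]  # rescale by the L1 norm
    return v


def kmeans_1d(values, k, max_iter=50):
    """Cluster scalars with k-means; centers start evenly spaced (needs k >= 2)."""
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * j / (k - 1) for j in range(k)]

    def label(x):
        return min(range(k), key=lambda j: abs(x - centers[j]))

    for _ in range(max_iter):
        labels = [label(x) for x in values]
        updated = []
        for j in range(k):
            members = [x for x, lab in zip(values, labels) if lab == j]
            updated.append(sum(members) / len(members) if members else centers[j])
        if updated == centers:
            break
        centers = updated
    return [label(x) for x in values]


def power_iteration_clustering(edges, n, k, max_iter=15):
    """Return one cluster label per vertex ID 0..n-1."""
    embedding = power_iteration_embedding(edges, n, max_iter)
    return kmeans_1d(embedding, k)
```

Stopping the iteration early matters here: run to convergence, the vector would flatten to a constant on a connected graph, whereas after a small number of steps vertices in the same tightly connected group tend to share similar values that the final k-means step can separate.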
Examples
{% include_example scala/org/apache/spark/examples/ml/PowerIterationClusteringExample.scala %}
{% include_example java/org/apache/spark/examples/ml/JavaPowerIterationClusteringExample.java %}
{% include_example python/ml/power_iteration_clustering_example.py %}
Refer to the R API docs for more details.
{% include_example r/ml/powerIterationClustering.R %}