spark-instrumented-optimizer/docs/ml-clustering.md at 43304b1758dec141b7fe9ed33cac976d75efdf91

Yu ISHIKAWA 26d70bd2b4 [SPARK-12215][ML][DOC] User guide section for KMeans in spark.ml

cc jkbradley

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #10244 from yu-iskw/SPARK-12215.

2015-12-16 10:43:45 -08:00

2.8 KiB

Raw Blame History

layout	title	displayTitle
global	Clustering - spark.ml	Clustering - spark.ml

In this section, we introduce the pipeline API for clustering in mllib.

Table of Contents

This will become a table of contents (this text will be scraped). {:toc}

K-means

k-means is one of the most commonly used clustering algorithms that clusters the data points into a predefined number of clusters. The MLlib implementation includes a parallelized variant of the k-means++ method called kmeans||.

KMeans is implemented as an Estimator and generates a KMeansModel as the base model.

Input Columns

Param name	Type(s)	Default	Description
featuresCol	Vector	"features"	Feature vector

Output Columns

Param name	Type(s)	Default	Description
predictionCol	Int	"prediction"	Predicted cluster center

Example

Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.KMeans) for more details.

{% include_example scala/org/apache/spark/examples/ml/KMeansExample.scala %}

Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/KMeans.html) for more details.

{% include_example java/org/apache/spark/examples/ml/JavaKMeansExample.java %}

Latent Dirichlet allocation (LDA)

LDA is implemented as an Estimator that supports both EMLDAOptimizer and OnlineLDAOptimizer, and generates a LDAModel as the base models. Expert users may cast a LDAModel generated by EMLDAOptimizer to a DistributedLDAModel if needed.

Refer to the Scala API docs for more details.

{% include_example scala/org/apache/spark/examples/ml/LDAExample.scala %}

Refer to the Java API docs for more details.

{% include_example java/org/apache/spark/examples/ml/JavaLDAExample.java %}

2.8 KiB Raw Blame History