c235b83e27
As per discussions with Xiangrui, I've reorganized and edited the mllib documentation. Author: Ameet Talwalkar <atalwalkar@gmail.com> Closes #1908 from atalwalkar/master and squashes the following commits: fe6938a [Ameet Talwalkar] made xiangruis suggested changes 840028b [Ameet Talwalkar] made xiangruis suggested changes 7ec366a [Ameet Talwalkar] reorganize and edit mllib documentation
155 lines
6.1 KiB
Markdown
155 lines
6.1 KiB
Markdown
---
|
|
layout: global
|
|
title: Clustering - MLlib
|
|
displayTitle: <a href="mllib-guide.html">MLlib</a> - Clustering
|
|
---
|
|
|
|
* Table of contents
|
|
{:toc}
|
|
|
|
|
|
## Clustering
|
|
|
|
Clustering is an unsupervised learning problem whereby we aim to group subsets
|
|
of entities with one another based on some notion of similarity. Clustering is
|
|
often used for exploratory analysis and/or as a component of a hierarchical
|
|
supervised learning pipeline (in which distinct classifiers or regression
|
|
models are trained for each cluster).
|
|
|
|
MLlib supports
|
|
[k-means](http://en.wikipedia.org/wiki/K-means_clustering) clustering, one of
|
|
the most commonly used clustering algorithms that clusters the data points into
|
|
predefined number of clusters. The MLlib implementation includes a parallelized
|
|
variant of the [k-means++](http://en.wikipedia.org/wiki/K-means%2B%2B) method
|
|
called [kmeans||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).
|
|
The implementation in MLlib has the following parameters:
|
|
|
|
* *k* is the number of desired clusters.
|
|
* *maxIterations* is the maximum number of iterations to run.
|
|
* *initializationMode* specifies either random initialization or
|
|
initialization via k-means\|\|.
|
|
* *runs* is the number of times to run the k-means algorithm (k-means is not
|
|
guaranteed to find a globally optimal solution, and when run multiple times on
|
|
a given dataset, the algorithm returns the best clustering result).
|
|
* *initializationSteps* determines the number of steps in the k-means\|\| algorithm.
|
|
* *epsilon* determines the distance threshold within which we consider k-means to have converged.
|
|
|
|
## Examples
|
|
|
|
<div class="codetabs">
|
|
<div data-lang="scala" markdown="1">
|
|
The following code snippets can be executed in `spark-shell`.
|
|
|
|
In the following example after loading and parsing data, we use the
|
|
[`KMeans`](api/scala/index.html#org.apache.spark.mllib.clustering.KMeans) object to cluster the data
|
|
into two clusters. The number of desired clusters is passed to the algorithm. We then compute Within
|
|
Set Sum of Squared Error (WSSSE). You can reduce this error measure by increasing *k*. In fact the
|
|
optimal *k* is usually one where there is an "elbow" in the WSSSE graph.
|
|
|
|
{% highlight scala %}
|
|
import org.apache.spark.mllib.clustering.KMeans
|
|
import org.apache.spark.mllib.linalg.Vectors
|
|
|
|
// Load and parse the data
|
|
val data = sc.textFile("data/mllib/kmeans_data.txt")
|
|
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
|
|
|
|
// Cluster the data into two classes using KMeans
|
|
val numClusters = 2
|
|
val numIterations = 20
|
|
val clusters = KMeans.train(parsedData, numClusters, numIterations)
|
|
|
|
// Evaluate clustering by computing Within Set Sum of Squared Errors
|
|
val WSSSE = clusters.computeCost(parsedData)
|
|
println("Within Set Sum of Squared Errors = " + WSSSE)
|
|
{% endhighlight %}
|
|
</div>
|
|
|
|
<div data-lang="java" markdown="1">
|
|
All of MLlib's methods use Java-friendly types, so you can import and call them there the same
|
|
way you do in Scala. The only caveat is that the methods take Scala RDD objects, while the
|
|
Spark Java API uses a separate `JavaRDD` class. You can convert a Java RDD to a Scala one by
|
|
calling `.rdd()` on your `JavaRDD` object. A standalone application example
|
|
that is equivalent to the provided example in Scala is given below:
|
|
|
|
{% highlight java %}
|
|
import org.apache.spark.api.java.*;
|
|
import org.apache.spark.api.java.function.Function;
|
|
import org.apache.spark.mllib.clustering.KMeans;
|
|
import org.apache.spark.mllib.clustering.KMeansModel;
|
|
import org.apache.spark.mllib.linalg.Vector;
|
|
import org.apache.spark.mllib.linalg.Vectors;
|
|
import org.apache.spark.SparkConf;
|
|
|
|
public class KMeansExample {
|
|
public static void main(String[] args) {
|
|
SparkConf conf = new SparkConf().setAppName("K-means Example");
|
|
JavaSparkContext sc = new JavaSparkContext(conf);
|
|
|
|
// Load and parse data
|
|
String path = "data/mllib/kmeans_data.txt";
|
|
JavaRDD<String> data = sc.textFile(path);
|
|
JavaRDD<Vector> parsedData = data.map(
|
|
new Function<String, Vector>() {
|
|
public Vector call(String s) {
|
|
String[] sarray = s.split(" ");
|
|
double[] values = new double[sarray.length];
|
|
for (int i = 0; i < sarray.length; i++)
|
|
values[i] = Double.parseDouble(sarray[i]);
|
|
return Vectors.dense(values);
|
|
}
|
|
}
|
|
);
|
|
|
|
// Cluster the data into two classes using KMeans
|
|
int numClusters = 2;
|
|
int numIterations = 20;
|
|
KMeansModel clusters = KMeans.train(parsedData.rdd(), numClusters, numIterations);
|
|
|
|
// Evaluate clustering by computing Within Set Sum of Squared Errors
|
|
double WSSSE = clusters.computeCost(parsedData.rdd());
|
|
System.out.println("Within Set Sum of Squared Errors = " + WSSSE);
|
|
}
|
|
}
|
|
{% endhighlight %}
|
|
|
|
In order to run the above standalone application, follow the instructions
|
|
provided in the [Standalone
|
|
Applications](quick-start.html#standalone-applications) section of the Spark
|
|
quick-start guide. Be sure to also include *spark-mllib* to your build file as
|
|
a dependency.
|
|
</div>
|
|
|
|
<div data-lang="python" markdown="1">
|
|
The following examples can be tested in the PySpark shell.
|
|
|
|
In the following example after loading and parsing data, we use the KMeans object to cluster the
|
|
data into two clusters. The number of desired clusters is passed to the algorithm. We then compute
|
|
Within Set Sum of Squared Error (WSSSE). You can reduce this error measure by increasing *k*. In
|
|
fact the optimal *k* is usually one where there is an "elbow" in the WSSSE graph.
|
|
|
|
{% highlight python %}
|
|
from pyspark.mllib.clustering import KMeans
|
|
from numpy import array
|
|
from math import sqrt
|
|
|
|
# Load and parse the data
|
|
data = sc.textFile("data/mllib/kmeans_data.txt")
|
|
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))
|
|
|
|
# Build the model (cluster the data)
|
|
clusters = KMeans.train(parsedData, 2, maxIterations=10,
|
|
runs=10, initializationMode="random")
|
|
|
|
# Evaluate clustering by computing Within Set Sum of Squared Errors
|
|
def error(point):
|
|
center = clusters.centers[clusters.predict(point)]
|
|
return sqrt(sum([x**2 for x in (point - center)]))
|
|
|
|
WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
|
|
print("Within Set Sum of Squared Error = " + str(WSSSE))
|
|
{% endhighlight %}
|
|
</div>
|
|
|
|
</div>
|