SPARK-5805 Fixed the type error in documentation.
Fixes SPARK-5805: fix the type error in the final example given in the MLlib - Clustering documentation.

Author: Emre Sevinç <emre.sevinc@gmail.com>

Closes #4596 from emres/SPARK-5805 and squashes the following commits:

1029f66 [Emre Sevinç] SPARK-5805 Fixed the type error in documentation.
parent 077eec2d9d
commit 9f31db0610
@@ -14,7 +14,7 @@ Clustering is an unsupervised learning problem whereby we aim to group subsets
of entities with one another based on some notion of similarity. Clustering is
often used for exploratory analysis and/or as a component of a hierarchical
supervised learning pipeline (in which distinct classifiers or regression
models are trained for each cluster).

MLlib supports the following models:
@@ -25,7 +25,7 @@ most commonly used clustering algorithms that clusters the data points into a
predefined number of clusters. The MLlib implementation includes a parallelized
variant of the [k-means++](http://en.wikipedia.org/wiki/K-means%2B%2B) method
called [kmeans||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).
The implementation in MLlib has the following parameters:

* *k* is the number of desired clusters.
* *maxIterations* is the maximum number of iterations to run.

@@ -35,12 +35,12 @@ initialization via k-means\|\|.
guaranteed to find a globally optimal solution, and when run multiple times on
a given dataset, the algorithm returns the best clustering result).
* *initializationSteps* determines the number of steps in the k-means\|\| algorithm.
* *epsilon* determines the distance threshold within which we consider k-means to have converged.
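
For orientation (a sketch, not part of this commit's diff), here is roughly how these parameters map onto MLlib's `KMeans` builder API; the input path and the SparkContext `sc` are assumptions:

{% highlight scala %}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Assumption: whitespace-separated numeric vectors, one point per line.
val parsedData = sc.textFile("data/mllib/kmeans_data.txt")
  .map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
  .cache()

val model = new KMeans()
  .setK(2)                                        // *k*
  .setMaxIterations(20)                           // *maxIterations*
  .setInitializationMode(KMeans.K_MEANS_PARALLEL) // initialization via k-means||
  .setInitializationSteps(5)                      // *initializationSteps*
  .setEpsilon(1e-4)                               // *epsilon*
  .run(parsedData)
{% endhighlight %}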
### Gaussian mixture

A [Gaussian Mixture Model](http://en.wikipedia.org/wiki/Mixture_model#Multivariate_Gaussian_mixture_model)
represents a composite distribution whereby points are drawn from one of *k* Gaussian sub-distributions,
each with its own probability. The MLlib implementation uses the
[expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
algorithm to induce the maximum-likelihood model given a set of samples. The implementation
@@ -221,8 +221,8 @@ print("Within Set Sum of Squared Error = " + str(WSSSE))
<div class="codetabs">
<div data-lang="scala" markdown="1">
In the following example, after loading and parsing data, we use a
[GaussianMixture](api/scala/index.html#org.apache.spark.mllib.clustering.GaussianMixture)
object to cluster the data into two clusters. The number of desired clusters is passed
to the algorithm. We then output the parameters of the mixture model.

{% highlight scala %}
@@ -238,7 +238,7 @@ val gmm = new GaussianMixture().setK(2).run(parsedData)

// output parameters of max-likelihood model
for (i <- 0 until gmm.k) {
  println("weight=%f\nmu=%s\nsigma=\n%s\n" format
    (gmm.weights(i), gmm.gaussians(i).mu, gmm.gaussians(i).sigma))
}
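
// Sketch (not shown in this diff): the fitted model can also assign each
// point to its most likely Gaussian component, e.g.:
// val memberships = gmm.predict(parsedData)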
@@ -298,7 +298,7 @@ public class GaussianMixtureExample {
<div data-lang="python" markdown="1">
In the following example, after loading and parsing data, we use a
[GaussianMixture](api/python/pyspark.mllib.html#pyspark.mllib.clustering.GaussianMixture)
object to cluster the data into two clusters. The number of desired clusters is passed
to the algorithm. We then output the parameters of the mixture model.

{% highlight python %}
@@ -326,7 +326,7 @@ for i in range(2):

In the following example, we load word count vectors representing a corpus of documents.
We then use [LDA](api/scala/index.html#org.apache.spark.mllib.clustering.LDA)
to infer three topics from the documents. The number of desired clusters is passed
to the algorithm. We then output the topics, represented as probability distributions over words.

<div class="codetabs">
@@ -428,27 +428,27 @@ a dependency.

## Streaming clustering

When data arrive in a stream, we may want to estimate clusters dynamically,
updating them as new data arrive. MLlib provides support for streaming k-means clustering,
with parameters to control the decay (or "forgetfulness") of the estimates. The algorithm
uses a generalization of the mini-batch k-means update rule. For each batch of data, we assign
all points to their nearest cluster, compute new cluster centers, then update each cluster using:

`\begin{equation}
c_{t+1} = \frac{c_t n_t \alpha + x_t m_t}{n_t \alpha + m_t}
\end{equation}`
`\begin{equation}
n_{t+1} = n_t + m_t
\end{equation}`

Where `$c_t$` is the previous center for the cluster, `$n_t$` is the number of points assigned
to the cluster thus far, `$x_t$` is the new cluster center from the current batch, and `$m_t$`
is the number of points added to the cluster in the current batch. The decay factor `$\alpha$`
can be used to ignore the past: with `$\alpha$=1` all data will be used from the beginning;
with `$\alpha$=0` only the most recent data will be used. This is analogous to an
exponentially-weighted moving average.
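
To make the update concrete (a sketch using the symbols above, not MLlib's internal code; the names are illustrative), the rule for a single cluster can be written as:

{% highlight scala %}
// One streaming k-means update step for a single cluster.
def updateCluster(cT: Array[Double], nT: Double,  // previous center and point count
                  xT: Array[Double], mT: Double,  // batch center and batch count
                  alpha: Double): (Array[Double], Double) = {
  val denom = nT * alpha + mT
  // c_{t+1} = (c_t * n_t * alpha + x_t * m_t) / (n_t * alpha + m_t)
  val cNext = cT.zip(xT).map { case (c, x) => (c * nT * alpha + x * mT) / denom }
  (cNext, nT + mT)  // n_{t+1} = n_t + m_t
}
{% endhighlight %}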

The decay can be specified using a `halfLife` parameter, which determines the
correct decay factor `a` such that, for data acquired
at time `t`, its contribution by time `t + halfLife` will have dropped to 0.5.
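Concretely (a short derivation, not in the original text), requiring the contribution to halve after `halfLife` units means `$a^{halfLife} = 0.5$`, i.e.

`\begin{equation}
a = 0.5^{1/\text{halfLife}}
\end{equation}`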
The unit of time can be specified either as `batches` or `points` and the update rule
@@ -472,9 +472,9 @@ import org.apache.spark.mllib.clustering.StreamingKMeans

{% endhighlight %}

Then we make an input stream of vectors for training, as well as a stream of labeled data
points for testing. We assume a StreamingContext `ssc` has been created; see the
[Spark Streaming Programming Guide](streaming-programming-guide.html#initializing) for more info.

{% highlight scala %}
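// A plausible sketch of the setup this hunk elides (the paths are assumptions,
// taken from the directory layout described below):
val trainingData = ssc.textFileStream("/training/data/dir").map(Vectors.parse)
val testData = ssc.textFileStream("/testing/data/dir").map(LabeledPoint.parse)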
@@ -496,24 +496,24 @@ val model = new StreamingKMeans()

{% endhighlight %}

Now register the streams for training and testing and start the job, printing
the predicted cluster assignments on new data points as they arrive.

{% highlight scala %}
model.trainOn(trainingData)
-model.predictOnValues(testData).print()
+model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()

ssc.start()
ssc.awaitTermination()

{% endhighlight %}

As you add new text files with data, the cluster centers will update. Each training
point should be formatted as `[x1, x2, x3]`, and each test data point
should be formatted as `(y, [x1, x2, x3])`, where `y` is some useful label or identifier
(e.g. a true category assignment). Anytime a text file is placed in `/training/data/dir`
the model will update. Anytime a text file is placed in `/testing/data/dir`
you will see predictions. With new data, the cluster centers will change!
</div>