[SPARK-5539][MLLIB] LDA guide
This is the LDA user guide from jkbradley with Java and Scala code examples.

Author: Xiangrui Meng <meng@databricks.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #4465 from mengxr/lda-guide and squashes the following commits:

6dcb7d1 [Xiangrui Meng] update java example in the user guide
76169ff [Xiangrui Meng] update java example
36c3ae2 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into lda-guide
c2a1efe [Joseph K. Bradley] Added LDA programming guide, plus Java example (which is in the guide and probably should be removed)
commit 855d12ac0a (parent 4575c5643a)
data/mllib/sample_lda_data.txt (new file, 12 lines)
@@ -0,0 +1,12 @@
1 2 6 0 2 3 1 1 0 0 3
1 3 0 1 3 0 0 2 0 0 1
1 4 1 0 0 4 9 0 1 2 0
2 1 0 3 0 0 5 0 2 3 9
3 1 1 9 3 0 2 0 0 1 3
4 2 0 3 4 5 1 1 1 4 0
2 1 0 3 0 0 5 0 2 2 9
1 1 1 9 2 1 2 0 0 1 3
4 4 0 3 4 2 1 3 0 0 0
2 8 2 0 3 0 2 0 2 7 2
1 1 1 9 0 2 2 0 0 3 3
4 1 0 0 4 5 1 3 0 1 0
docs/mllib-clustering.md
@@ -55,7 +55,7 @@ has the following parameters:

Power iteration clustering is a scalable and efficient algorithm for clustering points given pointwise mutual affinity values. Internally the algorithm:

-* accepts a [Graph](https://spark.apache.org/docs/0.9.2/api/graphx/index.html#org.apache.spark.graphx.Graph) that represents a normalized pairwise affinity between all input points.
+* accepts a [Graph](api/graphx/index.html#org.apache.spark.graphx.Graph) that represents a normalized pairwise affinity between all input points.
* calculates the principal eigenvalue and eigenvector
* Clusters each of the input points according to their principal eigenvector component value
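As an aside for intuition, here is a minimal, self-contained sketch of the power iteration underlying the algorithm above (an illustration only, not Spark's implementation):

{% highlight scala %}
// Repeatedly multiply by a small, row-normalized affinity matrix W and
// renormalize; v converges toward the principal eigenvector of W. Points
// with similar eigenvector components end up in the same cluster.
val W = Array(
  Array(0.0, 0.5, 0.5),
  Array(0.5, 0.0, 0.5),
  Array(0.5, 0.5, 0.0))
var v = Array.fill(W.length)(1.0 / W.length)
for (_ <- 0 until 20) {
  val next = W.map(row => row.zip(v).map { case (w, x) => w * x }.sum)
  val norm = math.sqrt(next.map(x => x * x).sum)
  v = next.map(_ / norm)
}
println(v.mkString(" "))
{% endhighlight %}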

@@ -71,6 +71,35 @@ Example outputs for a dataset inspired by the paper - but with five clusters ins

<!-- Images are downsized intentionally to improve quality on retina displays -->
</p>

### Latent Dirichlet Allocation (LDA)

[Latent Dirichlet Allocation (LDA)](http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)
is a topic model which infers topics from a collection of text documents.
LDA can be thought of as a clustering algorithm as follows:

* Topics correspond to cluster centers, and documents correspond to examples (rows) in a dataset.
* Topics and documents both exist in a feature space, where feature vectors are vectors of word counts.
* Rather than estimating a clustering using a traditional distance, LDA uses a function based
on a statistical model of how text documents are generated.
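For reference, the statistical model referred to above is the standard LDA generative process, in which each document draws a mixture over topics and each topic draws a distribution over words from Dirichlet priors; those priors correspond to the `docConcentration` and `topicConcentration` parameters listed further below:

\[
\theta_d \sim \mathrm{Dirichlet}(\alpha), \quad
\phi_k \sim \mathrm{Dirichlet}(\beta), \quad
z_{dn} \sim \mathrm{Multinomial}(\theta_d), \quad
w_{dn} \sim \mathrm{Multinomial}(\phi_{z_{dn}})
\]

Here \(\theta_d\) is document \(d\)'s distribution over topics, \(\phi_k\) is topic \(k\)'s distribution over words, and \(z_{dn}\) and \(w_{dn}\) are the topic assignment and observed word at position \(n\) of document \(d\).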

LDA takes in a collection of documents as vectors of word counts.
It learns clustering using [expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
on the likelihood function. After fitting on the documents, LDA provides:

* Topics: Inferred topics, each of which is a probability distribution over terms (words).
* Topic distributions for documents: For each document in the training set, LDA gives a probability distribution over topics.
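The examples below load documents that are already vectorized. As a sketch of how such word-count vectors might be produced from tokenized text (the `docs` and `vocab` values here are illustrative placeholders, not part of this commit):

{% highlight scala %}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Hypothetical tokenized documents and a term -> index vocabulary.
val docs: Seq[Seq[String]] = Seq(
  Seq("spark", "mllib", "spark"),
  Seq("lda", "topic", "mllib"))
val vocab: Map[String, Int] = docs.flatten.distinct.zipWithIndex.toMap

// One sparse word-count vector per document, sized to the vocabulary.
val wordCountVectors: Seq[Vector] = docs.map { tokens =>
  val counts = tokens.groupBy(identity).map { case (term, occurrences) =>
    (vocab(term), occurrences.size.toDouble)
  }.toSeq
  Vectors.sparse(vocab.size, counts)
}
{% endhighlight %}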

LDA takes the following parameters:

* `k`: Number of topics (i.e., cluster centers)
* `maxIterations`: Limit on the number of iterations of EM used for learning
* `docConcentration`: Hyperparameter for prior over documents' distributions over topics. Currently must be > 1, where larger values encourage smoother inferred distributions.
* `topicConcentration`: Hyperparameter for prior over topics' distributions over terms (words). Currently must be > 1, where larger values encourage smoother inferred distributions.
* `checkpointInterval`: If using checkpointing (set in the Spark configuration), this parameter specifies the frequency with which checkpoints will be created. If `maxIterations` is large, using checkpointing can help reduce shuffle file sizes on disk and help with failure recovery.
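A minimal sketch of setting these parameters through `LDA`'s builder methods (the values are illustrative, not tuned recommendations):

{% highlight scala %}
import org.apache.spark.mllib.clustering.LDA

val lda = new LDA()
  .setK(10)                    // number of topics
  .setMaxIterations(50)        // cap on EM iterations
  .setDocConcentration(1.1)    // > 1; larger => smoother document-topic distributions
  .setTopicConcentration(1.1)  // > 1; larger => smoother topic-word distributions
  .setCheckpointInterval(10)   // checkpoint every 10 iterations, if checkpointing is enabled
{% endhighlight %}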

*Note*: LDA is a new feature with some missing functionality. In particular, it does not yet
support prediction on new documents, and it does not have a Python API. These will be added in the future.

### Examples

#### k-means
@@ -293,6 +322,104 @@ for i in range(2):

</div>

#### Latent Dirichlet Allocation (LDA) Example

In the following example, we load word count vectors representing a corpus of documents.
We then use [LDA](api/scala/index.html#org.apache.spark.mllib.clustering.LDA)
to infer three topics from the documents. The number of desired clusters is passed
to the algorithm. We then output the topics, represented as probability distributions over words.

<div class="codetabs">
<div data-lang="scala" markdown="1">

{% highlight scala %}
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("data/mllib/sample_lda_data.txt")
val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))
// Index documents with unique IDs
val corpus = parsedData.zipWithIndex.map(_.swap).cache()

// Cluster the documents into three topics using LDA
val ldaModel = new LDA().setK(3).run(corpus)

// Output topics. Each is a distribution over words (matching word count vectors)
println("Learned topics (as distributions over vocab of " + ldaModel.vocabSize + " words):")
val topics = ldaModel.topicsMatrix
for (topic <- Range(0, 3)) {
  print("Topic " + topic + ":")
  for (word <- Range(0, ldaModel.vocabSize)) { print(" " + topics(word, topic)) }
  println()
}
{% endhighlight %}
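Because `run` returns a `DistributedLDAModel` here (the Java version below relies on this as well), the per-document topic distributions mentioned earlier can also be inspected; a brief sketch:

{% highlight scala %}
// Topic mixtures for the training documents: (document ID, distribution over topics).
val docTopics = ldaModel.topicDistributions
docTopics.take(3).foreach { case (docId, dist) =>
  println("Doc " + docId + ": " + dist)
}
{% endhighlight %}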
</div>

<div data-lang="java" markdown="1">
{% highlight java %}
import scala.Tuple2;

import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.clustering.DistributedLDAModel;
import org.apache.spark.mllib.clustering.LDA;
import org.apache.spark.mllib.linalg.Matrix;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.SparkConf;

public class JavaLDAExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("LDA Example");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Load and parse the data
    String path = "data/mllib/sample_lda_data.txt";
    JavaRDD<String> data = sc.textFile(path);
    JavaRDD<Vector> parsedData = data.map(
        new Function<String, Vector>() {
          public Vector call(String s) {
            String[] sarray = s.trim().split(" ");
            double[] values = new double[sarray.length];
            for (int i = 0; i < sarray.length; i++) {
              values[i] = Double.parseDouble(sarray[i]);
            }
            return Vectors.dense(values);
          }
        }
    );
    // Index documents with unique IDs
    JavaPairRDD<Long, Vector> corpus = JavaPairRDD.fromJavaRDD(parsedData.zipWithIndex().map(
        new Function<Tuple2<Vector, Long>, Tuple2<Long, Vector>>() {
          public Tuple2<Long, Vector> call(Tuple2<Vector, Long> doc_id) {
            return doc_id.swap();
          }
        }
    ));
    corpus.cache();

    // Cluster the documents into three topics using LDA
    DistributedLDAModel ldaModel = new LDA().setK(3).run(corpus);

    // Output topics. Each is a distribution over words (matching word count vectors)
    System.out.println("Learned topics (as distributions over vocab of " + ldaModel.vocabSize()
        + " words):");
    Matrix topics = ldaModel.topicsMatrix();
    for (int topic = 0; topic < 3; topic++) {
      System.out.print("Topic " + topic + ":");
      for (int word = 0; word < ldaModel.vocabSize(); word++) {
        System.out.print(" " + topics.apply(word, topic));
      }
      System.out.println();
    }
  }
}
{% endhighlight %}
</div>

</div>

In order to run the above application, follow the instructions
provided in the [Self-Contained Applications](quick-start.html#self-contained-applications)
section of the Spark Quick Start guide.
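(The packaged copy of this example added by this commit under `org.apache.spark.examples.mllib` should also be runnable directly with Spark's examples launcher, e.g. `./bin/run-example mllib.JavaLDAExample`.)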

examples/src/main/java/org/apache/spark/examples/mllib/JavaLDAExample.java (new file)
@@ -0,0 +1,75 @@
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.examples.mllib;
import scala.Tuple2;

import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.clustering.DistributedLDAModel;
import org.apache.spark.mllib.clustering.LDA;
import org.apache.spark.mllib.linalg.Matrix;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.SparkConf;

public class JavaLDAExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("LDA Example");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Load and parse the data
    String path = "data/mllib/sample_lda_data.txt";
    JavaRDD<String> data = sc.textFile(path);
    JavaRDD<Vector> parsedData = data.map(
        new Function<String, Vector>() {
          public Vector call(String s) {
            String[] sarray = s.trim().split(" ");
            double[] values = new double[sarray.length];
            for (int i = 0; i < sarray.length; i++) {
              values[i] = Double.parseDouble(sarray[i]);
            }
            return Vectors.dense(values);
          }
        }
    );
    // Index documents with unique IDs
    JavaPairRDD<Long, Vector> corpus = JavaPairRDD.fromJavaRDD(parsedData.zipWithIndex().map(
        new Function<Tuple2<Vector, Long>, Tuple2<Long, Vector>>() {
          public Tuple2<Long, Vector> call(Tuple2<Vector, Long> doc_id) {
            return doc_id.swap();
          }
        }
    ));
    corpus.cache();

    // Cluster the documents into three topics using LDA
    DistributedLDAModel ldaModel = new LDA().setK(3).run(corpus);

    // Output topics. Each is a distribution over words (matching word count vectors)
    System.out.println("Learned topics (as distributions over vocab of " + ldaModel.vocabSize()
        + " words):");
    Matrix topics = ldaModel.topicsMatrix();
    for (int topic = 0; topic < 3; topic++) {
      System.out.print("Topic " + topic + ":");
      for (int word = 0; word < ldaModel.vocabSize(); word++) {
        System.out.print(" " + topics.apply(word, topic));
      }
      System.out.println();
    }
  }
}