[SPARK-9888] [MLLIB] User guide for new LDA features
* Adds two new sections to LDA's user guide; one for each optimizer/model * Documents new features added to LDA (e.g. topXXXperXXX, asymmetric priors, hyperpam optimization) * Cleans up a TODO and sets a default parameter in LDA code jkbradley hhbyyh Author: Feynman Liang <fliang@databricks.com> Closes #8254 from feynmanliang/SPARK-9888.
This commit is contained in:
parent
7467b52ed0
commit
125205cdb3
|
@ -438,28 +438,125 @@ sameModel = PowerIterationClusteringModel.load(sc, "myModelPath")
|
|||
is a topic model which infers topics from a collection of text documents.
|
||||
LDA can be thought of as a clustering algorithm as follows:
|
||||
|
||||
* Topics correspond to cluster centers, and documents correspond to examples (rows) in a dataset.
|
||||
* Topics and documents both exist in a feature space, where feature vectors are vectors of word counts.
|
||||
* Rather than estimating a clustering using a traditional distance, LDA uses a function based
|
||||
on a statistical model of how text documents are generated.
|
||||
* Topics correspond to cluster centers, and documents correspond to
|
||||
examples (rows) in a dataset.
|
||||
* Topics and documents both exist in a feature space, where feature
|
||||
vectors are vectors of word counts (bag of words).
|
||||
* Rather than estimating a clustering using a traditional distance, LDA
|
||||
uses a function based on a statistical model of how text documents are
|
||||
generated.
|
||||
|
||||
LDA takes in a collection of documents as vectors of word counts.
|
||||
It supports different inference algorithms via `setOptimizer` function. EMLDAOptimizer learns clustering using [expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
|
||||
on the likelihood function and yields comprehensive results, while OnlineLDAOptimizer uses iterative mini-batch sampling for [online variational inference](https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf) and is generally memory friendly. After fitting on the documents, LDA provides:
|
||||
LDA supports different inference algorithms via `setOptimizer` function.
|
||||
`EMLDAOptimizer` learns clustering using
|
||||
[expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
|
||||
on the likelihood function and yields comprehensive results, while
|
||||
`OnlineLDAOptimizer` uses iterative mini-batch sampling for [online
|
||||
variational
|
||||
inference](https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf)
|
||||
and is generally memory friendly.
|
||||
|
||||
* Topics: Inferred topics, each of which is a probability distribution over terms (words).
|
||||
* Topic distributions for documents: For each non empty document in the training set, LDA gives a probability distribution over topics. (EM only). Note that for empty documents, we don't create the topic distributions. (EM only)
|
||||
|
||||
LDA takes the following parameters:
|
||||
LDA takes in a collection of documents as vectors of word counts and the
|
||||
following parameters (set using the builder pattern):
|
||||
|
||||
* `k`: Number of topics (i.e., cluster centers)
|
||||
* `maxIterations`: Limit on the number of iterations of EM used for learning
|
||||
* `docConcentration`: Hyperparameter for prior over documents' distributions over topics. Currently must be > 1, where larger values encourage smoother inferred distributions.
|
||||
* `topicConcentration`: Hyperparameter for prior over topics' distributions over terms (words). Currently must be > 1, where larger values encourage smoother inferred distributions.
|
||||
* `checkpointInterval`: If using checkpointing (set in the Spark configuration), this parameter specifies the frequency with which checkpoints will be created. If `maxIterations` is large, using checkpointing can help reduce shuffle file sizes on disk and help with failure recovery.
|
||||
* `optimizer`: Optimizer to use for learning the LDA model, either
|
||||
`EMLDAOptimizer` or `OnlineLDAOptimizer`
|
||||
* `docConcentration`: Dirichlet parameter for prior over documents'
|
||||
distributions over topics. Larger values encourage smoother inferred
|
||||
distributions.
|
||||
* `topicConcentration`: Dirichlet parameter for prior over topics'
|
||||
distributions over terms (words). Larger values encourage smoother
|
||||
inferred distributions.
|
||||
* `maxIterations`: Limit on the number of iterations.
|
||||
* `checkpointInterval`: If using checkpointing (set in the Spark
|
||||
configuration), this parameter specifies the frequency with which
|
||||
checkpoints will be created. If `maxIterations` is large, using
|
||||
checkpointing can help reduce shuffle file sizes on disk and help with
|
||||
failure recovery.
|
||||
|
||||
*Note*: LDA is a new feature with some missing functionality. In particular, it does not yet
|
||||
support prediction on new documents, and it does not have a Python API. These will be added in the future.
|
||||
|
||||
All of MLlib's LDA models support:
|
||||
|
||||
* `describeTopics`: Returns topics as arrays of most important terms and
|
||||
term weights
|
||||
* `topicsMatrix`: Returns a `vocabSize` by `k` matrix where each column
|
||||
is a topic
|
||||
|
||||
*Note*: LDA is still an experimental feature under active development.
|
||||
As a result, certain features are only available in one of the two
|
||||
optimizers / models generated by the optimizer. Currently, a distributed
|
||||
model can be converted into a local model, but not vice-versa.
|
||||
|
||||
The following discussion will describe each optimizer/model pair
|
||||
separately.
|
||||
|
||||
**Expectation Maximization**
|
||||
|
||||
Implemented in
|
||||
[`EMLDAOptimizer`](api/scala/index.html#org.apache.spark.mllib.clustering.EMLDAOptimizer)
|
||||
and
|
||||
[`DistributedLDAModel`](api/scala/index.html#org.apache.spark.mllib.clustering.DistributedLDAModel).
|
||||
|
||||
For the parameters provided to `LDA`:
|
||||
|
||||
* `docConcentration`: Only symmetric priors are supported, so all values
|
||||
in the provided `k`-dimensional vector must be identical. All values
|
||||
must also be $> 1.0$. Providing `Vector(-1)` results in default behavior
|
||||
(uniform `k` dimensional vector with value $(50 / k) + 1$
|
||||
* `topicConcentration`: Only symmetric priors supported. Values must be
|
||||
$> 1.0$. Providing `-1` results in defaulting to a value of $0.1 + 1$.
|
||||
* `maxIterations`: The maximum number of EM iterations.
|
||||
|
||||
`EMLDAOptimizer` produces a `DistributedLDAModel`, which stores not only
|
||||
the inferred topics but also the full training corpus and topic
|
||||
distributions for each document in the training corpus. A
|
||||
`DistributedLDAModel` supports:
|
||||
|
||||
* `topTopicsPerDocument`: The top topics and their weights for
|
||||
each document in the training corpus
|
||||
* `topDocumentsPerTopic`: The top documents for each topic and
|
||||
the corresponding weight of the topic in the documents.
|
||||
* `logPrior`: log probability of the estimated topics and
|
||||
document-topic distributions given the hyperparameters
|
||||
`docConcentration` and `topicConcentration`
|
||||
* `logLikelihood`: log likelihood of the training corpus, given the
|
||||
inferred topics and document-topic distributions
|
||||
|
||||
**Online Variational Bayes**
|
||||
|
||||
Implemented in
|
||||
[`OnlineLDAOptimizer`](api/scala/org/apache/spark/mllib/clustering/OnlineLDAOptimizer.html)
|
||||
and
|
||||
[`LocalLDAModel`](api/scala/org/apache/spark/mllib/clustering/LocalLDAModel.html).
|
||||
|
||||
For the parameters provided to `LDA`:
|
||||
|
||||
* `docConcentration`: Asymmetric priors can be used by passing in a
|
||||
vector with values equal to the Dirichlet parameter in each of the `k`
|
||||
dimensions. Values should be $>= 0$. Providing `Vector(-1)` results in
|
||||
default behavior (uniform `k` dimensional vector with value $(1.0 / k)$)
|
||||
* `topicConcentration`: Only symmetric priors supported. Values must be
|
||||
$>= 0$. Providing `-1` results in defaulting to a value of $(1.0 / k)$.
|
||||
* `maxIterations`: Maximum number of minibatches to submit.
|
||||
|
||||
In addition, `OnlineLDAOptimizer` accepts the following parameters:
|
||||
|
||||
* `miniBatchFraction`: Fraction of corpus sampled and used at each
|
||||
iteration
|
||||
* `optimizeDocConcentration`: If set to true, performs maximum-likelihood
|
||||
estimation of the hyperparameter `docConcentration` (aka `alpha`)
|
||||
after each minibatch and sets the optimized `docConcentration` in the
|
||||
returned `LocalLDAModel`
|
||||
* `tau0` and `kappa`: Used for learning-rate decay, which is computed by
|
||||
$(\tau_0 + iter)^{-\kappa}$ where $iter$ is the current number of iterations.
|
||||
|
||||
`OnlineLDAOptimizer` produces a `LocalLDAModel`, which only stores the
|
||||
inferred topics. A `LocalLDAModel` supports:
|
||||
|
||||
* `logLikelihood(documents)`: Calculates a lower bound on the provided
|
||||
`documents` given the inferred topics.
|
||||
* `logPerplexity(documents)`: Calculates an upper bound on the
|
||||
perplexity of the provided `documents` given the inferred topics.
|
||||
|
||||
**Examples**
|
||||
|
||||
|
|
|
@ -435,7 +435,6 @@ object LocalLDAModel extends Loader[LocalLDAModel] {
|
|||
}
|
||||
val topicsMat = Matrices.fromBreeze(brzTopics)
|
||||
|
||||
// TODO: initialize with docConcentration, topicConcentration, and gammaShape after SPARK-9940
|
||||
new LocalLDAModel(topicsMat, docConcentration, topicConcentration, gammaShape)
|
||||
}
|
||||
}
|
||||
|
|
|
@ -68,6 +68,7 @@ class LDASuite extends SparkFunSuite with MLlibTestSparkContext {
|
|||
// Train a model
|
||||
val lda = new LDA()
|
||||
lda.setK(k)
|
||||
.setOptimizer(new EMLDAOptimizer)
|
||||
.setDocConcentration(topicSmoothing)
|
||||
.setTopicConcentration(termSmoothing)
|
||||
.setMaxIterations(5)
|
||||
|
|
Loading…
Reference in a new issue