spark-instrumented-optimizer/examples
Joseph K. Bradley 980764f3c0 [SPARK-1405] [mllib] Latent Dirichlet Allocation (LDA) using EM
**This PR introduces an API + simple implementation for Latent Dirichlet Allocation (LDA).**

The [design doc for this PR](https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo) has been updated since I initially posted it.  In particular, see the API and Planning for the Future sections.

Goals:
* Settle on a public API which may eventually include:
  * more inference algorithms
  * more options / functionality
* Have an initial easy-to-understand implementation which others may improve.
* This is NOT intended to support every topic model out there.  However, if there are suggestions for making this extensible or pluggable in the future, that could be nice, as long as it does not complicate the API or implementation too much.
* This may not be very scalable currently.  It will be important to check and improve accuracy.  For correctness of the implementation, please check against the Asuncion et al. (2009) paper in the design doc.

**Dependency: This makes MLlib depend on GraphX.**

Files and classes:
* LDA.scala (441 lines)
  * class LDA (main estimator class)
  * LDA.Document (text + document ID)
* LDAModel.scala (266 lines)
  * abstract class LDAModel
  * class LocalLDAModel
  * class DistributedLDAModel
* LDAExample.scala (245 lines): script to run LDA + a simple (private) Tokenizer
* LDASuite.scala (144 lines)

Data/model representation and algorithm:
* Data/model: Uses GraphX, with term vertices + document vertices
* Algorithm: EM, following [Asuncion, Welling, Smyth, and Teh.  "On Smoothing and Inference for Topic Models."  UAI, 2009.](http://arxiv-web3.library.cornell.edu/abs/1205.2662v1)
* For more details, please see the description in the “DEVELOPERS NOTE” in LDA.scala
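
To make the term-vertex + document-vertex representation concrete, here is a minimal sketch in plain Scala with no Spark or GraphX dependency. `TokenEdge` and `BipartiteSketch` are hypothetical names used only for illustration; the PR itself stores this structure in a GraphX graph, and its vertex ID scheme is not reproduced here. Each nonzero entry of a document's token count vector becomes one edge between that document and that term.

```scala
// Hypothetical stand-in for a graph edge; not a class from the PR.
final case class TokenEdge(docId: Long, termId: Int, count: Double)

object BipartiteSketch {
  /** Emit one edge per (document, term) pair with a nonzero token count. */
  def toEdges(docs: Seq[(Long, Array[Double])]): Seq[TokenEdge] =
    for {
      (docId, counts) <- docs
      (count, termId) <- counts.zipWithIndex.toSeq
      if count > 0.0
    } yield TokenEdge(docId, termId, count)
}
```

In an EM-style algorithm on such a graph, messages would flow along these edges between document and term vertices; that part is omitted here because it depends on details in LDA.scala.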

Please refer to the JIRA for more discussion + the [design doc for this PR](https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo)

Here, I list the main changes AFTER the design doc was posted.

Design decisions:
* logLikelihood() computes the log likelihood of the data and the current point estimate of parameters.  This is different from the likelihood of the data given the hyperparameters, which would be harder to compute.  I’d describe the current approach as more frequentist, whereas the harder approach would be more Bayesian.
* The current API takes Documents as token count vectors.  I believe there should be an extended API taking RDD[String] or RDD[Array[String]] in a future PR.  I have sketched this out in the design doc (as well as handier versions of getTopics returning Strings).
* Hyperparameters should be set differently for different inference/learning algorithms.  See Asuncion et al. (2009) in the design doc for a good demonstration.  I encourage good behavior via defaults and warning messages.
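
As an illustration of the first design decision, the following sketch computes a token log likelihood from point estimates, i.e. the sum over documents d and terms w of n_dw * log(sum_k theta_dk * phi_kw). This is one common way to write that quantity and is an assumption about the general idea, not a copy of what logLikelihood() does in the PR (which may, for example, treat priors differently). `LogLikelihoodSketch` and its parameter names are hypothetical.

```scala
object LogLikelihoodSketch {
  /**
   * counts(d)(w): number of times term w occurs in document d.
   * theta(d)(k):  estimated weight of topic k in document d (each row sums to 1).
   * phi(k)(w):    estimated probability of term w under topic k (each row sums to 1).
   */
  def tokenLogLikelihood(
      counts: Array[Array[Double]],
      theta: Array[Array[Double]],
      phi: Array[Array[Double]]): Double = {
    var total = 0.0
    for (d <- counts.indices; w <- counts(d).indices if counts(d)(w) > 0.0) {
      // Probability of term w in document d, marginalizing over topics.
      val pTerm = theta(d).indices.map(k => theta(d)(k) * phi(k)(w)).sum
      total += counts(d)(w) * math.log(pTerm)
    }
    total
  }
}
```

The likelihood of the data given only the hyperparameters would instead require integrating over theta and phi, which is why the description calls it harder to compute.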

Items planned for future PRs:
* perplexity
* API taking Strings
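
Until an API taking Strings exists, callers need to build token count vectors themselves. The sketch below shows one way this could be done on local collections; `TokenCountSketch`, `buildVocab`, and `toCounts` are hypothetical helpers, not part of this PR, and a real pipeline would run the same logic over RDDs and pair each vector with a document ID.

```scala
object TokenCountSketch {
  /** Build a vocabulary (term -> index) in order of first appearance. */
  def buildVocab(docs: Seq[Seq[String]]): Map[String, Int] =
    docs.flatten.distinct.zipWithIndex.toMap

  /** Convert one tokenized document into a dense count vector over the vocabulary. */
  def toCounts(tokens: Seq[String], vocab: Map[String, Int]): Array[Double] = {
    val counts = Array.fill(vocab.size)(0.0)
    // Tokens missing from the vocabulary are skipped.
    for (t <- tokens; i <- vocab.get(t)) counts(i) += 1.0
    counts
  }
}
```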

Questions:
* Should LDA be called LatentDirichletAllocation (and LDAModel be LatentDirichletAllocationModel)?
  * Pro: We may someday want LinearDiscriminantAnalysis.
  * Con: Very long names

* Should LDA reside in clustering?  Or do we want a sub-package?
  * mllib.topicmodel
  * mllib.clustering.topicmodel

* Does the API seem reasonable and extensible?

* Unit tests:
  * Should there be a test which checks clustering results?  E.g., train on a small, fake dataset with 2 very distinct topics/clusters, and ensure LDA finds those 2 topics/clusters.  Does that sound useful or too flaky?
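
For the unit-test question, a small fake dataset with 2 very distinct topics could be built deterministically, which may help with flakiness on the data side. The sketch below is only an assumption about what such a fixture might look like; `TwoTopicCorpusSketch` is a hypothetical name, and the LDA fitting call is left out because it depends on the API under review.

```scala
object TwoTopicCorpusSketch {
  /**
   * Documents with even ids use only the first half of the vocabulary;
   * documents with odd ids use only the second half.
   */
  def corpus(numDocs: Int, vocabSize: Int, tokensPerTerm: Double): Seq[(Long, Array[Double])] = {
    require(vocabSize % 2 == 0, "vocabSize must be even")
    val half = vocabSize / 2
    (0 until numDocs).map { d =>
      val counts = Array.tabulate(vocabSize) { w =>
        val inFirstHalf = w < half
        if ((d % 2 == 0) == inFirstHalf) tokensPerTerm else 0.0
      }
      (d.toLong, counts)
    }
  }
}
```

A test could then train with 2 topics and check that each fitted topic puts most of its weight on one half of the vocabulary, allowing for either ordering of the topics.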

Scaling has not been tested much.  I have run it on a laptop for 200 iterations on a 5MB dataset with 1000 terms and 5 topics.  Running it for 500 iterations made it fail because of GC problems.  I'm running larger-scale tests and will post results here, but future PRs may need to improve the scaling.

Thanks to:
* dlwh for the initial implementation
  * and jegonzal for some code in the initial implementation
* The many contributors to topic model implementations in Spark which were referenced as a basis for this PR: akopich witgo yinxusen dlwh EntilZha jegonzal IlyaKozlov
  * Note: The plan is to include this full list in the authors if this PR gets merged.  Please notify me if you prefer otherwise.

CC: mengxr

Authors:
  Joseph K. Bradley <joseph@databricks.com>
  Joseph Gonzalez <joseph.e.gonzalez@gmail.com>
  David Hall <david.lw.hall@gmail.com>
  Guoqiang Li <witgo@qq.com>
  Xiangrui Meng <meng@databricks.com>
  Pedro Rodriguez <pedro@snowgeek.org>
  Avanesov Valeriy <acopich@gmail.com>
  Xusen Yin <yinxusen@gmail.com>

Closes #2388
Closes #4047 from jkbradley/davidhall-lda and squashes the following commits:

77e8814 [Joseph K. Bradley] small doc fix
5c74345 [Joseph K. Bradley] cleaned up doc based on code review
589728b [Joseph K. Bradley] Updates per code review.  Main change was in LDAExample for faster vocab computation.  Also updated PeriodicGraphCheckpointerSuite.scala to clean up checkpoint files at end
e3980d2 [Joseph K. Bradley] cleaned up PeriodicGraphCheckpointerSuite.scala
74487e5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into davidhall-lda
4ae2a7d [Joseph K. Bradley] removed duplicate graphx dependency in mllib/pom.xml
e391474 [Joseph K. Bradley] Removed LDATiming.  Added PeriodicGraphCheckpointerSuite.scala.  Small LDA cleanups.
e8d8acf [Joseph K. Bradley] Added catch for BreakIterator exception.  Improved preprocessing to reduce passes over data
1a231b4 [Joseph K. Bradley] fixed scalastyle
91aadfe [Joseph K. Bradley] Added Java-friendly run method to LDA. Added Java test suite for LDA. Changed LDAModel.describeTopics to return Java-friendly type
b75472d [Joseph K. Bradley] merged improvements from LDATiming into LDAExample.  Will remove LDATiming after done testing
993ca56 [Joseph K. Bradley] * Removed Document type in favor of (Long, Vector) * Changed doc ID restriction to be: id must be nonnegative and unique in the doc (instead of 0,1,2,...) * Add checks for valid ranges of eta, alpha * Rename “LearningState” to “EMOptimizer” * Renamed params: termSmoothing -> topicConcentration, topicSmoothing -> docConcentration   * Also added aliases alpha, beta
cb5a319 [Joseph K. Bradley] Added checkpointing to LDA * new class PeriodicGraphCheckpointer * params checkpointDir, checkpointInterval to LDA
43c1c40 [Joseph K. Bradley] small cleanup
0b90393 [Joseph K. Bradley] renamed LDA LearningState.collectTopicTotals to globalTopicTotals
77a2c85 [Joseph K. Bradley] Moved auto term,topic smoothing computation to get*Smoothing methods.  Changed word to term in some places.  Updated LDAExample to use default smoothing amounts.
fb1e7b5 [Xiangrui Meng] minor
08d59a3 [Xiangrui Meng] reset spacing
9fe0b95 [Xiangrui Meng] optimize aggregateMessages
cec0a9c [Xiangrui Meng] * -> *=
6cb11b0 [Xiangrui Meng] optimize computePTopic
9eb3d02 [Xiangrui Meng] + -> +=
892530c [Xiangrui Meng] use axpy
45cc7f2 [Xiangrui Meng] mapPart -> flatMap
ce53be9 [Joseph K. Bradley] fixed example name
75749e7 [Joseph K. Bradley] scala style fix
9f2a492 [Joseph K. Bradley] Unit tests and fixes for LDA, now ready for PR
377ebd9 [Joseph K. Bradley] separated LDA models into own file.  more cleanups before PR
2d40006 [Joseph K. Bradley] cleanups before PR
2891e89 [Joseph K. Bradley] Prepped LDA main class for PR, but some cleanups remain
0cb7187 [Joseph K. Bradley] Added 3 files from dlwh LDA implementation
2015-02-02 23:57:37 -08:00
scala-2.10/src/main [SPARK-5233][Streaming] Fix error replaying of WAL introduced bug 2015-01-22 21:58:53 -08:00
src/main [SPARK-1405] [mllib] Latent Dirichlet Allocation (LDA) using EM 2015-02-02 23:57:37 -08:00
pom.xml [SPARK-4809] Rework Guava library shading. 2015-01-28 00:29:29 -08:00