A Spectral Algorithm for Latent Dirichlet Allocation

Penn collection
Statistics Papers
Discipline
Subject
topic models
mixture models
method of moments
latent dirichlet allocation
Statistics and Probability
Author
Anandkumar, Anima
Foster, Dean P
Hsu, Daniel
Kakade, Sham
Liu, Yi-Kai
Abstract

Topic modeling is a generalization of clustering that posits that observations (words in a document) are generated by multiple latent factors (topics), as opposed to just one. The increased representational power comes at the cost of a more challenging unsupervised learning problem for estimating the topic-word distributions when only words are observed, and the topics are hidden. This work provides a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of topic models, including Latent Dirichlet Allocation (LDA). For LDA, the procedure correctly recovers both the topic-word distributions and the parameters of the Dirichlet prior over the topic mixtures, using only trigram statistics (i.e., third order moments, which may be estimated with documents containing just three words). The method, called Excess Correlation Analysis, is based on a spectral decomposition of low-order moments via two singular value decompositions (SVDs). Moreover, the algorithm is scalable, since the SVDs are carried out only on k × k matrices, where k is the number of latent factors (topics) and is typically much smaller than the dimension of the observation (word) space.
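
The two-SVD idea described in the abstract can be illustrated with a minimal NumPy sketch. It rests on several assumptions that are not taken from the paper: it handles only the simpler single-topic mixture case (one topic per document) rather than LDA, which additionally needs Dirichlet-adjusted moments; it uses exact population moments instead of estimates from documents; and the function names `exact_moments`, `recover_topics`, and `max_recovery_error` are hypothetical.

```python
import numpy as np


def exact_moments(O, w):
    """Population moments of a single-topic mixture (an assumption, not LDA).

    O: d x k matrix whose columns are topic-word distributions.
    w: length-k vector of topic probabilities.
    """
    pairs = O @ np.diag(w) @ O.T  # E[x1 x2^T]

    def triples(eta):
        # E[x1 x2^T <eta, x3>] = O diag(w * (O^T eta)) O^T
        return O @ np.diag(w * (O.T @ eta)) @ O.T

    return pairs, triples


def recover_topics(pairs, triples, k, seed=0):
    """Recover the d x k topic-word matrix, up to column order."""
    rng = np.random.default_rng(seed)
    d = pairs.shape[0]

    # Orthonormal basis U (d x k) for the range of Pairs, via random projection.
    U, _ = np.linalg.qr(pairs @ rng.standard_normal((d, k)))

    # SVD 1 (on a k x k matrix): build W with W^T Pairs W = I (whitening).
    A, s, _ = np.linalg.svd(U.T @ pairs @ U)
    W = U @ A @ np.diag(s ** -0.5)

    # SVD 2 (on a k x k matrix): the whitened third moment along a random
    # direction has the whitened topics as its singular vectors.
    theta = rng.standard_normal(k)
    V, _, _ = np.linalg.svd(W.T @ triples(W @ theta) @ W)

    # Undo the whitening; dividing by the column sum fixes scale and sign,
    # since each topic-word distribution sums to 1.
    cols = np.linalg.pinv(W.T) @ V
    return cols / cols.sum(axis=0, keepdims=True)


def max_recovery_error(d=8, k=3, seed=1):
    """Largest entrywise error after matching each true topic to its nearest estimate."""
    rng = np.random.default_rng(seed)
    O = rng.dirichlet(np.ones(d), size=k).T      # d x k, columns sum to 1
    w = np.arange(1, k + 1) / np.arange(1, k + 1).sum()
    pairs, triples = exact_moments(O, w)
    O_hat = recover_topics(pairs, triples, k)
    # diffs[i, j] = max abs difference between true topic i and estimate j
    diffs = np.abs(O[:, :, None] - O_hat[:, None, :]).max(axis=0)
    return diffs.min(axis=1).max()
```

Both decompositions act on k × k matrices, so the cost of the SVD steps does not grow with the vocabulary size d; in this sketch only the matrix products touch the full d × d moments. With empirical moments the recovery would be approximate rather than exact.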

Publication date
2015-05-01
Journal title
Algorithmica