Publication: Learning Determinantal Point Processes (2011-07-01) Taskar, Ben; Kulesza, Alex
Determinantal point processes (DPPs), which arise in random matrix theory and quantum physics, are natural models for subset selection problems where diversity is preferred. Among many remarkable properties, DPPs offer tractable algorithms for exact inference, including computing marginal probabilities and sampling; however, an important open question has been how to learn a DPP from labeled training data. In this paper we propose a natural feature-based parameterization of conditional DPPs, and show how it leads to a convex and efficient learning formulation. We analyze the relationship between our model and binary Markov random fields with repulsive potentials, which are qualitatively similar but computationally intractable. Finally, we apply our approach to the task of extractive summarization, where the goal is to choose a small subset of sentences conveying the most important information from a set of documents. In this task there is a fundamental tradeoff between sentences that are highly relevant to the collection as a whole, and sentences that are diverse and not repetitive. Our parameterization allows us to naturally balance these two characteristics. We evaluate our system on data from the DUC 2003/04 multi-document summarization task, achieving state-of-the-art results.

Publication: Posterior Sparsity in Unsupervised Dependency Parsing (2010-01-01) Gillenwater, Jennifer; Ganchev, Kuzman; Graça, João; Pereira, Fernando; Taskar, Ben
A strong inductive bias is essential in unsupervised grammar induction. In this paper, we explore a particular sparsity bias in dependency grammars that encourages a small number of unique dependency types.
We use part-of-speech (POS) tags to group dependencies by parent-child types and investigate sparsity-inducing penalties on the posterior distributions of parent-child POS tag pairs in the posterior regularization (PR) framework of Graça et al. (2007). In experiments with 12 different languages, we achieve significant gains in directed accuracy over the standard expectation maximization (EM) baseline for 9 of the languages, with an average accuracy improvement of 6%. Further, we show that for 8 out of 12 languages, the new method outperforms models based on standard Bayesian sparsity-inducing parameter priors, with an average improvement of 4%. On English text in particular, we show that our approach improves performance over other state-of-the-art techniques.

Publication: A permutation-augmented sampler for DP mixture models (2007-06-01) Taskar, Ben; Liang, Percy; Jordan, Michael
We introduce a new inference algorithm for Dirichlet process mixture models. While Gibbs sampling and variational methods focus on local moves, the new algorithm makes more global moves. This is done by introducing a permutation of the data points as an auxiliary variable. The algorithm is a blocked sampler which alternates between sampling the clustering and sampling the permutation. The key to the efficiency of this approach is that it is possible to use dynamic programming to consider all exponentially many clusterings consistent with a given permutation. We also show that random projections can be used to effectively sample the permutation. The result is a stochastic hill-climbing algorithm that yields burn-in times significantly smaller than those of collapsed Gibbs sampling.

Publication: Sparsity in Dependency Grammar Induction (2010-07-01) Taskar, Ben; Pereira, Fernando CN; Graca, Joao V; Gillenwater, Jennifer; Ganchev, Kuzman
A strong inductive bias is essential in unsupervised grammar induction.
We explore a particular sparsity bias in dependency grammars that encourages a small number of unique dependency types. Specifically, we investigate sparsity-inducing penalties on the posterior distributions of parent-child POS tag pairs in the posterior regularization (PR) framework of Graça et al. (2007). In experiments with 12 languages, we achieve substantial gains over the standard expectation maximization (EM) baseline, with average improvement in attachment accuracy of 6.3%. Further, our method outperforms models based on a standard Bayesian sparsity-inducing prior by an average of 4.9%. On English in particular, we show that our approach improves on several other state-of-the-art techniques.

Publication: Learning Sparse Markov Network Structure via Ensemble-of-Trees Models (2009-04-01) Taskar, Ben; Lin, Yuanqing; Zhu, Shenghuo; Lee, Daniel
Learning the sparse structure of a general Markov network is a hard computational problem. One of the main difficulties is the computation of the generally intractable partition function. To circumvent this difficulty, we propose to learn the network structure using an ensemble-of-trees (ET) model. The ET model was first introduced by Meilă and Jaakkola (2006), and it represents a multivariate distribution using a mixture of all possible spanning trees. The advantage of the ET model is that, although it needs to sum over super-exponentially many trees, its partition function as well as data likelihood can be computed in a closed form. Furthermore, because the ET model tends to represent a Markov network using as small a number of trees as possible, it provides a natural regularization for finding a sparse network structure. Our simulation results show that the proposed ET approach is able to accurately recover the true Markov network connectivity and outperform the state-of-the-art approaches for both discrete and continuous random variable networks when a small number of data samples is available.
Furthermore, we also demonstrate the usage of the ET model for discovering the network of words from blog posts.

Publication: Expectation Maximization and Posterior Constraints (2007-12-01) Graca, Joao V; Ganchev, Kuzman; Taskar, Ben
The expectation maximization (EM) algorithm is a widely used maximum likelihood estimation procedure for statistical models when the values of some of the variables in the model are not observed. Very often, however, our aim is primarily to find a model that assigns values to the latent variables that have the intended meaning for our data, and maximizing expected likelihood only sometimes accomplishes this. Unfortunately, it is typically difficult to add even simple a priori information about latent variables in graphical models without making the models overly complex or intractable. In this paper, we present an efficient, principled way to inject rich constraints on the posteriors of latent variables into the EM algorithm. Our method can be used to learn tractable graphical models that satisfy additional, otherwise intractable constraints. Focusing on clustering and the alignment problem for statistical machine translation, we show that simple, intuitive posterior constraints can greatly improve the performance over standard baselines and be competitive with more complex, intractable models.

Publication: Learning from Partial Labels (2011-04-01) Cour, Timothee; Sapp, Benjamin; Taskar, Ben
We address the problem of partially-labeled multiclass classification, where instead of a single label per instance, the algorithm is given a candidate set of labels, only one of which is correct. Our setting is motivated by a common scenario in many image and video collections, where only partial access to labels is available. The goal is to learn a classifier that can disambiguate the partially-labeled training instances, and generalize to unseen data.
We define an intuitive property of the data distribution that sharply characterizes the ability to learn in this setting and show that effective learning is possible even when all the data is only partially labeled. Exploiting this property of the data, we propose a convex learning formulation based on minimization of a loss function appropriate for the partial label setting. We analyze the conditions under which our loss function is asymptotically consistent, as well as its generalization and transductive performance. We apply our framework to identifying faces culled from web news sources and to naming characters in TV series and movies; in particular, we annotated and experimented on a very large video data set and achieved 6% error for character naming on 16 episodes of the TV series Lost.

Publication: Generative-Discriminative Basis Learning for Medical Imaging (2011-01-01) Taskar, Ben; Batmanghelich, Nematollah K; Davatzikos, Christos
This paper presents a novel dimensionality reduction method for classification in medical imaging. The goal is to transform very high-dimensional input (typically, millions of voxels) to a low-dimensional representation (small number of constructed features) that preserves discriminative signal and is clinically interpretable. We formulate the task as a constrained optimization problem that combines generative and discriminative objectives and show how to extend it to the semi-supervised learning (SSL) setting. We propose a novel large-scale algorithm to solve the resulting optimization problem. In the fully supervised case, we demonstrate accuracy rates that are better than or comparable to state-of-the-art algorithms on several datasets while producing a representation of the group difference that is consistent with prior clinical reports. Effectiveness of the proposed algorithm for SSL is evaluated with both benchmark and medical imaging datasets.
In the benchmark datasets, the results are better than or comparable to the state-of-the-art methods for SSL. For evaluation of the SSL setting in medical datasets, we use images of subjects with Mild Cognitive Impairment (MCI), which is believed to be a precursor to Alzheimer’s disease (AD), as unlabeled data. AD subjects and Normal Control (NC) subjects are used as labeled data, and we try to predict conversion from MCI to AD on follow-up. The semi-supervised extension of this method not only slightly improves the generalization accuracy for the labeled data (AD/NC) but is also able to predict which subjects are likely to convert to AD.

Publication: Mixture-of-Parents Maximum Entropy Markov Models (2007-07-01) Rosenberg, David; Klein, Dan; Taskar, Ben
We present the mixture-of-parents maximum entropy Markov model (MoP-MEMM), a class of directed graphical models extending MEMMs. The MoP-MEMM allows tractable incorporation of long-range dependencies between nodes by restricting the conditional distribution of each node to be a mixture of distributions given the parents. We show how to efficiently compute the exact marginal posterior node distributions, regardless of the range of the dependencies. This enables us to model non-sequential correlations present within text documents, as well as between interconnected documents, such as hyperlinked web pages. We apply the MoP-MEMM to a named entity recognition task and a web page classification task. In each, our model shows significant improvement over the basic MEMM, and is competitive with other long-range sequence models that use approximate inference.

Publication: Learning Sparse Markov Network Structure via Ensemble-of-Trees Model (2009-04-01) Taskar, Ben; Zhu, Shenghuo; Lin, Yuanqing; Lee, David D
Learning the sparse structure of a general Markov network is a hard computational problem. One of the main difficulties is the computation of the generally intractable partition function.
To circumvent this difficulty, we propose to learn the network structure using an ensemble-of-trees (ET) model. The ET model was first introduced by Meilă and Jaakkola (2006), and it represents a multivariate distribution using a mixture of all possible spanning trees. The advantage of the ET model is that, although it needs to sum over super-exponentially many trees, its partition function as well as data likelihood can be computed in a closed form. Furthermore, because the ET model tends to represent a Markov network using as small a number of trees as possible, it provides a natural regularization for finding a sparse network structure. Our simulation results show that the proposed ET approach is able to accurately recover the true Markov network connectivity and outperform the state-of-the-art approaches for both discrete and continuous random variable networks when a small number of data samples is available. Furthermore, we also demonstrate the usage of the ET model for discovering the network of words from blog posts.
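
The abstract above says that a sum over super-exponentially many spanning trees can be computed in closed form. One standard tool for sums of this kind is the weighted matrix-tree theorem, which replaces the sum with a single determinant. The sketch below is only a generic illustration of that theorem and should not be read as the authors' implementation; the function names `spanning_tree_weight_sum` and `brute_force_tree_sum` and the toy weight matrices are hypothetical, and treating the ET normalizer as an edge-weight product summed over trees is an assumption made for the example.

```python
import itertools
import numpy as np


def spanning_tree_weight_sum(W):
    """Sum over all spanning trees of the product of their edge weights.

    W is a symmetric (n x n) matrix of non-negative edge weights with a
    zero diagonal. By the weighted matrix-tree theorem, the sum equals any
    cofactor of the weighted graph Laplacian.
    """
    W = np.asarray(W, dtype=float)
    laplacian = np.diag(W.sum(axis=1)) - W
    # Delete the first row and column, then take the determinant.
    return np.linalg.det(laplacian[1:, 1:])


def _find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def brute_force_tree_sum(W):
    """Same quantity by explicit enumeration (only feasible for tiny n)."""
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)]
    total = 0.0
    # A set of n - 1 edges with no cycle is exactly a spanning tree.
    for subset in itertools.combinations(edges, n - 1):
        parent = list(range(n))
        product = 1.0
        is_tree = True
        for i, j in subset:
            ri, rj = _find(parent, i), _find(parent, j)
            if ri == rj:  # this edge would close a cycle
                is_tree = False
                break
            parent[ri] = rj
            product *= W[i, j]
        if is_tree:
            total += product
    return total


if __name__ == "__main__":
    # Complete graph on 4 nodes with unit weights: the sum counts trees.
    k4 = np.ones((4, 4)) - np.eye(4)
    print(round(spanning_tree_weight_sum(k4)))  # Cayley's formula: 4**2 = 16

    # Random symmetric weights: determinant and enumeration should agree.
    rng = np.random.default_rng(0)
    A = rng.random((5, 5))
    W = np.triu(A, 1) + np.triu(A, 1).T
    print(np.isclose(spanning_tree_weight_sum(W), brute_force_tree_sum(W)))
```

The determinant costs O(n^3) regardless of how many trees the graph has, whereas the enumeration grows combinatorially; that gap is the sense in which a sum over all spanning trees can be "closed form".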