Controlling Complexity in Part-of-Speech Induction

Penn collection
Departmental Papers (CIS)
Subject
Computer Sciences
Author
Graça, João V
Ganchev, Kuzman
Coheur, Luísa
Pereira, Fernando
Abstract

We consider the problem of fully unsupervised learning of grammatical (part-of-speech) categories from unlabeled text. The standard maximum-likelihood hidden Markov model performs poorly on this task because of its weak inductive bias and large model capacity. We address this problem by refining the model and modifying the learning objective to control its capacity via parametric and non-parametric constraints. Our approach enforces word-category association sparsity, adds morphological and orthographic features, and eliminates hard-to-estimate parameters for rare words. We develop an efficient learning algorithm that is not much more computationally intensive than standard training, and we provide an open-source implementation of it. In experiments on five diverse languages (Bulgarian, Danish, English, Portuguese, Spanish), our approach achieves significant improvements over previous methods for the same task.
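
As a rough illustration of what word-category association sparsity can mean in practice, the sketch below scores a tagging by how many distinct categories each word type is associated with. The abstract does not say how sparsity is measured, so the measure used here (for each word type and tag, take the largest posterior over that word's occurrences, then sum) is an assumption, as are the function name `l1_linf_sparsity`, the toy tokens, and the tag labels.

```python
from collections import defaultdict


def l1_linf_sparsity(tokens, posteriors):
    """Sum, over (word type, tag) pairs, of the largest posterior probability
    that any occurrence of the word assigns to the tag.

    tokens: list of word strings, one per corpus position.
    posteriors: list of dicts mapping tag -> probability, one per position.
    Illustrative measure only (an assumption, not taken from the abstract).
    """
    max_post = defaultdict(float)
    for word, dist in zip(tokens, posteriors):
        for tag, p in dist.items():
            key = (word, tag)
            # Keep the maximum posterior seen so far for this word/tag pair.
            if p > max_post[key]:
                max_post[key] = p
    return sum(max_post.values())


# Toy corpus: the same word types, tagged consistently vs. inconsistently.
tokens = ["the", "run", "the", "run"]
sparse = [{"DET": 1.0}, {"VERB": 1.0}, {"DET": 1.0}, {"VERB": 1.0}]
diffuse = [{"DET": 1.0}, {"VERB": 1.0}, {"DET": 1.0}, {"NOUN": 1.0}]

sparse_score = l1_linf_sparsity(tokens, sparse)
diffuse_score = l1_linf_sparsity(tokens, diffuse)
print(sparse_score, diffuse_score)
```

Under this measure a lower score means each word type is tied to fewer categories, so a learner that penalizes the score is pushed toward tagging each word consistently.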

Publication date
2011-07-01
Comments
Graça, J., Ganchev, K., Pereira, F., Coheur, L., & Taskar, B. (2011). Controlling Complexity in Part-of-Speech Induction. Journal of Artificial Intelligence Research, 41, 527-551. doi: http://dx.doi.org/10.1613/jair.3348
© 2011 Advancement of Artificial Intelligence