Controlling Complexity in Part-of-Speech Induction

Graça, João V; Ganchev, Kuzman; Coheur, Luísa; Pereira, Fernando; Taskar, Ben

Penn collection

Departmental Papers (CIS)

Subject

Computer Sciences

Permalink

https://repository.upenn.edu/handle/20.500.14332/6550

Author

Graça, João V

Ganchev, Kuzman

Coheur, Luísa

Pereira, Fernando

Taskar, Ben

Abstract

We consider the problem of fully unsupervised learning of grammatical (part-of-speech) categories from unlabeled text. The standard maximum-likelihood hidden Markov model performs poorly on this task because of its weak inductive bias and large model capacity. We address this problem by refining the model and modifying the learning objective to control its capacity via parametric and non-parametric constraints. Our approach enforces word-category association sparsity, adds morphological and orthographic features, and eliminates hard-to-estimate parameters for rare words. We develop an efficient learning algorithm that is not much more computationally intensive than standard training, and we provide an open-source implementation of it. Our experiments on five diverse languages (Bulgarian, Danish, English, Portuguese, Spanish) show significant improvements over previous methods for the same task.

Publication date

2011-07-01

Comments

Graça, J., Ganchev, K., Pereira, F., Coheur, L., & Taskar, B. (2011). Controlling Complexity in Part-of-Speech Induction. Journal of Artificial Intelligence Research, 41, 527-551. doi: http://dx.doi.org/10.1613/jair.3348 © 2011 Advancement of Artificial Intelligence

Collection

Articles