Combining labeled and unlabeled data in statistical natural language parsing

Anoop Sarkar, University of Pennsylvania


Ambiguity resolution in the parsing of natural language requires a vast repository of knowledge to guide disambiguation. An effective approach to this problem is to use machine learning algorithms to acquire the needed knowledge and to extract generalizations about disambiguation decisions. Such parsing methods require a corpus-based approach with a collection of correct parses compiled by human experts. Current statistical parsing models suffer from sparse data problems, and experiments have indicated that more labeled data will improve performance. In this dissertation, we explore methods that combine human supervision with machine learning algorithms to extend accuracy beyond what is possible with limited amounts of labeled data. In each case we do this by exposing a machine learning algorithm to unlabeled data in addition to the existing labeled data. Most recent research in parsing has shown the advantage of having a lexicalized model, where the word relationships mediate knowledge about disambiguation decisions. We use Lexicalized Tree Adjoining Grammars (TAGs) as the basis of our machine learning algorithm since they arise naturally from the lexicalization of Context-Free Grammars (CFGs). We show in this dissertation that probability measures applied to TAGs retain the simplicity of probabilistic CFGs (PCFGs) along with their elegant formal properties and that, while PCFGs need additional independence assumptions to be useful in statistical parsing, no such changes need to be made to probabilistic TAGs. The main results presented in this dissertation are: (1) We extend the Co-Training algorithm (Yarowsky 1995; Blum and Mitchell 1998), a machine learning technique for combining labeled and unlabeled data previously used with classifiers with two or three labels, to the more complex problem of statistical parsing. 
Using empirical results based on parsing the Wall Street Journal corpus, we show that training a statistical parser on the combined labeled and unlabeled data strongly outperforms training only on the labeled data. (2) We present a machine learning algorithm that can be used to discover previously unknown subcategorization frames. The algorithm can then be used to label dependents of a verb in a treebank as either arguments or adjuncts. We use this algorithm to augment the Czech Dependency Treebank with argument/adjunct information. (3) We extend a supervised classifier for automatically identifying verb alternation classes for a set of verbs so that it can be used on minimally annotated data. Previous work (Merlo and Stevenson 2001) provided a classifier for this task that used automatically parsed text. By learning subcategorization frames, we construct the same type of classifier, which now requires only text annotated with part-of-speech tags and phrasal chunks. In each of these results, we use some existing linguistic resource that has been annotated by humans and add some further significant linguistic annotation by applying statistical machine learning algorithms.

Subject Area

Computer science|Linguistics|Artificial intelligence

Recommended Citation

Sarkar, Anoop, "Combining labeled and unlabeled data in statistical natural language parsing" (2002). Dissertations available from ProQuest. AAI3043949.