A Corpus-Based Approach to Language Learning
One goal of computational linguistics is to discover a method for assigning a rich structural annotation to sentences that are presented as simple linear strings of words. Meaning can be extracted much more readily from a structurally annotated sentence than from a sentence with no structural information, and structure also allows a more thorough check of a sentence's well-formedness. Assigning these structural annotations involves two phases: first, a knowledge base is created; second, an algorithm uses the facts in the knowledge base to generate a structural annotation for a sentence. Until recently, most knowledge bases were created manually by language experts. Such knowledge bases are expensive to create and have not been used effectively to parse sentences structurally outside highly restricted domains. The goal of this dissertation is to make significant progress toward designing automata that can learn some structural aspects of human language with little human guidance. In particular, we describe a learning algorithm that takes a small structurally annotated corpus of text and a larger unannotated corpus as input, and automatically learns how to assign accurate structural descriptions to sentences not in the training corpus. The main tool we use to discover structural information about language from corpora automatically is transformation-based error-driven learning: the distribution of errors produced by an imperfect annotator is examined to learn an ordered list of transformations that can be applied to provide an accurate structural annotation. We demonstrate the application of this learning algorithm to part-of-speech tagging and parsing. Successfully applying this technique to create systems that learn could lead to robust, trainable, and accurate natural language processing systems.
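
To make the idea of transformation-based error-driven learning concrete, the following is a minimal sketch applied to part-of-speech tagging. It is an illustration under stated assumptions, not the implementation described in the dissertation: the single rule template ("change tag A to tag B when the previous tag is C"), the most-frequent-tag lexicon used as the imperfect initial annotator, the toy corpus, and all function names are hypothetical. The sketch also uses only the annotated corpus and ignores the larger unannotated one.

```python
# Hypothetical toy data: one flat word sequence with gold-standard tags.
WORDS = ["the", "can", "rusted", "we", "can", "go", "the", "can", "leaked"]
GOLD = ["DT", "NN", "VBD", "PRP", "MD", "VB", "DT", "NN", "VBD"]

# Assumed lexicon: each word's single most likely tag (the imperfect annotator).
LEXICON = {"the": "DT", "can": "MD", "rusted": "VBD",
           "we": "PRP", "go": "VB", "leaked": "VBD"}


def initial_tags(words, lexicon, default="NN"):
    """Imperfect initial annotation: look each word up, else use a default tag."""
    return [lexicon.get(w, default) for w in words]


def apply_rule(tags, rule):
    """Apply one transformation (from_tag, to_tag, prev_tag) to a tag sequence.

    Contexts are read from the input sequence, so a change made at one
    position does not affect the decision at the next position.
    """
    from_tag, to_tag, prev_tag = rule
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            out[i] = to_tag
    return out


def count_errors(tags, gold):
    """Number of positions where the current annotation disagrees with gold."""
    return sum(1 for t, g in zip(tags, gold) if t != g)


def learn_transformations(words, gold, lexicon, max_rules=10):
    """Greedily learn an ordered list of error-reducing transformations."""
    current = initial_tags(words, lexicon)
    rules = []
    for _ in range(max_rules):
        # Propose candidate rules only where the current annotation is wrong.
        candidates = {
            (current[i], gold[i], current[i - 1])
            for i in range(1, len(current))
            if current[i] != gold[i]
        }
        base = count_errors(current, gold)
        best_rule, best_gain = None, 0
        for rule in sorted(candidates):
            gain = base - count_errors(apply_rule(current, rule), gold)
            if gain > best_gain:
                best_rule, best_gain = rule, gain
        if best_rule is None:
            break  # no candidate reduces the error count any further
        current = apply_rule(current, best_rule)
        rules.append(best_rule)
    return rules


def tag(words, lexicon, rules):
    """Annotate new text: initial tags, then the learned rules in order."""
    tags = initial_tags(words, lexicon)
    for rule in rules:
        tags = apply_rule(tags, rule)
    return tags


rules = learn_transformations(WORDS, GOLD, LEXICON)
print(rules)
print(tag(["the", "can", "leaked"], LEXICON, rules))
```

On this toy corpus the initial annotator tags every "can" as a modal, and the only errors are the two occurrences that follow a determiner, so the learner finds a single transformation that rewrites MD as NN after DT. The order of the learned list matters in general, because each transformation is scored against the annotation produced by the ones before it.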