Structures and distributions in morphology learning

Erwin Chan, University of Pennsylvania


One of the great challenges in linguistics and cognitive science is to understand the nature of the mental representation of language. The precise mechanisms of the mind are unknown, but they can be modeled through observation and experimentation. By viewing the mind as a computational device that receives input (primary linguistic data) and produces output (the development of grammatical speech) during language acquisition, one can reason about what representations and algorithms must be internal to the learner. In this thesis, I investigate the acquisition of morphology. The principal challenges are how to learn a theory from sparse data, and how to do so in a manner that explains the developmental processes observed in child language acquisition. The main idea underlying this work is that considering the different aspects of language acquisition together places strong constraints on which representations and algorithms internal to the learner are cognitively plausible. To develop a model of morphology acquisition, I pursue three lines of work.
First, I formulate a cognitively oriented computational framework for studying language acquisition that consists of four components: the linguistic representation, the statistical distribution of the input data, the observed behavior of the human learner, and the performance of the learning algorithm. All four components and their interactions are important for understanding language acquisition.
Second, I examine the statistical distributions of morphology in naturally occurring corpora and discuss their implications for acquisition and for theories of morphology. The Zipfian distribution of morphological inflections favors a rule-based model of morphology, in which rules are learned one at a time by relating them to the morphological base. This provides an explanation of children's incremental acquisition of morphology.
Third, to provide empirical support for this theory of acquisition, I implement unsupervised algorithms for the induction of morphology and lexical categories from text corpora. The algorithms learn a rule-based model of morphology, called the base-and-transforms model, which consists of lexical categories, morphological base forms, and rules that convert base forms to other inflections. Morphological base forms play an important role in bootstrapping the acquisition of morphological relations, and they simplify lexical category induction through distributional analysis.
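
The idea of rules that convert base forms to other inflections can be pictured with a small sketch. This is only an illustration under stated assumptions, not the dissertation's learning algorithm: the function names `apply_transform` and `paradigm`, the rule table `VERB_RULES`, and the encoding of a rule as a (suffix to remove, suffix to add) pair are all hypothetical, and the rules are written by hand here rather than induced from a corpus.

```python
from typing import Dict, Tuple

# Hypothetical encoding: a transform strips one suffix from the base
# and appends another. This encoding is an assumption for illustration.
Rule = Tuple[str, str]  # (suffix_to_remove, suffix_to_add)


def apply_transform(base: str, remove: str, add: str) -> str:
    """Convert a base form to an inflected form with one suffix rule."""
    if not base.endswith(remove):
        raise ValueError(f"rule -{remove}/+{add} does not apply to {base!r}")
    stem = base[: len(base) - len(remove)]
    return stem + add


# Hand-written toy rules for one lexical category (English verbs).
# In an unsupervised setting these would be induced, not listed by hand.
VERB_RULES: Dict[str, Rule] = {
    "3sg": ("", "s"),
    "past": ("", "ed"),
    "progressive": ("", "ing"),
}


def paradigm(base: str, rules: Dict[str, Rule]) -> Dict[str, str]:
    """Generate every inflection of a base by applying each rule to it."""
    return {
        label: apply_transform(base, remove, add)
        for label, (remove, add) in rules.items()
    }


if __name__ == "__main__":
    print(paradigm("walk", VERB_RULES))
    # A rule that removes material from the base before adding a suffix:
    print(apply_transform("bake", "e", "ing"))  # baking
```

In this toy representation every inflected form is related to the base alone, never to another inflected form, which is the property that lets rules be added one at a time.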

Subject Area

Artificial intelligence|Computer science

Recommended Citation

Chan, Erwin, "Structures and distributions in morphology learning" (2008). Dissertations available from ProQuest. AAI3328537.