Syntactic form and discourse function in natural language generation
Previous research has shown that certain discourse conditions are necessary for the felicitous use of four non-canonical syntactic constructions in English: topicalizations, left-dislocations, wh-clefts, and it-clefts. However, the distribution of these forms does not correlate one-to-one with the presence of these necessary conditions, so speakers must choose to use these constructions for other reasons. Additionally, a natural language generation (NLG) algorithm that selects these statistically rare forms based only on these conditions will overgenerate. If it selects clausal word order based only on frequency, however, these forms will never be selected or will be used in meaningless ways. The purpose of this dissertation is to devise a more complete model of when human speakers generate these constructions, in order to further understanding of syntactic form selection and to better characterize these forms' conditions of use for purposes of NLG. The model of syntactic choice presented explicitly ties the goals of the communicative agent to the linguistic forms selected to achieve those goals. Three types of communicative goals that speakers achieve through the use of non-canonical syntax are argued for: (1) attention marking, (2) discourse relation, and (3) information-structure focus disambiguation. The evidence supporting the model is based on naturally occurring tokens from a corpus of spontaneous oral discourse. This same corpus, annotated with low-level properties of the discourse context surrounding utterances with non-canonical word order, is then used to train a statistical model that can approximate some aspects of the theoretical model. The statistical model supports the claim that communicative goals of signaling discourse relations do correlate significantly with the use of particular non-canonical forms.
The statistical model is also used as a probabilistic classifier, which could serve as a stochastic method for selecting syntactic form based on discourse context as part of a natural language generation system. The probabilistic classifier shows improvement over a naive classifier when applied to training data. It is a first attempt to base syntactic form selection on more than frequency counts alone, incorporating aspects of the semantic content of the surrounding discourse context as a basis for using a particular form.
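The contrast between a context-sensitive probabilistic classifier and a frequency-only baseline can be illustrated with a minimal sketch. The abstract does not specify the statistical model or the annotated features, so everything below is an assumption made for illustration: the feature names (`relation`, `referent`), the toy data, and the use of a small categorical naive Bayes model as a stand-in. The "naive classifier" is taken here to mean a baseline that always picks the most frequent form, which is also an assumption.

```python
from collections import Counter, defaultdict
import math

# Hypothetical toy data, invented for illustration only (not from the corpus).
# Each item pairs assumed discourse-context features with a syntactic form.
TRAIN = [
    ({"relation": "contrast", "referent": "old"}, "topicalization"),
    ({"relation": "contrast", "referent": "old"}, "topicalization"),
    ({"relation": "elaboration", "referent": "new"}, "canonical"),
    ({"relation": "elaboration", "referent": "old"}, "canonical"),
    ({"relation": "narration", "referent": "new"}, "canonical"),
    ({"relation": "correction", "referent": "old"}, "it-cleft"),
    ({"relation": "correction", "referent": "new"}, "it-cleft"),
    ({"relation": "narration", "referent": "old"}, "canonical"),
]


def train(examples):
    """Collect label counts and per-label feature-value counts."""
    label_counts = Counter(label for _, label in examples)
    feature_counts = defaultdict(Counter)  # (label, feature name) -> value counts
    values = defaultdict(set)              # feature name -> observed values
    for feats, label in examples:
        for name, value in feats.items():
            feature_counts[(label, name)][value] += 1
            values[name].add(value)
    return label_counts, feature_counts, values


def posterior(model, feats):
    """Return P(form | features) under naive Bayes with add-one smoothing."""
    label_counts, feature_counts, values = model
    total = sum(label_counts.values())
    log_scores = {}
    for label, n in label_counts.items():
        score = math.log(n / total)
        for name, value in feats.items():
            count = feature_counts[(label, name)][value]
            score += math.log((count + 1) / (n + len(values[name])))
        log_scores[label] = score
    # Normalize in a numerically stable way.
    top = max(log_scores.values())
    unnorm = {label: math.exp(s - top) for label, s in log_scores.items()}
    z = sum(unnorm.values())
    return {label: v / z for label, v in unnorm.items()}


def classify(model, feats):
    """Pick the most probable form given the discourse context."""
    probs = posterior(model, feats)
    return max(probs, key=probs.get)


def majority_baseline(examples):
    """Frequency-only baseline: always choose the most common form."""
    return Counter(label for _, label in examples).most_common(1)[0][0]


def accuracy(predict, examples):
    """Fraction of examples for which predict(features) matches the label."""
    return sum(predict(f) == y for f, y in examples) / len(examples)


if __name__ == "__main__":
    model = train(TRAIN)
    baseline_form = majority_baseline(TRAIN)
    print("baseline:", accuracy(lambda f: baseline_form, TRAIN))
    print("classifier:", accuracy(lambda f: classify(model, f), TRAIN))
```

On this toy data the frequency-only baseline never selects a non-canonical form, which mirrors the problem the abstract describes for purely frequency-based selection. Scoring both on the training items, as done here, says nothing about performance on unseen discourse.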
Creswell, Cassandre Yvonne, "Syntactic form and discourse function in natural language generation" (2003). Dissertations available from ProQuest. AAI3087389.