Evaluating grammar formalisms for applications to natural language processing and biological sequence analysis

David Chiang, University of Pennsylvania

Abstract

Grammars are gaining importance in statistical natural language processing and computational biology as a means of encoding theories and structuring algorithms. But one serious obstacle to applications of grammars is that formal language theory traditionally classifies grammars according to their weak generative capacity (WGC)—what sets of strings they generate—and tends to ignore strong generative capacity (SGC)—what sets of structural descriptions they generate—even though the latter is more relevant to applications. This dissertation develops and demonstrates, for the first time, a framework for carrying out rigorous comparisons of grammar formalisms in terms of their usefulness for applications. We do so by adopting Miller's view of SGC as pertaining not directly to structural descriptions but their interpretations in particular domains; and, following Joshi et al., by appropriately constraining the grammars and interpretations we consider. We then consider three areas of application. The first area is statistical parsing. We find that, in this domain, attempts to increase the SGC of a formalism can often be compiled back into the simpler formalism, gaining nothing. But this suggests a new view of current parsing models as compiled versions of grammars from richer formalisms. We discuss the implications of this view and its implementation in a probabilistic tree-adjoining grammar model, with experimental results on English and Chinese. For our other two applications, by contrast, we can readily increase the SGC of a formalism without increasing its computational complexity. For natural language translation, we discuss the formal, linguistic, and computational properties of a formalism that is more powerful than those currently proposed for statistical machine translation systems. Finally, we explore the application of formal grammars to modeling secondary/tertiary structures of biological sequences. 
We show how additional SGC can be used to extend models to take more complex structures into account, paying special attention to the technique of intersection, which has drawn comparatively little attention in computational linguistics. These results should pave the way for theoretical research to pursue results that are more directed towards applications, and for practical research to explore the use of advanced grammar formalisms more easily.

Subject Area

Computer science

Recommended Citation

Chiang, David, "Evaluating grammar formalisms for applications to natural language processing and biological sequence analysis" (2004). Dissertations available from ProQuest. AAI3137994.
https://repository.upenn.edu/dissertations/AAI3137994
