Exploring Linguistic Constraints in NLP Applications

Zhao, Qiuye

Files

Zhao_upenngdas_0175C_11324.pdf (1.68 MB)

Degree type

Doctor of Philosophy (PhD)

Graduate group

Computer and Information Science

Subject

Chinese word segmentation
closed-class words
long-tail distribution
Morphology learning
natural language processing
Unsupervised POS tagging
Computer Sciences

Copyright date

2015-11-16

Permalink

https://repository.upenn.edu/handle/20.500.14332/28343

Author

Zhao, Qiuye

Abstract

The key argument of this dissertation is that the success of a Natural Language Processing (NLP) application depends on a proper representation of the corresponding linguistic problem. This theme is raised in a context where recent progress in our field is widely credited to the effective use of strong engineering techniques. However, the intriguing power of highly lexicalized models shown in many NLP applications is not only an achievement of developments in machine learning; it would also have been impossible without the extensive hand-annotated data resources that have been made available, which were originally built with very deep linguistic considerations. More specifically, we explore three linguistic aspects in this dissertation: the distinction between closed-class and open-class words, long-tail distributions in vocabulary study, and determinism in language models. The first two aspects are studied in unsupervised tasks, namely unsupervised part-of-speech (POS) tagging and morphology learning, and the last one is studied in supervised tasks, namely English POS tagging and Chinese word segmentation. Each linguistic aspect under study manifests itself in a different way to help improve performance or efficiency in some NLP application.

Advisor

Mitchell P. Marcus

Date of degree

2014-01-01

Collection

Dissertations and Theses