IRCS Technical Reports Series

Document Type

Thesis or dissertation

Date of this Version

March 1998


University of Pennsylvania Institute for Research in Cognitive Science Technical Report No. IRCS-98-15.


This thesis demonstrates that several important kinds of natural language ambiguities can be resolved to state-of-the-art accuracies using a single statistical modeling technique based on the principle of maximum entropy.

We discuss the problems of sentence boundary detection, part-of-speech tagging, prepositional phrase attachment, natural language parsing, and text categorization under the maximum entropy framework. In practice, we have found that maximum entropy models offer the following advantages:

State-of-the-art Accuracy: The probability models for all of the tasks discussed perform at or near state-of-the-art accuracies, or outperform competing learning algorithms when trained and tested under similar conditions. Methods that outperform those presented here require much more supervision, in the form of additional human involvement or additional supporting resources.

Knowledge-Poor Features: The facts used to model the data, or features, are linguistically very simple, or "knowledge-poor", yet they succeed in approximating complex linguistic relationships.

Reusable Software Technology: The mathematics of the maximum entropy framework are essentially independent of any particular task, and a single software implementation can be used for all of the probability models in this thesis.
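As a rough illustration of the reuse claim above, the following minimal sketch trains a conditional log-linear (maximum entropy) model with one generic routine and applies it to two toy tasks that differ only in their feature functions. Everything in it is an assumption made for illustration: the names train_maxent, predict, sbd_features and tag_features, the toy data, and the use of plain gradient ascent on the conditional log-likelihood (chosen for brevity, and not necessarily the estimation procedure used in the thesis).

```python
import math
from collections import defaultdict


def predict(weights, outcomes, feature_fn, context):
    """Return p(outcome | context) for a log-linear model with binary features."""
    scores = {o: sum(weights.get(f, 0.0) for f in feature_fn(context, o))
              for o in outcomes}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {o: math.exp(s - top) for o, s in scores.items()}
    z = sum(exps.values())
    return {o: e / z for o, e in exps.items()}


def train_maxent(examples, outcomes, feature_fn, epochs=200, lr=0.5):
    """Task-independent trainer: gradient ascent on conditional log-likelihood.

    examples   -- list of (context, gold_outcome) pairs
    feature_fn -- maps (context, outcome) to a list of active feature names
    """
    weights = defaultdict(float)
    for _ in range(epochs):
        grad = defaultdict(float)
        for context, gold in examples:
            probs = predict(weights, outcomes, feature_fn, context)
            for f in feature_fn(context, gold):
                grad[f] += 1.0                # observed feature count
            for o in outcomes:
                for f in feature_fn(context, o):
                    grad[f] -= probs[o]       # expected feature count
        for f, g in grad.items():
            weights[f] += lr * g / len(examples)
    return weights


# Toy task 1 (hypothetical data): is a period after this token a sentence boundary?
SBD_OUTCOMES = ["boundary", "not_boundary"]

def sbd_features(prev_token, outcome):
    return [f"prev={prev_token}&y={outcome}"]

sbd_data = [("Mr", "not_boundary"), ("Dr", "not_boundary"),
            ("home", "boundary"), ("today", "boundary")]
sbd_weights = train_maxent(sbd_data, SBD_OUTCOMES, sbd_features)
sbd_probs = predict(sbd_weights, SBD_OUTCOMES, sbd_features, "Mr")

# Toy task 2 (hypothetical data): tag a word using only its three-letter suffix.
TAG_OUTCOMES = ["NN", "VBG"]

def tag_features(word, tag):
    return [f"suffix={word[-3:]}&y={tag}"]

tag_data = [("running", "VBG"), ("walking", "VBG"), ("dog", "NN"), ("cat", "NN")]
tag_weights = train_maxent(tag_data, TAG_OUTCOMES, tag_features)
tag_probs = predict(tag_weights, TAG_OUTCOMES, tag_features, "jumping")
```

Only the feature function and the outcome set change between the two tasks; the training and prediction code is shared, which is the sense in which a single implementation can serve several probability models.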

The experiments in this thesis suggest that experimenters can obtain state-of-the-art accuracies on a wide range of natural language tasks, with little task-specific effort, by using maximum entropy probability models.



Date Posted: 20 August 2006