Predicting Text Quality: Metrics for Content, Organization and Reader Interest

Louis, Annie

Files

Louis_upenngdas_0175C_10515.pdf (1.39 MB)

Degree type

Doctor of Philosophy (PhD)

Graduate group

Computer and Information Science

Subject

readability
text quality
writing quality
Computer Sciences

Copyright date

2014-08-20T20:13:00-07:00

Permalink

https://repository.upenn.edu/handle/20.500.14332/32430

Author

Louis, Annie

Abstract

When people read articles, whether news, fiction or technical writing, they almost always form perceptions about the quality of what they read. Some articles are well written and others are poorly written. This thesis explores whether such judgements can be automated so that they can be incorporated into applications such as information retrieval and automatic summarization. Text quality is not a single aspect but a combination of numerous and diverse criteria, including spelling, grammar, organization, informative nature, creative and beautiful language use, and page layout. In the education domain, comprehensive lists of such properties are outlined in the rubrics used for assessing writing. Computational methods for text quality, however, have addressed only a handful of these aspects, mainly those related to spelling, grammar and organization. In addition, some aspects of text quality may be more relevant for one genre than for another, yet previous work has placed little focus on specialized metrics based on the genre of texts. This thesis proposes new insights and techniques to address these issues. We introduce metrics that score varied dimensions of quality such as content, organization and reader interest. For content, we present two measures: specificity and verbosity level. Specificity measures the amount of detail present in a text, while verbosity captures which details are essential to include. We measure organization quality by quantifying the regularity of the intentional structure in the article and also by using the specificity levels of adjacent sentences in the text. Our reader interest metrics aim to identify engaging and interesting articles. The development of these measures is backed by the use of articles from three different genres: academic writing, science journalism and automatically generated summaries. Proper presentation of content is critical during summarization because summaries have a word limit, so our specificity and verbosity metrics are developed with this genre as the focus. The argumentation structure of academic writing lends support to the idea of using intentional structure to model organization quality. Science journalism articles convey research findings in an engaging manner and are ideally suited for the development and evaluation of measures related to reader interest.

Advisor

Ani Nenkova

Date of degree

2013-01-01

Collection

Dissertations and Theses