Date of Award: 2013
Doctor of Philosophy (PhD)
Computer and Information Science
When people read articles, whether news, fiction or technical writing, they almost always form perceptions about their quality. Some articles are well written and others are poorly written. This thesis explores whether such judgements can be automated so that they can be incorporated into applications such as information retrieval and automatic summarization.
Text quality is not a single property but a combination of numerous and diverse criteria, including spelling, grammar, organization, informativeness, creative and beautiful language use, and page layout. In the education domain, comprehensive lists of such properties are outlined in the rubrics used for assessing writing. But computational methods for text quality have addressed only a handful of these aspects, mainly those related to spelling, grammar and organization. In addition, some text quality aspects could be more relevant for one genre than for another, yet previous work has placed little focus on metrics specialized to the genre of a text.
This thesis proposes new insights and techniques to address the above issues. We introduce metrics that score varied dimensions of quality such as content, organization and reader interest. For content, we present two measures: specificity and verbosity level. Specificity measures the amount of detail present in a text, while verbosity captures which details are essential to include. We measure organization quality by quantifying the regularity of the intentional structure in the article and by using the specificity levels of adjacent sentences in the text. Our reader interest metrics aim to identify engaging and interesting articles. We develop these measures using articles from three different genres: academic writing, science journalism and automatically generated summaries. Proper presentation of content is critical during summarization because summaries have a word limit, so our specificity and verbosity metrics are developed with this genre as the focus. The argumentation structure of academic writing lends support to the idea of using intentional structure to model organization quality. Science journalism articles convey research findings in an engaging manner and are ideally suited for the development and evaluation of measures related to reader interest.
Louis, Annie, "Predicting Text Quality: Metrics for Content, Organization and Reader Interest" (2013). Publicly Accessible Penn Dissertations. 665.