Date of Award
Doctor of Philosophy (PhD)
Computer and Information Science
Mitchell P. Marcus
Automatic summarization has advanced greatly in the past few decades. However, there remains a huge gap between the content quality of human and machine summaries. There is also a large disparity between the performance of current systems and that of the best possible automatic systems. In this thesis, we explore how the content quality of machine summaries can be improved.
First, we introduce a supervised model to predict the importance of words in the input sets, based on a rich set of features. Our model is superior to prior methods in identifying words used in human summaries (i.e., summary keywords). We show that a modular extractive summarizer using these estimates of word importance can generate summaries comparable to those of state-of-the-art systems. Among the features we propose, we highlight global knowledge, which estimates word importance based on information independent of the input. In particular, we explore two kinds of global knowledge: (1) important categories mined from dictionaries, and (2) intrinsic importance of words. We show that global knowledge is very useful in identifying summary keywords that have low frequency in the input.
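To make the idea of a modular extractive summarizer driven by word-importance estimates concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the summarizer from the thesis: the function name `greedy_summary`, the greedy selection rule, the averaging of weights over sentence length, and the toy weights are all hypothetical. In the thesis the weights would come from the supervised model; here they are simply passed in as a dictionary.

```python
def greedy_summary(sentences, word_weight, max_words=100):
    """Greedily pick sentences whose not-yet-covered words carry the most weight.

    `word_weight` maps a lowercased word to an importance estimate.
    It is assumed to come from some upstream model (hypothetical here).
    """
    chosen, covered, length = [], set(), 0
    remaining = list(sentences)

    def gain(sent):
        tokens = sent.lower().split()
        new_words = {t for t in tokens if t not in covered}
        # Normalize by sentence length so long sentences are not favored.
        return sum(word_weight.get(t, 0.0) for t in new_words) / max(len(tokens), 1)

    while remaining:
        best = max(remaining, key=gain)
        if gain(best) <= 0:
            break  # nothing left adds important, uncovered words
        remaining.remove(best)
        n_words = len(best.split())
        if length + n_words > max_words:
            continue  # skip sentences that would exceed the length budget
        chosen.append(best)
        covered.update(best.lower().split())
        length += n_words
    return chosen


# Toy input: the weights below are invented for illustration only.
toy_sentences = [
    "storm hits coast",
    "storm hits coast again",
    "officials order evacuation",
    "the the the",
]
toy_weights = {"storm": 0.9, "coast": 0.8, "evacuation": 0.7, "officials": 0.3}
toy_result = greedy_summary(toy_sentences, toy_weights, max_words=6)
```

Because the summarizer only consumes a word-to-weight mapping, the importance model can be swapped without touching the selection step, which is the sense of "modular" assumed in this sketch.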
Second, we present a new framework of system combination for multi-document summarization. This is motivated by our observation that different systems generate very different summaries. For each input set, we generate candidate summaries by combining whole sentences produced by different systems. We show that the oracle summary among these candidates is much better than the output of any of the systems that we have combined. We then introduce a support vector regression model to select among these candidates. The features we employ in this model capture the informativeness of a summary based on the input documents, the outputs of different systems, and global knowledge. Our model achieves considerable improvement over the systems that we have combined while generating summaries up to a certain length. Furthermore, we study which factors affect the success of system combination. Experiments show that it is important for the combined systems to have similar performance.
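The candidate-generation and oracle steps described above can be sketched in a few lines of Python. This is a toy illustration under assumptions, not the procedure used in the thesis: the function names, the exhaustive enumeration of sentence subsets (exponential, so only workable for tiny pools), and the use of unigram recall against a reference as the quality measure are all hypothetical stand-ins. At test time no reference is available, so a learned scorer (support vector regression in the thesis) would take the place of `unigram_recall`.

```python
from itertools import combinations


def candidate_summaries(system_outputs, max_words=100):
    """Combine whole sentences from several systems into candidate summaries."""
    # Pool unique sentences across systems, keeping first-seen order.
    pool = []
    for summary in system_outputs:
        for sent in summary:
            if sent not in pool:
                pool.append(sent)
    candidates = []
    for k in range(1, len(pool) + 1):
        for combo in combinations(pool, k):
            if sum(len(s.split()) for s in combo) <= max_words:
                candidates.append(list(combo))
    return candidates


def unigram_recall(candidate, reference):
    """Fraction of reference word types that appear in the candidate."""
    ref = set(reference.lower().split())
    cand = set(" ".join(candidate).lower().split())
    return len(ref & cand) / len(ref) if ref else 0.0


def oracle(candidates, reference):
    """Best candidate according to the reference-based measure."""
    return max(candidates, key=lambda c: unigram_recall(c, reference))


# Invented toy data: two system outputs and one human reference.
system_a = ["storm hits coast", "power is out"]
system_b = ["officials order evacuation", "power is out"]
reference = "storm forces evacuation of coast"

cands = candidate_summaries([system_a, system_b], max_words=6)
best = oracle(cands, reference)
```

In this toy case the oracle candidate mixes one sentence from each system and scores higher than either system's own output, which mirrors the observation that motivates combination.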
Hong, Kai, "Content Selection in Multi-Document Summarization" (2015). Publicly Accessible Penn Dissertations. 1765.