IRCS Technical Reports Series
Thesis or dissertation
Most documents are about more than one subject, but the majority of natural language processing algorithms and information retrieval techniques implicitly assume that every document has just one topic. The work described herein is about clues which mark shifts to new topics, algorithms for identifying topic boundaries and the uses of such boundaries once identified.
A number of topic shift indicators have been proposed in the literature. We review these features, suggest several new ones and test most of them in implemented topic segmentation algorithms. Hints about topic boundaries include repetitions of character sequences, patterns of word and word n-gram repetition, word frequency, the presence of cue words and phrases and the use of synonyms.
The algorithms we present use cues singly or in combination to identify topic shifts in several kinds of documents. One algorithm tracks compression performance, which is an indicator of topic shift because self-similarity within topic segments should be greater than between-segment similarity. Another technique relies on word repetition and places boundaries by minimizing word repetitions across segment boundaries. A third method compares the performance of a language model with and without knowledge of the contents of preceding sentences to determine whether a topic shift has occurred. We use the output of this algorithm in a statistical model which incorporates synonymy, bigram repetition and other features for topic segmentation.
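As a rough illustration of the word-repetition idea only, the following minimal sketch scores each candidate boundary by the number of word repetitions that span it and picks the boundary with the fewest. It is not the algorithm developed in this work: the function names, the fixed sentence window and the toy input are all assumptions made for the example, and the input is assumed to be pre-tokenized content words with function words already removed.

```python
from collections import Counter


def cross_boundary_repetition(sentences, boundary, window=2):
    """Count word repetitions that span a candidate boundary.

    `sentences` is a list of token lists; `boundary` is the index of the
    first sentence after the candidate split. Only `window` sentences on
    each side are compared (an assumption of this sketch).
    """
    left = Counter(
        w for s in sentences[max(0, boundary - window):boundary] for w in s
    )
    right = Counter(w for s in sentences[boundary:boundary + window] for w in s)
    # Multiset intersection keeps, for each word, the smaller of its two counts.
    return sum((left & right).values())


def best_boundary(sentences, window=2):
    """Return the split point with the fewest repetitions across it."""
    candidates = range(1, len(sentences))
    return min(
        candidates,
        key=lambda b: cross_boundary_repetition(sentences, b, window),
    )


# Toy input (hypothetical): two sentences about finance, then two about sport.
sentences = [
    ["stock", "market", "fell"],
    ["investors", "sold", "stock", "market"],
    ["team", "won", "football", "match"],
    ["fans", "cheered", "team", "match"],
]

print(best_boundary(sentences))
```

On this toy input no word is repeated across the split between the second and third sentences, so that split is selected. A practical segmenter would need to place several boundaries at once and to handle function words, neither of which this sketch attempts.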
We benchmark our algorithms and compare them to algorithms from the literature using concatenations of documents, and then perform further evaluation of our techniques using a collection of news broadcasts transcribed both by annotators and by a speech recognition system. We also test the effectiveness of our algorithms for identifying both chapter boundaries in works of literature and story boundaries in Spanish news broadcasts.
We suggest ways to improve information retrieval, language modeling and various natural language processing algorithms by exploiting topic segmentation.
Date Posted: 20 August 2006
University of Pennsylvania Institute for Research in Cognitive Science Technical Report No. IRCS-98-21.