Topic Segmentation: Algorithms And Applications

Subject
Databases and Information Systems
Author
Reynar, Jeffrey C.
Abstract

Most documents are about more than one subject, but the majority of natural language processing algorithms and information retrieval techniques implicitly assume that every document has just one topic. The work described herein is about clues which mark shifts to new topics, algorithms for identifying topic boundaries and the uses of such boundaries once identified. A number of topic shift indicators have been proposed in the literature. We review these features, suggest several new ones and test most of them in implemented topic segmentation algorithms. Hints about topic boundaries include repetitions of character sequences, patterns of word and word n-gram repetition, word frequency, the presence of cue words and phrases and the use of synonyms. The algorithms we present use cues singly or in combination to identify topic shifts in several kinds of documents. One algorithm tracks compression performance, which is an indicator of topic shift because self-similarity within topic segments should be greater than between-segment similarity. Another technique relies on word repetition and places boundaries by minimizing word repetitions across segment boundaries. A third method compares the performance of a language model with and without knowledge of the contents of preceding sentences to determine whether a topic shift has occurred. We use the output of this algorithm in a statistical model which incorporates synonymy, bigram repetition and other features for topic segmentation. We benchmark our algorithms and compare them to algorithms from the literature using concatenations of documents, and then perform further evaluation of our techniques using a collection of news broadcasts transcribed both by annotators and using a speech recognition system. We also test the effectiveness of our algorithms for identifying both chapter boundaries in works of literature and story boundaries in Spanish news broadcasts. We suggest ways to improve information retrieval, language modeling and various natural language processing algorithms by exploiting the topic segmentation.
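
As a rough illustration of the word-repetition cue mentioned in the abstract, the sketch below picks the single boundary that minimizes the number of word types repeated across it. It is not the algorithm developed in the thesis, only a simplified stand-in: the function names, the `window` parameter, the tiny stopword list and the example sentences are all assumptions made for this illustration.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "for", "on", "is", "were", "will", "and", "to", "in"}

def word_types(sentence):
    """Return the set of lowercased content-word types in a sentence."""
    return set(re.findall(r"[a-z']+", sentence.lower())) - STOPWORDS

def shared_types(sentences, i, window=3):
    """Count word types found on both sides of a boundary placed before sentence i.

    Only the `window` sentences on each side of the boundary are considered.
    """
    left = set().union(*(word_types(s) for s in sentences[max(0, i - window):i]))
    right = set().union(*(word_types(s) for s in sentences[i:i + window]))
    return len(left & right)

def best_boundary(sentences, window=3):
    """Return the boundary index with the fewest word types repeated across it."""
    candidates = range(1, len(sentences))
    return min(candidates, key=lambda i: shared_types(sentences, i, window))

# Made-up example: three sentences about a court case, then three about weather.
sentences = [
    "The court heard the appeal on Monday.",
    "The judge said the appeal lacked merit.",
    "Lawyers for the defendant will appeal again.",
    "Heavy rain flooded several coastal towns.",
    "Rain is forecast for the coastal region all week.",
    "Residents of flooded towns were evacuated.",
]

print(best_boundary(sentences))  # prints 3: the boundary falls before the fourth sentence
```

In this toy input no content word is shared across the boundary before the fourth sentence, while every other candidate boundary splits at least one repeated word, so the minimum falls at the topic change. Placing several boundaries, or combining this cue with others such as cue phrases or synonymy, would need a more elaborate search than this single-boundary sketch.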

Date of degree
1998-08-01
Comments
University of Pennsylvania Institute for Research in Cognitive Science Technical Report No. IRCS-98-21.