Term selection for information retrieval applications

J. Michael Schultz, University of Pennsylvania


The selection and identification of terms is an important part of many natural language applications. In the information retrieval domain, documents are often abbreviated to their most salient terms in order to reduce storage requirements and processing time and to make algorithms more efficient. The quality of search results directly reflects the quality of these representative features. In translingual applications, translation dictionaries must be built to bridge the gap between source and target languages, and with limited time and resources the most effective terms for translation must somehow be chosen. Techniques for term selection are also fundamental to a number of other tasks, including the automatic generation of indices, concordances, and abstracts, and the extraction of terminology. In this dissertation we investigate methods for selecting terms in the context of a number of specific tasks. As a practical test case for some of the approaches developed here, we participate in the formal evaluations of Topic Detection and Tracking (TDT). In the spirit of residual-idf, a metric which measures deviation from Poisson, we develop a sum-log-ratios metric which improves upon residual-idf in two significant ways: it incorporates document length normalization, and it is a function of the entire within-document term count distribution. Also developed here is the idea of a “universal dictionary” as a basis for translingual information retrieval tasks. In the methods section, we describe a suffix-array-based indexing scheme ideally suited to efficiently calculating within-document term counts for n-grams in very large corpora. We test our methods in a number of real-world applications. In the formal evaluations of TDT2 we show that the simple vector space model performs as well as much more complicated models.
In the context of building a “universal dictionary”, we use our method of term selection to choose a vocabulary of fewer than 10,000 terms that is essentially as effective for topic tracking as an unlimited vocabulary of over 300,000 terms. We demonstrate that this same method extends well to other applications, employing it as a novel approach to multi-word terminology and collocation extraction.
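For readers unfamiliar with residual-idf, the baseline metric named in the abstract, the following is a minimal sketch of its commonly cited formulation: the observed inverse document frequency of a term minus the inverse document frequency a Poisson model would predict from the term's total count. It is not the dissertation's code and does not implement the sum-log-ratios metric, which the abstract does not define. The function name `residual_idf` and the input format (a list of token lists) are assumptions made for illustration.

```python
import math
from collections import Counter


def residual_idf(docs):
    """Return {term: residual IDF} for a list of tokenized documents.

    Illustrative sketch only: `docs` is assumed to be a non-empty list
    of token lists; this is not the dissertation's implementation.
    """
    n_docs = len(docs)
    cf = Counter()  # collection frequency: total occurrences of each term
    df = Counter()  # document frequency: number of documents containing it
    for doc in docs:
        counts = Counter(doc)
        cf.update(counts)          # add within-document counts
        df.update(counts.keys())   # add 1 per document containing the term

    scores = {}
    for term in cf:
        # Observed IDF, from the actual document frequency.
        observed_idf = -math.log2(df[term] / n_docs)
        # Under a Poisson model with rate cf/N, the probability that a
        # document contains the term at least once is 1 - exp(-cf/N).
        rate = cf[term] / n_docs
        expected_idf = -math.log2(1.0 - math.exp(-rate))
        scores[term] = observed_idf - expected_idf
    return scores


if __name__ == "__main__":
    # "x" is bursty (4 occurrences in one document); "y" is spread evenly.
    docs = [["x", "x", "x", "x", "y"], ["y"], ["y"], ["y"]]
    for term, score in sorted(residual_idf(docs).items()):
        print(term, round(score, 3))
```

In this toy corpus both terms occur four times in total, but the term concentrated in a single document receives the higher score, which is the deviation from Poisson that the metric is meant to capture. Note that the score depends only on document frequency and total count, not on document lengths or the full within-document count distribution, which are the two limitations the abstract says the sum-log-ratios metric addresses.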

Subject Area

Linguistics|Computer science

Recommended Citation

Schultz, J. Michael, "Term selection for information retrieval applications" (2003). Dissertations available from ProQuest. AAI3109218.