IRCS Technical Reports Series
Date of this Version
In this paper, we describe a system called Glean, which is predicated on the idea that any coherent text contains significant latent information, such as syntactic structure and patterns of language use, which can be used to enhance the performance of Information Retrieval systems. We propose an approach to information retrieval that makes use of syntactic information obtained using a tool called a supertagger. A supertagger is used on a corpus of training material to semi-automatically induce patterns that we call augmented-patterns. We show how these augmented patterns may be used along with a standard Web search engine or an IR system to retrieve information, and to identify relevant information and filter out irrelevant items. We describe an experiment in the domain of official appointments, where such patterns are shown to reduce the number of potentially irrelevant documents by upwards of 80%.
Date Posted: 13 September 2006
University of Pennsylvania Institute for Research in Cognitive Science Technical Report No. IRCS-96-31.