Using Syntactic Information in Document Filtering: A Comparative Study of Part-of-Speech Tagging and Supertagging

Chandrasekar, R.; Srinivas, B.

Files

96_29.pdf (194.88 KB)

Penn collection

IRCS Technical Reports Series

Permalink

https://repository.upenn.edu/handle/20.500.14332/37519

Author

Chandrasekar, R.

Srinivas, B.

Abstract

Any coherent text contains significant latent information, such as syntactic structure and patterns of language use. This information can be exploited to overcome the inadequacies of keyword-based retrieval and make information retrieval more efficient. In this paper, we demonstrate quantitatively how syntactic information is useful in filtering out irrelevant documents. We also compare two different syntactic labelings, simple Part-of-Speech (POS) labeling and Supertag labeling, and show how the richer (more fine-grained) representation of supertags leads to more efficient and effective document filtering. We have implemented a system that exploits syntactic information in a flexible manner to filter documents. The system has been tested on a large collection of news sentences and achieves an F-score of 89 for filtering out irrelevant sentences. Its performance and modularity make it a promising postprocessing addition to any Information Retrieval system.
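
For readers unfamiliar with the metric quoted in the abstract, the following is a minimal sketch of how an F-score for a filtering task is commonly computed from precision and recall. The abstract does not state which F-score variant was used or how the counts were obtained, so the balanced form (beta = 1), the percentage scale, and the function name `f_score` are assumptions made purely for illustration.

```python
def f_score(true_positives, false_positives, false_negatives, beta=1.0):
    """Return the F-beta score on a 0-100 scale (hypothetical helper).

    Assumption: "positive" means a sentence correctly identified as
    irrelevant and filtered out; the report may define this differently.
    """
    if true_positives == 0:
        # No correct decisions: precision and recall are both zero.
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    beta_sq = beta * beta
    return 100.0 * (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)
```

With beta = 1 the score is the harmonic mean of precision and recall, so it is high only when the filter both discards most irrelevant sentences and rarely discards relevant ones.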

Publication date

1996-12-01

Comments

University of Pennsylvania Institute for Research in Cognitive Science Technical Report No. IRCS-96-29.

Collection

Reports