Automatically annotating documents with normalized gene lists

Penn collection
Departmental Papers (CIS)
Author
Crim, Jeremiah
McDonald, Ryan
Pereira, Fernando
Abstract

Background: Document gene normalization is the problem of creating a list of unique identifiers for genes that are mentioned within a document. Automating this process has many potential applications in both information extraction and database curation systems. Here we present two separate solutions to this problem. The first is primarily based on standard pattern matching and information extraction techniques. The second and more novel solution uses a statistical classifier to recognize valid gene matches from a list of known gene synonyms.

Results: We compare the results of the two systems, analyze their merits, and argue that the classification-based system is preferable on grounds of performance, simplicity, and robustness. Our best systems attain a balanced precision and recall in the range of 74%–92%, depending on the organism.

Publication date
2005-05-24
Comments
Reprinted from BMC Bioinformatics, Volume 6 (Suppl 1), Article Number S13, May 24, 2005, 7 pages.