Global Discriminative Learning for Higher-Accuracy Computational Gene Prediction

Penn collection
Departmental Papers (CIS)
Subject
CRAIG (CRF-based ab initio genefinder)
CRF (conditional random fields)
HMM (hidden Markov model)
MIRA (margin infused relaxed algorithm)
PWM (position weight matrices)
SVM (support vector machines)
TIS (translation initiation site)
WAM (weight array model)
WWAM (windowed weight array model)
Author
Bernal, Axel
Crammer, Koby
Hatzigeorgiou, Artemis
Abstract

Most ab initio gene predictors use a probabilistic sequence model, typically a hidden Markov model, to combine separately trained models of genomic signals and content. By combining separate models of relevant genomic features, such gene predictors can exploit small training sets and incomplete annotations, and can be trained fairly efficiently. However, that type of piecewise training does not optimize prediction accuracy and has difficulty in accounting for statistical dependencies among different parts of the gene model. With genomic information being created at an ever-increasing rate, it is worth investigating alternative approaches in which many different types of genomic evidence, with complex statistical dependencies, can be integrated by discriminative learning to maximize annotation accuracy. Among discriminative learning methods, large-margin classifiers have become prominent because of the success of support vector machines (SVM) in many classification tasks. We describe CRAIG, a new program for ab initio gene prediction based on a conditional random field model with semi-Markov structure that is trained with an online large-margin algorithm related to multiclass SVMs. Our experiments on benchmark vertebrate datasets and on regions from the ENCODE project show significant improvements in prediction accuracy over published gene predictors that use intrinsic features only, particularly at the gene level and on genes with long introns.

Publication date
2007-03-16
Comments
© 2007 Bernal et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Reprinted from PLOS Computational Biology, Volume 3 Issue 3, e54. Publisher URL: http://dx.doi.org/10.1371/journal.pcbi.0030054