Linear structure models for eukaryotic gene prediction

Axel E Bernal, University of Pennsylvania

Abstract

Computational gene prediction plays an important role in finding genes in genomic DNA. Although improvement has been steady, the state of the art in automated prediction is not accurate enough to correctly predict all, or even most, genes, especially in complex genomes. However, with genomic information being created at an ever-increasing rate, it is worth investigating new approaches that can efficiently integrate different types of genomic evidence with complex statistical dependencies, using discriminative learning to directly maximize annotation accuracy. Among discriminative learning methods, large-margin classifiers have become prominent because of the success of support vector machines in many classification tasks.

Here we present a family of gene predictors that use a linear structure model based on a conditional random field that is trained discriminatively with an online large-margin algorithm. As proof of concept, we first developed CRAIG, an ab initio gene predictor for eukaryotic organisms. With CRAIG, we pioneered the idea of using global discriminative learning to trade off both content and signal feature weights simultaneously to directly maximize annotation accuracy. CRAIG significantly outperformed the best ab initio gene predictors available for H. sapiens on relevant benchmark data sets, with a relative mean improvement of 10.9% over Augustus, the previously best ab initio genefinder for human.

These positive initial results motivated us to extend the CRAIG base model to include comparative features. The resulting gene predictor, named nCRAIG, is a de novo genefinder that can globally combine information from multiple alignments to informant genomes. nCRAIG prediction results on the ENCODE test regions are comparable to those of the best available de novo gene predictors for H. sapiens.

We also targeted the problem of automated gene-model curation. The gene predictor eCRAIG is an ensemble-type genefinder for automated curation that constructs gene models by integrating multiple sources of transcriptional and translational evidence, including annotations made by other genefinders. Here, we achieved significant improvements over the best ensemble predictors available for H. sapiens, C. elegans and A. thaliana by using a novel set of non-local features. In particular, eCRAIG achieved a relative mean improvement of 5.1% over Jigsaw, the best published combiner-type predictor in all our experiments. We also defined a set of protein-level extrinsic features which, when added to the eCRAIG model, resulted in significant improvements in first coding exon prediction of secretory proteins in T. gondii and P. falciparum. The protein-level features used in this new predictor, named eCRAIG+SP, are based on signal peptide scores computed by SignalP v.3.

Recent developments in next-generation sequencing (NGS) techniques such as RNA-Seq, and the subsequent availability of massive amounts of transcriptome sequencing data, have opened new research directions for the important problem of alternative splicing prediction. In this respect, we explored uses of the CRAIG suite together with RNA-Seq read data for isoform discovery. We present preliminary results for RNA-Seq based UTR and isoform predictions in selected gene loci with experimentally confirmed UTR spliced variants. We also provide specific recommendations for future research directions that use RNA-Seq.

Subject Area

Computer Science

Recommended Citation

Axel E Bernal, "Linear structure models for eukaryotic gene prediction" (January 1, 2012). Dissertations available from ProQuest. Paper AAI3508970.
http://repository.upenn.edu/dissertations/AAI3508970
