Importance Sampling of Word Patterns in DNA and Protein Sequences

Penn collection
Statistics Papers
Subject
importance sampling
biological sequence analysis
motif analysis
Biostatistics
Computational Biology
Genetics and Genomics
Statistics and Probability
Author
Chan, Hock Peng
Zhang, Nancy R.
Chen, Louis H. Y.
Abstract

The use of Monte Carlo evaluation to compute p-values of pattern counting test statistics is especially attractive when an asymptotic theory is absent or when the search sequence or the word pattern is too short for an asymptotic formula to be accurate. The drawback of applying Monte Carlo simulation directly is its inefficiency when p-values are small, which is precisely the situation of interest. In this paper, we provide a general importance sampling algorithm for efficient Monte Carlo evaluation of small p-values of pattern counting test statistics and apply it to word patterns of biological interest, in particular palindromes and inverted repeats, patterns arising from position-specific weight matrices, and co-occurrences of pairs of motifs. We also show that our importance sampling technique satisfies a log efficiency criterion.
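
As a rough illustration of the general idea only, and not the algorithm developed in the paper, the sketch below estimates a small p-value P(count >= c) for a word in a random sequence by simulating from a tilted letter distribution and reweighting each sample by its likelihood ratio. The i.i.d. background model, the tilted frequencies `q`, and the function names `count_word` and `is_pvalue` are all assumptions introduced for this example.

```python
import random


def count_word(seq, word):
    """Count occurrences of `word` in `seq`, allowing overlaps."""
    k = len(word)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == word)


def is_pvalue(word, n, c, p, q, n_sim=20000, seed=0):
    """Importance sampling estimate of P(count_word(X, word) >= c).

    X is an i.i.d. sequence of length n with letter probabilities `p`
    (the null model). Sequences are drawn from the proposal `q` instead,
    and each draw that reaches the threshold contributes its likelihood
    ratio prod_i p(x_i) / q(x_i). Both models are illustrative
    assumptions, not the construction used in the paper.
    """
    rng = random.Random(seed)
    letters = list(p)
    q_weights = [q[a] for a in letters]
    ratio = {a: p[a] / q[a] for a in letters}
    total = 0.0
    for _ in range(n_sim):
        seq = "".join(rng.choices(letters, weights=q_weights, k=n))
        if count_word(seq, word) >= c:
            weight = 1.0
            for a in seq:
                weight *= ratio[a]
            total += weight
    return total / n_sim


# Null model: uniform DNA letters. Proposal: favor the letter in the word.
p_null = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
q_tilt = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
estimate = is_pvalue("AAAAAA", n=10, c=1, p=p_null, q=q_tilt)
```

For this toy case the exact value is 0.25^5, roughly 9.8e-4, so direct simulation under the null would see the event about once in a thousand draws, whereas the tilted proposal hits it far more often and corrects for that through the weights. A simple letter tilt like this is only a heuristic choice of proposal; it carries no efficiency guarantee of the kind stated in the abstract.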

Publication date
2010-01-01
Journal title
Journal of Computational Biology
Comments
At the time of publication, author Nancy R. Zhang was affiliated with Stanford University. Currently, she is a faculty member in the Statistics Department at the University of Pennsylvania.