Pattern discovery in biological data sets

Stanislav Plamenov Angelov, University of Pennsylvania


In recent years, we have seen a rapid increase in the DNA and protein data available from various genome sequencing projects. Such data is carefully studied for features reused by nature in order to understand the mechanisms of life. Many of these features are expressed as sequence patterns. Efficient computational methods for discovering biologically significant motifs are therefore highly desirable, as they provide researchers with new insights into biological processes, causes of disease, and the evolution of life. There are two main approaches to extracting knowledge from sequence data. The first compares newly acquired data with data that may already be annotated, under the assumption that data similarity implies functional similarity. The second mines the data for frequently occurring or surprising patterns. Such patterns are unlikely to occur at random and pinpoint candidates for further laboratory investigation. In this thesis, we follow both approaches to extract useful information from biological data sets such as DNA and protein sequences, as well as microarray-based gene expression profiles. Our contributions include linear-time and near-linear-time algorithms to enumerate short DNA substrings that contain evolutionary history, efficient algorithms for the design of composite patterns with application to PCR, and new techniques for automated protein domain discovery using correlation clustering. We also give fast exact and approximation methods for nonparametric analysis of gene expression data using isotonic regression. In addition to these theoretical results, we implement our methods and analyze the findings on real biological data.

Recommended Citation

Angelov, Stanislav Plamenov, "Pattern discovery in biological data sets" (2007). Dissertations available from ProQuest. AAI3260873.