Identifying transcriptional regulatory modules and networks by integrative approaches

Guang Chen, University of Pennsylvania


There is great interest in understanding the genetic program of cellular response and differentiation. However, the transcriptional regulatory networks that specify and maintain cellular function are still largely uncharted. The recent advent of high-throughput technologies provides genome-wide (omic) measurements of molecular network components at multiple levels, such as genomic sequences, mRNA expression and protein-DNA interactions. Not only does the growing availability of these omic data provide researchers with unprecedented and global views of these transcriptional regulatory systems, but it also raises the challenge of identifying and extracting biological insights from them. In this dissertation I develop and apply computational approaches to integrate these omic data for identifying transcriptional regulatory modules and networks. Comparative sequence analysis has been widely used to identify conserved transcription factor binding sites (TFBS) and has proven useful in certain cases. I design and implement a pipeline to identify conserved TFBS. In comparison to previous methods, the developed pipeline has the flexibility to (1) refine orthologous sequence alignments and (2) adjust the sequence conservation ratio based on the statistical properties of a particular TFBS. Using this pipeline, we have further expanded the mammalian CArGome with the discovery of 60 novel SRF targets, which were experimentally validated. This study illustrates the power of our comparative genomic analysis pipeline for identifying conserved TFBS. More recently, it has been shown that approaches based on a single data type are more likely to be biased, because each data source provides only partial information for unveiling transcriptional regulatory mechanisms.
To take advantage of the complementary information provided by different types of omic data, I present a Bayesian hierarchical model and a Markov chain Monte Carlo implementation that integrate gene expression data, ChIP binding data and TFBS data in a principled and robust fashion. The applications cover both unicellular and mammalian organisms under several scenarios of available data. In these applications, the predicted gene-TF interactions are shown to be very likely biologically relevant. I also demonstrate the ability to predict gene-TF interactions with reduced levels of false positives. Our fully probabilistic modeling approach for discovering regulatory networks provides a flexible framework for utilizing all available biological data, while overcoming intrinsic limitations of other available methods, such as the need for prior clustering of expression data and arbitrary parameter thresholds.

Subject Area

Biomedical research

Recommended Citation

Chen, Guang, "Identifying transcriptional regulatory modules and networks by integrative approaches" (2007). Dissertations available from ProQuest. AAI3271731.