Integrating gene expression signals with bounded collection grammars

Jonathan Schug, University of Pennsylvania


Tissue-specific expression is one of the most obvious and important patterns of gene expression in complex eukaryotes. Every cell in an organism has the same set of genes, yet only a subset of those genes is expressed in a given cell type. This regulation is accomplished in large part by transcription factors (TFs) that bind to short, degenerate genomic sequences, called binding sites, near the genes they regulate. TFs work in combination to provide precise regulation of gene expression. Understanding the combinatorics of TF regulation is still an open problem in post-genomic biology. In this dissertation we develop and apply a bounded collection grammar (BCG) formalism, similar to permutation grammars, and a machine-learning algorithm to model, search for, and learn the combinations and arrangements of TFs that regulate tissue-specific expression. Our machine-learning algorithm optimizes the free parameters in a grammar, such as spacing and scores, to identify the best possible performance of a rule. This system provides a unique combination of modeling power and learning ability. To identify tissue-specific genes from tissue surveys of gene expression, we apply Shannon entropy, Hg, to quantify the overall specificity of a gene, then develop and apply a new entropy-based metric, Qg|t, to quantify specificity to a particular tissue, t. We take a stepwise approach to promoter analysis by first studying specific and ubiquitous promoters in general to determine global characteristics. We then study the genes specific to a particular tissue in this global context. Our analysis of mouse and human promoters ranked by Hg identifies the TATA box and CpG island as the major determinants of tissue specificity. We find that there are functional correlates of the TATA/CpG class of a gene's promoter. We identified TFs enriched in liver promoters and studied their arrangements to refine and extend earlier results, identifying one known rule and many new rules. Finally, we performed sequence analysis of ChIP-chip experiments to identify the companion factors of the ChIP-chip target factor that help define the active sites in the direct target genes, demonstrating that our machine-learning system can also contribute to the understanding of other regulatory events.
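As an illustration of the entropy-based specificity measures mentioned in the abstract, the following is a minimal sketch in Python. It assumes that a gene's expression levels across tissues are normalized to relative expression p(t|g), and that Hg is the Shannon entropy of that distribution in bits. The abstract does not give the definition of the tissue-specific metric, so the form used here, Q = Hg - log2 p(t|g), is an assumption made for illustration only. The function names and the example tissue values are hypothetical.

```python
import math


def relative_expression(levels):
    """Normalize expression levels (tissue -> level) to sum to 1."""
    total = sum(levels.values())
    if total <= 0:
        raise ValueError("expression levels must have a positive sum")
    return {tissue: level / total for tissue, level in levels.items()}


def shannon_entropy(levels):
    """Overall specificity H_g in bits: 0 for a gene expressed in a
    single tissue, log2(N) for uniform expression across N tissues."""
    p = relative_expression(levels)
    return -sum(x * math.log2(x) for x in p.values() if x > 0)


def tissue_specificity(levels, tissue):
    """Specificity to one tissue, ASSUMED here to be H_g - log2 p(t|g).
    Lower values indicate expression concentrated in that tissue."""
    p = relative_expression(levels)
    if p[tissue] == 0:
        return math.inf  # gene not expressed in this tissue at all
    return shannon_entropy(levels) - math.log2(p[tissue])


# Hypothetical expression values for two genes across four tissues.
ubiquitous = {"liver": 5.0, "brain": 5.0, "heart": 5.0, "kidney": 5.0}
liver_only = {"liver": 8.0, "brain": 0.0, "heart": 0.0, "kidney": 0.0}

h_ubiquitous = shannon_entropy(ubiquitous)
h_liver_only = shannon_entropy(liver_only)
q_ubiquitous_liver = tissue_specificity(ubiquitous, "liver")
q_liver_only_liver = tissue_specificity(liver_only, "liver")
```

Under these assumptions, a ubiquitously expressed gene has the maximum entropy for the number of tissues surveyed, while a gene expressed in one tissue has entropy zero. Ranking genes by the tissue-specific value in ascending order would then place the genes most specific to that tissue first.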

Subject Area

Computer science

Recommended Citation

Schug, Jonathan, "Integrating gene expression signals with bounded collection grammars" (2005). Dissertations available from ProQuest. AAI3179807.