Big Data for Microorganisms: Computational Approaches Leveraging Large-Scale Microbial Transcriptomic Compendia

Alexandra Jin-Bao Lee, University of Pennsylvania


Genome-wide transcriptomic data capture the molecular state of microorganisms: the expression patterns of genes in response to a condition or stimulus. With advances in high-throughput sequencing technologies, thousands of microbial transcription profiles are now publicly available. These data have been collected and integrated to form transcriptomic compendia, which are collections of diverse gene expression experiments. These compendia have been found to be a valuable resource for studying systems-level biology and for generating hypotheses. We describe the construction, benefits, and challenges of creating microbial transcriptomic compendia in Chapter 1. One challenge for compendia, which integrate many different experiments, is batch effects: technical sources of variability that can disrupt the detection of underlying biological signals of interest. In Chapter 2, we use a generative neural network to simulate gene expression compendia with varying amounts of technical variability, and we assess the ability to detect the underlying biological structure in the data after noise is added and again after batch correction is applied. We define a set of principles for how batch correction should be used in the context of these large-scale compendia. In Chapters 3 and 4, we introduce computational approaches that use compendia to improve the analysis of individual experiments and the analysis of genomic patterns, respectively. In Chapter 3, we develop a portable framework to distinguish between common and context-specific transcriptional signals, using a compendium to automatically generate a null set of expression changes. This approach allows researchers to put gene expression changes from their individual experiment of interest into the context of existing compendia of experiments.
In Chapter 4, we develop an approach that uses two dominant strain types to examine how different Pseudomonas aeruginosa genomes affect transcriptional profiles, in order to understand how traits manifest. This genome-wide approach reveals a more complete picture of how different genomes affect expression, which in turn mediates the traits present. Overall, these compendia provide a valuable resource that computational tools can leverage to extract patterns and inform research directions.

Subject Area

Genetics|Computer science|Microbiology|Artificial intelligence|Bioinformatics

Recommended Citation

Lee, Alexandra Jin-Bao, "Big Data for Microorganisms: Computational Approaches Leveraging Large-Scale Microbial Transcriptomic Compendia" (2022). Dissertations available from ProQuest. AAI29162329.