Statistical Methods for Multi-Omics Inference from Single Cell Transcriptome

Zilu Zhou, University of Pennsylvania


This thesis comprises three sections of research in statistical genomics and computational biology. Chapter 1 and Chapter 2 describe two statistical methods for multi-omics inference from the single cell transcriptome, representing the theme of this thesis. Chapter 3 describes a side project on copy number variation detection in a large biobank database.

Part 1: Although scRNA-seq is now ubiquitously adopted in studies of intratumor heterogeneity, detection of somatic mutations and inference of clonal membership from scRNA-seq is currently unreliable. We propose DENDRO, an analysis method for scRNA-seq data that detects genetically distinct subclones, assigns each single cell to a subclone, and reconstructs the phylogenetic tree describing the tumor’s evolutionary history. DENDRO utilizes information from single nucleotide mutations in transcribed regions and accounts for technical noise and expression stochasticity at the single cell level. The accuracy of DENDRO was benchmarked on spike-in datasets and on scRNA-seq data with known subpopulation structure. We applied DENDRO to delineate subclonal expansion in a mouse melanoma model in response to immunotherapy, highlighting the role of neoantigens in treatment response. We also applied DENDRO to primary and lymph-node metastasis samples in breast cancer, where the new approach allowed us to better understand the relationship between genetic and transcriptomic intratumor variation.

Part 2: Recent technological advances allow the simultaneous profiling, across many cells in parallel, of multiple omics features in the same cell. In particular, high-throughput quantification of the transcriptome and a selected panel of cell surface proteins in the same cell is now feasible through the REAP-seq and CITE-seq protocols. Yet, due to technological barriers and cost considerations, most single cell studies, including the Human Cell Atlas (HCA) project, quantify the transcriptome only and do not have cell-matched measurements of relevant surface proteins that can serve as integral markers of cellular function and targets for therapeutic intervention. Here we propose cTP-net (single cell Transcriptome to Protein prediction with deep neural network), a transfer learning approach based on deep neural networks that imputes surface protein abundances for scRNA-seq data. Through comprehensive benchmark evaluations and applications to HCA and AML data sets, we show that cTP-net outperforms existing methods and can transfer information from training data to accurately impute 24 immunophenotype markers, enabling a more detailed characterization of cellular state and cellular phenotypes than transcriptome measurements alone. For model training, cTP-net relies on the accumulating public data of cells with paired transcriptome and surface protein measurements.

Part 3: Copy number variations (CNVs) are gains and losses of DNA segments that are highly associated with multiple diseases. The Penn Medicine BioBank stores SNP-array and NGS data for more than 10,000 individuals across ethnicities and conditions, providing a rich resource for CNV discovery and analysis. This experimental design is well suited to the Integrated Copy Number Variation caller (iCNV), a CNV detection tool that I developed for my master's thesis. The distinguishing features of iCNV include platform-specific normalization, use of allele-specific reads from sequencing, and integration of matched NGS and SNP-array data through a hidden Markov model (HMM). We applied iCNV to the Penn Medicine BioBank data set, calling CNVs in more than 10,000 individuals (~2000 AFR, ~8000 EUR) with different phenotypes. iCNV detected on average 34.1 deletions and 11.3 duplications per EUR sample, and 38 deletions and 10.6 duplications per AFR sample. The iCNV calls show greatly improved detection sensitivity and specificity compared with single-platform detection methods. The Penn Medicine BioBank CNV sets produced by iCNV provide a rich database for researchers to study the relationship between disease phenotypes and CNVs across ethnicities and conditions.
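As an illustration of how an HMM can integrate two platforms, the following is a minimal sketch and not the iCNV implementation. It assumes three hidden states (deletion, normal, duplication), a normalized intensity score per platform with Gaussian emissions, conditional independence of the two platforms given the state, and a single "stay" probability for transitions. All names and parameter values (`MEANS`, `sd_ngs`, `sd_array`, `stay_prob`) are hypothetical choices made for the example. A position not covered by a platform is passed as `None` and simply contributes no evidence from that platform.

```python
import math

STATES = ("del", "normal", "dup")
# Hypothetical mean of the normalized intensity score in each state.
MEANS = {"del": -1.0, "normal": 0.0, "dup": 1.0}


def gaussian_logpdf(x, mean, sd):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def emission_loglik(state, ngs, array, sd_ngs=0.5, sd_array=0.5):
    """Sum log-likelihoods over the platforms that observed this position."""
    total = 0.0
    if ngs is not None:
        total += gaussian_logpdf(ngs, MEANS[state], sd_ngs)
    if array is not None:
        total += gaussian_logpdf(array, MEANS[state], sd_array)
    return total


def viterbi(ngs_scores, array_scores, stay_prob=0.9):
    """Most likely state path given matched per-position scores from both platforms."""
    n = len(ngs_scores)
    if n == 0:
        return []
    switch_prob = (1.0 - stay_prob) / (len(STATES) - 1)
    log_trans = {
        (a, b): math.log(stay_prob if a == b else switch_prob)
        for a in STATES
        for b in STATES
    }
    # Uniform prior over states at the first position.
    score = {
        s: -math.log(len(STATES)) + emission_loglik(s, ngs_scores[0], array_scores[0])
        for s in STATES
    }
    backpointers = []
    for t in range(1, n):
        new_score, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + log_trans[(p, s)])
            pointers[s] = best_prev
            new_score[s] = (
                score[best_prev]
                + log_trans[(best_prev, s)]
                + emission_loglik(s, ngs_scores[t], array_scores[t])
            )
        backpointers.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

Calling `viterbi(ngs_scores, array_scores)` with two equal-length lists returns one state label per position. Because the per-platform log-likelihoods are summed, a position seen by both platforms carries more evidence than one seen by a single platform, which is the intuition behind combining matched NGS and SNP-array data.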

Subject Area

Bioinformatics|Statistics|Computer science|Artificial intelligence

Recommended Citation

Zhou, Zilu, "Statistical Methods for Multi-Omics Inference from Single Cell Transcriptome" (2020). Dissertations available from ProQuest. AAI27836300.