Scalable Machine Learning Methods For The Analysis Of Single-Cell Transcriptomics And Multiomics Data
Subject
Machine Learning
Multiomics
Single Cell
Statistics
Transcriptomics
Biostatistics
Abstract
Transcriptomics- and proteomics-based expression profiling technologies have become increasingly popular, more affordable, and more accurate in recent years. Profiling expression at single-cell resolution allows investigators to identify rare cell subtypes in human tissue that would otherwise be obscured in lower-resolution bulk sequencing technologies. Previously, investigators studied human cell populations by profiling RNA expression in single cells using single-cell RNA sequencing (scRNA-seq) technologies. More recently, multi-modality sequencing technologies such as Cellular Indexing of Transcriptomes and Epitopes by Sequencing (CITE-seq) have emerged, which allow investigators to profile multiple forms of biological expression (in this case RNA and protein expression) simultaneously in the same cells. Investigators can now study human biology in greater detail than ever before, but challenges remain. (1) Cell subpopulations are not always neatly separated from one another, which makes cell type classification difficult. (2) Technical batch effects often plague scRNA-seq studies and confound real biological signals. (3) Multi-modality technologies are powerful but remain expensive to run at scale. In this work, we seek to address these challenges in scRNA-seq and CITE-seq analyses. To address challenge (1), we propose a smooth pseudotemporal modeling approach which characterizes a cell’s identity as a mixture of two discrete identities, allowing for a continuous sliding-scale cell type rather than requiring cells to separate into discrete types. To address challenge (2), we propose an augmented autoencoder which uses a self-supervised Kullback–Leibler divergence along with a specialized branching architecture to correct for batch effects in the full gene expression feature space.
Lastly, to address challenge (3), we develop a hybrid feedforward-recurrent neural network approach which supports protein prediction, imputation, embedding, uncertainty quantification, and cell type label transfer, allowing users to leverage reference CITE-seq datasets to predict and study protein expression in larger single-modality RNA-only datasets. We validate each approach on real datasets with gold-standard true expression and experimentally validated cell type labels. We also demonstrate real use cases for our methods, such as improving downstream pseudotime analyses through batch correction and identifying biomarkers of immune response to an H1N1 vaccine.