Scalable Machine Learning Methods For The Analysis Of Single-Cell Transcriptomics And Multiomics Data

Degree type
Doctor of Philosophy (PhD)
Graduate group
Epidemiology & Biostatistics
Deep Learning
Machine Learning
Single Cell
Lakkis, Justin

Transcriptomics- and proteomics-based expression profiling technologies have become increasingly popular, more affordable, and more accurate in recent years. Expression profiling at single-cell resolution allows investigators to identify rare cell subtypes in human tissue that would otherwise be confounded in lower-resolution bulk sequencing technologies. Previously, investigators studied human cell populations by profiling RNA expression in single cells using single-cell RNA sequencing (scRNA-seq) technologies. More recently, multi-modality sequencing technologies such as Cellular Indexing of Transcriptomes and Epitopes by Sequencing (CITE-seq) have emerged, which allow investigators to profile multiple forms of biological expression (in this case RNA and protein expression) simultaneously in the same cells. Investigators can now study human biology in greater detail than ever before, but challenges remain. (1) Cell subpopulations are not always neatly separated from one another, which makes cell type classification difficult. (2) Technical batch effects often plague scRNA-seq studies and confound real biological signals. (3) Multi-modality technologies are powerful but remain expensive to run at scale. In this work, we seek to address these challenges in scRNA-seq and CITE-seq analyses. To address challenge (1), we propose a smooth pseudotemporal modeling approach which characterizes a cell’s identity as a mixture of two discrete identities, allowing for a continuous sliding-scale cell type rather than requiring cells to separate into discrete types. To address challenge (2), we propose an augmented autoencoder which uses a self-supervised Kullback–Leibler divergence, along with a specialized branching architecture, to correct for batch effects in the full gene expression feature space. Lastly, to address challenge (3), we develop a hybrid feedforward-recurrent neural network approach which supports protein prediction, imputation, embedding, uncertainty quantification, and cell type label transfer, allowing the user to use reference CITE-seq datasets to predict and study protein expression in larger single-modality, RNA-only datasets. We validate the utility of each of our approaches using real datasets with gold-standard true expression and experimentally validated cell type labels. We also demonstrate real use cases for our methods, such as improving downstream pseudotime analyses using batch correction and identifying biomarkers of the immune response to an H1N1 vaccine.
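As a rough illustration of the idea behind challenge (1), describing a cell as a mixture of two discrete identities, the sketch below places a cell on a sliding scale between two reference expression profiles. It is not the modeling approach developed in this work: the function `mixture_weight`, the toy profiles, and the plain least-squares projection are all assumptions chosen only to make the concept concrete.

```python
import numpy as np


def mixture_weight(cell, profile_a, profile_b):
    """Hypothetical helper: least-squares weight t in [0, 1] such that
    cell is approximately (1 - t) * profile_a + t * profile_b."""
    direction = profile_b - profile_a
    denom = float(direction @ direction)
    if denom == 0.0:
        raise ValueError("the two reference profiles must differ")
    # Project the cell onto the line from profile_a to profile_b.
    t = float((cell - profile_a) @ direction) / denom
    # Clip so the result stays a valid mixture proportion.
    return min(1.0, max(0.0, t))


# Toy three-gene expression profiles for two discrete cell identities (made up).
profile_a = np.array([5.0, 0.0, 1.0])
profile_b = np.array([0.0, 4.0, 1.0])

# A cell that sits one quarter of the way from identity A to identity B.
cell = 0.75 * profile_a + 0.25 * profile_b
weight = mixture_weight(cell, profile_a, profile_b)
print(weight)
```

A weight near 0 or 1 corresponds to a cell that resembles one discrete identity, while intermediate values describe cells in transition between the two.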

Mingyao Li