Scalable Machine Learning Methods For The Analysis Of Single-Cell Transcriptomics And Multiomics Data

Lakkis, Justin

Files

Lakkis_upenngdas_0175C_14903.pdf (52.95 MB)

Degree type

Doctor of Philosophy (PhD)

Graduate group

Epidemiology & Biostatistics

Subject

Deep Learning
Machine Learning
Multiomics
Single Cell
Statistics
Transcriptomics
Biostatistics

Copyright date

2022-09-17T20:21:00-07:00

Permalink

https://repository.upenn.edu/handle/20.500.14332/32017

Author

Lakkis, Justin

Abstract

Transcriptomics- and proteomics-based expression profiling technologies have become increasingly popular, more affordable, and more accurate in recent years. Expression profiling at single-cell resolution allows investigators to identify rare cell subtypes in human tissue that would otherwise be confounded in lower-resolution bulk sequencing technologies. Previously, investigators studied human cell populations by profiling RNA expression in single cells using single-cell RNA sequencing (scRNA-seq) technologies. More recently, multi-modality sequencing technologies such as Cellular Indexing of Transcriptomes and Epitopes by Sequencing (CITE-seq) have emerged, which allow investigators to profile multiple forms of biological expression (in this case RNA and protein expression) simultaneously in the same cells. Investigators can now study human biology in greater detail than ever before, but challenges remain. (1) Cell subpopulations are not always neatly separated from one another, which makes cell type classification difficult. (2) Technical batch effects often plague scRNA-seq studies and confound real biological signals. (3) Multi-modality technologies are powerful but remain expensive to apply at scale. In this work, we seek to address these challenges in scRNA-seq and CITE-seq analyses. To address challenge (1), we propose a smooth pseudotemporal modeling approach that characterizes a cell’s identity as a mixture of two discrete identities, allowing cell type to vary along a continuous scale rather than requiring cells to separate into discrete types. To address challenge (2), we propose an augmented autoencoder that uses a self-supervised Kullback–Leibler divergence together with a specialized branching architecture to correct for batch effects in the full gene expression feature space. Lastly, to address challenge (3), we develop a hybrid feedforward-recurrent neural network approach that supports protein prediction, imputation, embedding, uncertainty quantification, and cell type label transfer, allowing the user to use reference CITE-seq datasets to predict and study protein expression in larger single-modality, RNA-only datasets. We validate the utility of each of our approaches using real datasets with gold-standard true expression and experimentally validated cell type labels. We also demonstrate real use cases for our methods, such as improving downstream pseudotime analyses using batch correction and identifying biomarkers of immune response to an H1N1 vaccine.

Advisor

Mingyao Li

Date of degree

2021-01-01

Collection

Dissertations and Theses