Learning from Multiple and Heterogeneous Datasets

Chen, Shuxiao

Files

Chen_upenngdas_0175C_15543.pdf (29.71 MB)

Degree type

Doctor of Philosophy (PhD)

Graduate group

Statistics

Discipline

Statistics and Probability

Subject

Integrative data analysis
Multi-task learning
Single-cell biology

Copyright date

2022

Permalink

https://repository.upenn.edu/handle/20.500.14332/59697

Author

Chen, Shuxiao

Abstract

Advances in data-acquisition technologies have given statisticians access to multiple datasets that carry both globally overlapping and individually variable information. Focusing on applications in single-cell multi-omics, this dissertation develops statistical methodology and theory for estimating both the global and the individualized structures when multiple heterogeneous datasets are available. The dissertation has three parts. In the first part, we present MARIO, a robust pipeline for integrative analysis of multi-modal single-cell data that is particularly successful in low signal-to-noise ratio (SNR) scenarios. Currently available tools for single-cell data integration are designed mainly for transcriptomics data and generally rely on a large number of shared features across datasets. Those methods are unsuitable for single-cell proteomic datasets because of the limited number of parameters simultaneously accessed and the lack of shared markers across these experiments. Our algorithmic pipeline takes into account both shared and distinct features and includes vital filtering steps to avoid sub-optimal matching. MARIO accurately matches and integrates data from different single-cell proteomic and multi-modal methods, including spatial techniques, and has cross-species capabilities. The remaining two parts are theoretical investigations of two important modules of the MARIO pipeline. The second part discusses minimax optimal community detection in a multi-layer stochastic block model. We characterize the minimax rate for estimating both the global and the individualized community structures. We propose an algorithm, spectral initialization followed by maximum a posteriori (MAP) based refinement, that enjoys minimax optimality. This algorithm is a key step in MARIO’s quality control.
The third part concerns minimax optimal estimation of a latent correspondence between two datasets, where one is a noisy, permuted version of the other. We characterize the minimax rate of this problem. We further prove that a highly intuitive algorithm, which solves a linear assignment problem in the SVD-reduced space, achieves consistency, and sometimes minimax optimality, under regularity conditions. This algorithm is one of the major ingredients that enable MARIO’s robust performance in low SNR scenarios.
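
The matching idea in the third part can be illustrated with a minimal sketch. It is not the dissertation's implementation: the function name match_in_svd_space, the choice to take singular vectors from the stacked data, and the squared Euclidean cost are all assumptions made for illustration. Only numpy.linalg.svd and scipy.optimize.linear_sum_assignment are real library calls.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_in_svd_space(X, Y, rank):
    """Estimate pi such that row i of Y corresponds to row pi[i] of X.

    Hypothetical helper for illustration only. Assumes X and Y have the
    same shape (n samples by p features) and share a low-rank signal.
    """
    # Assumption: estimate the shared low-dimensional subspace from the
    # top right singular vectors of the stacked data.
    _, _, Vt = np.linalg.svd(np.vstack([X, Y]), full_matrices=False)
    V = Vt[:rank].T  # p x rank projection basis

    # Project both datasets into the SVD-reduced space.
    Xp = X @ V
    Yp = Y @ V

    # cost[i, j] = squared distance between projected Y[i] and X[j].
    diff = Yp[:, None, :] - Xp[None, :, :]
    cost = (diff ** 2).sum(axis=2)

    # Solve the linear assignment problem; for a square cost matrix the
    # returned row indices are 0..n-1, so cols[i] is the match for Y[i].
    _, cols = linear_sum_assignment(cost)
    return cols
```

Given two n-by-p arrays X and Y, match_in_svd_space(X, Y, rank=3) returns an integer array of length n. Reducing the dimension before matching discards directions dominated by noise, which is the intuition behind the low SNR robustness described in the abstract.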

Advisor

Ma, Zongming

Date of degree

2022

Collection

Dissertations and Theses