Learning from Multiple and Heterogeneous Datasets
Subject
Multi-task learning
Single-cell biology
Abstract
Advances in data-acquisition technologies have given statisticians access to multiple datasets that carry both globally shared and dataset-specific information. Focusing on applications in single-cell multi-omics, this dissertation develops statistical methodology and theory for estimating both the global and the individualized structures when multiple heterogeneous datasets are available. It is composed of three parts.

In the first part, we present MARIO, a robust pipeline for integrative analysis of multi-modal single-cell data that is particularly successful in scenarios with a low signal-to-noise ratio (SNR). Currently available tools for single-cell data integration are mainly designed for transcriptomics data and generally rely upon a large number of shared features across datasets. Those methods are unsuitable for single-cell proteomic datasets because of the limited number of parameters measured simultaneously and the lack of shared markers across these experiments. Our algorithmic pipeline takes into account both shared and distinct features and includes vital filtering steps to avoid sub-optimal matching. MARIO accurately matches and integrates data from different single-cell proteomic and multi-modal methods, including spatial techniques, and has cross-species capabilities.

The remaining two parts are theoretical investigations of two important modules of the MARIO pipeline. The second part discusses minimax optimal community detection in a multi-layer stochastic block model. We characterize the minimax rate for estimating both the global and individualized community structures, and we propose an algorithm, consisting of spectral initialization followed by maximum a posteriori (MAP) refinement, that is minimax optimal. This algorithm serves as a key step in MARIO’s quality control.
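As a rough illustration of the spectral initialization idea mentioned for the second part, the sketch below recovers a global two-community structure from several network layers by averaging their adjacency matrices and splitting on the sign of one eigenvector. It is not the dissertation's algorithm: the restriction to two balanced communities, the plain averaging of layers, and the omission of the MAP refinement and of the individualized (per-layer) structures are simplifying assumptions. The names `sample_layer` and `spectral_init` are hypothetical.

```python
import numpy as np


def sample_layer(z, p, q, rng):
    """Draw one symmetric adjacency matrix from a two-block SBM.

    z holds 0/1 community labels; nodes in the same community connect
    with probability p, nodes in different communities with probability q.
    """
    n = len(z)
    probs = np.where(z[:, None] == z[None, :], p, q)
    # Sample the strict upper triangle, then mirror it (no self-loops).
    upper = np.triu(rng.random((n, n)) < probs, k=1).astype(float)
    return upper + upper.T


def spectral_init(layers):
    """Assumed initialization: sign split of the averaged adjacency matrix.

    For an assortative two-block model (p > q), the eigenvector belonging
    to the second-largest eigenvalue separates the two communities.
    """
    a_bar = np.mean(layers, axis=0)
    _, vecs = np.linalg.eigh(a_bar)  # eigenvalues in ascending order
    return (vecs[:, -2] > 0).astype(int)


rng = np.random.default_rng(0)
z_true = np.repeat([0, 1], 100)  # 200 nodes, two balanced communities
layers = [sample_layer(z_true, 0.5, 0.1, rng) for _ in range(5)]
z_hat = spectral_init(layers)

# Community labels are only identified up to swapping 0 and 1.
agreement = np.mean(z_hat == z_true)
accuracy = max(agreement, 1.0 - agreement)
print(f"fraction of nodes correctly clustered: {accuracy:.3f}")
```

In a fuller treatment, an initial estimate of this kind would be passed to a refinement step that updates node labels one at a time; that step is left out here.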
The third part concerns minimax optimal estimation of a latent correspondence between two datasets, where one is a noisy, permuted version of the other. We characterize the minimax rate of this problem. We further prove that an intuitive algorithm, which solves a linear assignment problem in the SVD-reduced space, achieves consistency and, under regularity conditions, sometimes minimax optimality. This algorithm is one of the major ingredients that enable MARIO’s robust performance in low-SNR scenarios.
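The matching idea in the third part can be sketched in a few lines: project both datasets onto a low-dimensional singular subspace, then solve a linear assignment problem on the pairwise distances. This is a minimal sketch under stated assumptions, not the dissertation's exact procedure: the function name `match_rows_svd`, the choice to take the subspace from the first dataset alone, the squared Euclidean cost, and the simulated data are all illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist


def match_rows_svd(X, Y, rank):
    """Match each row of Y to a row of X after SVD dimension reduction.

    Returns an array `match` where row i of Y is paired with row match[i]
    of X. Assumes X and Y have the same shape.
    """
    # Top-`rank` right singular vectors of X span the reduced space
    # (an assumption; other choices of subspace are possible).
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    V = vt[:rank].T
    X_red, Y_red = X @ V, Y @ V

    # Cost of pairing row i of Y with row j of X in the reduced space.
    cost = cdist(Y_red, X_red, metric="sqeuclidean")
    _, cols = linear_sum_assignment(cost)
    return cols


# Simulated example: Y is a noisy, row-permuted copy of a low-rank X.
rng = np.random.default_rng(0)
n, p, r = 50, 20, 3
X = 10.0 * rng.normal(size=(n, r)) @ rng.normal(size=(r, p))
perm = rng.permutation(n)
Y = X[perm] + 0.01 * rng.normal(size=(n, p))

match = match_rows_svd(X, Y, rank=r)
print("fraction of rows matched correctly:", np.mean(match == perm))
```

Projecting before matching discards the noise that lies outside the signal subspace, which is the intuition for why this approach can remain reliable when the SNR is low.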