Topics in Generalized Correlation Analysis and Clustering
In the era of big data, generating large volumes of data from multiple sources on a shared group of subjects has become increasingly prevalent. The availability of abundant computational resources and advances in data acquisition technology have made the integration of information from multimodal measurements essential. The objective of this integration is to develop efficient algorithms that facilitate a deeper understanding of the shared subjects, despite variations in the contexts of the multimodal information. This dissertation explores multimodal data analysis from the standpoints of algorithms, theory, and applications in various fields. It consists of three main components.

The first part of the dissertation studies sparse generalized correlation analysis (sparse GCA). We first formulate sparse GCA as a generalized eigenvalue problem at both the population and sample levels via a careful choice of normalization constraints. We then present a computationally efficient algorithm for solving sparse GCA when the data potentially contain multiple generalized correlation tuples and the loading matrix has a small number of nonzero rows. We also establish theoretical guarantees for the proposed algorithm and provide a corresponding information-theoretic lower bound for estimating GCA loading matrices.

In the second part of the dissertation, we delve deeper into the application of sparse GCA to multimodal datasets. We develop a modified algorithm that solves sparse GCA in a layerwise fashion when the row-sparsity condition is violated. Using a nested cross-validation procedure, we apply layerwise sparse GCA to the Philadelphia Neurodevelopmental Cohort (PNC) study. This enables us to reveal the correlation structure of covariates across multiple datasets encompassing neuroimaging data, a wide range of clinical and cognitive phenotypes, and demographic information.
In the third part of the dissertation, we study the problem of cell-type clustering with multimodal information. We introduce CellSNAP, a clustering pipeline that integrates feature expression, cellular neighborhood, and local tissue-level morphology information into a novel embedding. To showcase the effectiveness of CellSNAP, we apply it to the murine spleen dataset, which comprises multimodal single-cell measurements.