TRANSFER LEARNING IN CLASSIFICATION AND REGRESSION WITH SUMMARY STATISTICS

Degree type
Doctor of Philosophy (PhD)
Graduate group
Epidemiology and Biostatistics
Discipline
Biology
Statistics and Probability
Subject
Classification
GWAS
High-dimensional models
Linear discriminant analysis
Summary statistics
Transfer learning
Copyright date
2023
Author
Zheng, Haotian
Abstract

Transfer learning is one of the most active research areas in statistical learning. In this dissertation, we develop transfer learning methods for high-dimensional settings. In Chapter 2, we develop a transfer learning method for linear discriminant analysis (Trans-LDA) that uses information from auxiliary data sets to build a better classification rule for the target study. The method allows for both homogeneous and heterogeneous covariance matrices across studies. In addition, we introduce an adaptive method, combined with model aggregation, that identifies which auxiliary data sets are informative for transfer learning. We show that, under some assumptions, Trans-LDA has a smaller error rate in estimating the discriminant direction and a smaller classification error. We illustrate the proposed methods by building a classifier for cardiovascular risk in different groups of patients with chronic kidney disease based on blood proteomics data, and show that leveraging data sets from different patient subgroups improves classification. In Chapter 3, we consider estimation and prediction for a high-dimensional linear regression model in a transfer learning setting where only summary statistics are observed for the auxiliary studies, together with external data for estimating linkage disequilibrium (LD). We develop a method for estimating the regression coefficients and the polygenic risk score (PRS) in the target model based on data in the target study, summary statistics from the auxiliary studies, and external data for estimating the LD matrix. We show that using the summary statistics of the auxiliary studies in transfer learning improves estimation of the model parameters and the PRS. However, the convergence rate for estimation is slower than that of transfer learning methods that use individual-level data. We show that such transfer learning methods lead to better predictions of lipid phenotypes using data from the Penn Medicine Biobank and GWAS summary statistics from the UK Biobank.
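
For orientation, the "discriminant direction" referred to in the abstract is, in classical two-class LDA, the pooled inverse covariance applied to the difference of class means. The sketch below illustrates only that classical rule on simulated data; it is not the dissertation's Trans-LDA estimator and uses no auxiliary data sets. The function names, the small ridge term, equal class priors, and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def lda_direction(x1, x2, ridge=1e-3):
    """Classical LDA: direction = (pooled covariance)^-1 (mu1 - mu2)."""
    mu1, mu2 = x1.mean(axis=0), x2.mean(axis=0)
    n1, n2 = len(x1), len(x2)
    # Pooled within-class covariance (homogeneous-covariance assumption).
    pooled = ((n1 - 1) * np.cov(x1, rowvar=False)
              + (n2 - 1) * np.cov(x2, rowvar=False)) / (n1 + n2 - 2)
    # Hypothetical ridge term to keep the solve stable; not from the source.
    pooled = pooled + ridge * np.eye(pooled.shape[0])
    beta = np.linalg.solve(pooled, mu1 - mu2)
    midpoint = (mu1 + mu2) / 2
    return beta, midpoint

def lda_predict(x, beta, midpoint):
    # Assign class 1 when (x - midpoint)^T beta > 0, else class 2
    # (equal class priors assumed).
    return np.where((x - midpoint) @ beta > 0, 1, 2)

# Simulated two-class data, for illustration only.
rng = np.random.default_rng(0)
p = 5
shift = np.full(p, 1.5)
x1 = rng.normal(size=(200, p)) + shift
x2 = rng.normal(size=(200, p)) - shift

beta, midpoint = lda_direction(x1, x2)
pred = lda_predict(np.vstack([x1, x2]), beta, midpoint)
truth = np.repeat([1, 2], 200)
accuracy = (pred == truth).mean()
```

In a high-dimensional setting the sample covariance is poorly conditioned, which is one reason a plain estimate like this one degrades and borrowing information from auxiliary studies becomes attractive.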

Advisor
Li, Hongzhe
Date of degree
2023