Advancing Deep Generative Imputation Models for Complex Incomplete Data
Abstract
Missing data are ubiquitous in real-world applications and, if not adequately handled, can lead to loss of information and biased findings in downstream analysis. Imputation is a popular way to handle missing values. Deep learning methods have become increasingly powerful and often outperform traditional methods, and imputation is no exception. This thesis uses modern deep generative models to build three imputation methods that handle complex missing values in different scenarios. In the first part, we propose Multiple Imputation via Generative Adversarial Network (MI-GAN), a GAN-based multiple imputation (MI) method that works under the missing at random (MAR) mechanism with theoretical support. MI-GAN is designed for high-dimensional blockwise missing data. Such data are prevalent in multi-omics studies, in which each omics modality is measured in a different subset of study subjects. MI-GAN leverages recent progress in conditional generative adversarial networks and matches existing state-of-the-art imputation methods in terms of imputation error. In particular, MI-GAN significantly outperforms other imputation methods in terms of statistical inference and computational speed. It is well known that training GAN-based models is unstable and that they suffer from mode collapse. To avoid these issues, in the second part we leverage recent advances in neural network Gaussian process (NNGP) theory from a Bayesian viewpoint. At a high level, NNGP theory states that a neural network is equivalent to a Gaussian process when certain conditions are satisfied. The network itself does not need to be trained, and the corresponding Gaussian process can be used for inference. We propose two NNGP-based MI methods, collectively called MI-NNGP, that draw multiple imputations for missing values from a joint (posterior predictive) distribution.
MI-NNGP also targets blockwise missing data. The MI-NNGP methods are shown to significantly outperform existing state-of-the-art methods on synthetic and real datasets in terms of imputation error, statistical inference, robustness to missing rates, and computation cost, under three missing data mechanisms: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). In the third part, we aim to impute missing values in temporal electronic health records (EHR) data. We propose Similarity-Aware Diffusion Model-Based Imputation (SADI), a novel imputation method that leverages the diffusion model and utilizes information across dependent variables. We apply SADI to incomplete temporal EHR data and propose a similarity-aware denoising function, which includes a self-attention mechanism to model the correlations between time points, features, and similar patients. To the best of our knowledge, this is the first time that information from similar patients has been used directly to construct imputations for incomplete temporal EHR data. Our extensive experiments on two datasets, the Critical Path For Alzheimer’s Disease (CPAD) data and the PhysioNet Challenge 2012 data, show that SADI outperforms current state-of-the-art methods under various missing data mechanisms, including MCAR, MAR, and MNAR.
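
The three missing data mechanisms named in the abstract differ only in what the probability of missingness is allowed to depend on. The sketch below is an illustration of that distinction, not the simulation design used in the thesis. It assumes a two-column record in which the first value is always observed and the second may be masked, and it assumes a logistic form for the missingness probability. The function names and the `base_rate` and `slope` parameters are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def missing_probability(row, mechanism, base_rate=0.3, slope=2.0):
    """Probability that row[1] is masked; row[0] is assumed always observed.

    The logistic form and both parameters are illustrative assumptions.
    """
    x_observed, x_target = row
    # Offset chosen so the probability equals base_rate when the driver is 0.
    offset = math.log(base_rate / (1.0 - base_rate))
    if mechanism == "MCAR":
        # Depends on nothing in the data.
        return base_rate
    if mechanism == "MAR":
        # Depends only on the fully observed variable.
        return sigmoid(offset + slope * x_observed)
    if mechanism == "MNAR":
        # Depends on the value that may itself go missing.
        return sigmoid(offset + slope * x_target)
    raise ValueError(f"unknown mechanism: {mechanism}")


def apply_missingness(rows, mechanism, seed=0):
    """Return a copy of rows in which row[1] is replaced by None when masked."""
    rng = random.Random(seed)
    masked = []
    for row in rows:
        p = missing_probability(row, mechanism)
        target = None if rng.random() < p else row[1]
        masked.append((row[0], target))
    return masked


if __name__ == "__main__":
    rng = random.Random(42)
    data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(1000)]
    for mechanism in ("MCAR", "MAR", "MNAR"):
        masked = apply_missingness(data, mechanism)
        rate = sum(r[1] is None for r in masked) / len(masked)
        print(mechanism, "observed missing rate:", round(rate, 3))
```

Under MNAR the masked values are systematically different from the observed ones in a way the observed data cannot reveal, which is why that mechanism is generally the hardest of the three for any imputation method.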