ADVANCING ROBUST AND FAIR STATISTICAL AND MACHINE LEARNING MODELS FOR INCOMPLETE DATA

Degree type
Doctor of Philosophy (PhD)
Graduate group
Applied Mathematics and Computational Science
Discipline
Mathematics
Statistics and Probability
Computer Sciences
Subject
Algorithmic Fairness
Machine Learning
Missing Data
Statistics
Copyright date
2023
Author
Zhang, Yiliang
Abstract

Missing data is pervasive in the era of big data, and if inadequately addressed, it can lead to biased findings and adversely affect data-driven decision-making. To counteract this, numerous missing value imputation methods have been developed. However, many of these methods become less effective in high-dimensional settings due to limited information from the dataset, undermining the trustworthiness of imputation results. This thesis investigates missing data imputation, specifically considering high dimensionality and fairness. The first chapter provides an overview of existing missing data imputation methods and introduces the associated challenges.

In the second chapter, we approach missing data imputation from a nonparametric density estimation perspective and develop Max-Random Forest, a new nonparametric density estimator. Under Lipschitz assumptions, we present non-asymptotic bounds for estimation errors in both $L^2$ distance and squared Hellinger distance for a simplified version of the proposed method. We also extend the Max-Random Forest algorithm to a conditional density estimator, which accurately estimates and samples from high-dimensional conditional densities. We demonstrate the utility of the proposed density estimator and imputation method, showing that Max-Random Forest achieves state-of-the-art performance.

The third chapter examines the trustworthiness of missing data imputation. We conduct the first known research into potential biases generated through imputation. By analyzing the performance of imputation methods across three commonly used datasets, we show that unfairness in missing value imputation is widespread and may be attributed to multiple factors. Our results suggest that a thorough investigation of these factors can offer valuable insights for mitigating unfairness in missing data imputation.

Motivated by the previous chapter, the fourth chapter develops fairness-aware imputation methods based on deep generative models. We formulate fairness in the imputation process from both causal and non-causal perspectives. From a non-causal standpoint, we first derive a positive information-theoretic lower bound for imputation fairness when using the ground-truth conditional distribution for missing data imputation. We then propose a novel missing data imputation model, the Fairness-Aware Imputation GAN (FIGAN), which delivers accurate imputations while maintaining imputation fairness. From a causal perspective, we provide a theoretical analysis for a specific class of datasets to achieve causal fairness and introduce a new missing data imputation method, the Causal-Equality-Driven Imputation Network (CEDIN), which is causal fairness-aware. We demonstrate the effectiveness of CEDIN through theoretical analysis and empirical studies, underscoring the importance of considering causal inference in addressing fairness in missing data imputation.

In the fifth chapter, we explore the problem of estimating fairness using incomplete data. A prevalent analytical approach for handling missing data is to use only complete cases, i.e., observations with fully observed features, to train a prediction algorithm. However, depending on the missing data mechanism, the distribution of complete cases and complete data may differ significantly. When the goal is to develop a fair algorithm for the complete data domain with no missing values, an algorithm that is fair in the complete case domain may exhibit considerable bias towards marginalized groups in the complete data domain. We provide upper and lower bounds on the fairness estimation error and conduct numerical experiments to evaluate our theoretical results. Our work presents the first known theoretical results on fairness guarantees in the analysis of incomplete data.
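
The complete-case point made about the fifth chapter can be illustrated with a small worked example. The sketch below is not taken from the thesis: the toy population counts, the helper names `positive_rate` and `parity_gap`, and the choice of the demographic parity gap as the fairness measure are all assumptions made for illustration. It shows a predictor whose positive-prediction rates are equal across two groups among complete cases, yet differ once the records dropped for missing features are counted.

```python
from fractions import Fraction

# Hypothetical toy population (not from the thesis).
# Each row: (group, predicted label, is a complete case, number of people).
ROWS = [
    ("A", 1, True, 50),
    ("A", 0, True, 50),
    ("B", 1, True, 25),
    ("B", 0, True, 25),
    ("B", 0, False, 50),  # group-B negatives that have missing features
]


def positive_rate(rows, group, complete_only):
    """Share of positive predictions in a group, optionally on complete cases only."""
    kept = [r for r in rows if r[0] == group and (r[2] or not complete_only)]
    total = sum(r[3] for r in kept)
    positives = sum(r[3] for r in kept if r[1] == 1)
    return Fraction(positives, total)


def parity_gap(rows, complete_only):
    """Absolute difference in positive-prediction rates between groups A and B."""
    rate_a = positive_rate(rows, "A", complete_only)
    rate_b = positive_rate(rows, "B", complete_only)
    return abs(rate_a - rate_b)


gap_complete_cases = parity_gap(ROWS, complete_only=True)
gap_complete_data = parity_gap(ROWS, complete_only=False)
print("gap on complete cases:", gap_complete_cases)
print("gap on complete data: ", gap_complete_data)
```

In this toy setup missingness depends on both the group and the predicted label, so the complete-case gap is zero while the complete-data gap is one quarter. A fairness estimate computed only on complete cases would therefore understate the disparity.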

Advisor
Long, Qi
Su, Weijie
Date of degree
2023