Mining for Health: Advancing trustworthy statistical and machine learning methods for complex electronic health records data
Abstract
Electronic health records (EHRs) consist of data collected each time a patient interacts with the healthcare system. These data may include structured data such as labs and vitals, codified data such as diagnoses, prescriptions, and procedures, and unstructured data such as doctors' notes and pathology reports. EHRs were originally meant for billing purposes, but they have recently shown great promise for early disease detection, treatment evaluation, and medical information discovery. They are, however, very complex: they contain both unstructured and structured data, and are collected at irregular time intervals and varying frequencies. EHRs can also reflect health inequity: for example, patients with less access to healthcare, often people of color or people with lower socioeconomic status, tend to have more incomplete data in EHRs. Many of these issues can contribute to biased data in EHRs. As such, EHR data present daunting analytical challenges. If the goal is to build prediction models for clinical decision support, the complexity of these data leads to a myriad of challenges, such as the inability to use classical statistical models, missing data, algorithmic fairness, and explainability. But harnessing the complex structure of EHRs yields many benefits: less time spent on pre-processing, and the ability to leverage rich, contextual patient information to inform clinical decisions. This dissertation aims to address the challenges of complex EHRs. We begin by investigating the use of various language models for codified data in disease prediction models. We then develop a novel framework for simulating missing data in EHRs that reflect varying levels of access to healthcare, and demonstrate the impact of missingness on model performance. We also investigate the impact of informative missingness on the outcomes of COVID-19 patients from multiple health systems across the globe.
Finally, we develop our own deep learning framework that leverages all aspects of complex EHR data and incorporates recent advances in generative and explainable AI. We demonstrate that our method achieves superior model performance for 1-year mortality prediction.