Date of Award
Doctor of Philosophy (PhD)
Genomics & Computational Biology
Jason H. Moore
Casey S. Greene
The widespread adoption of Electronic Health Records (EHRs) means an unprecedented amount of patient treatment and outcome data is available to researchers. Research is a tertiary priority in the EHR, where the priorities are patient care and billing. Because of this, the data is not standardized or formatted in a manner easily adapted to machine learning approaches. Data may be missing for a large variety of reasons ranging from individual input styles to differences in clinical decision making, for example, which lab tests to issue. Few patients are annotated at a research quality, limiting sample size and presenting a moving gold standard. Patient progression over time is key to understanding many diseases but many machine learning algorithms require a snapshot, at a single time point, to create a usable vector form. In this dissertation, we develop new machine learning methods and computational workflows to extract hidden phenotypes from the Electronic Health Record (EHR). In Part 1, we use a semi-supervised deep learning approach to compensate for the low number of research quality labels present in the EHR. In Part 2, we examine and provide recommendations for characterizing and managing the large amount of missing data inherent to EHR data. In Part 3, we present an adversarial approach to generate synthetic data that closely resembles the original data while protecting subject privacy. We also introduce a workflow to enable reproducible research even when data cannot be shared. In Part 4, we introduce a novel strategy to first extract sequential data from the EHR and then demonstrate the ability to model these sequences with deep learning.
Beaulieu-Jones, Brett Kreigh, "Machine Learning Methods To Identify Hidden Phenotypes In The Electronic Health Record" (2017). Publicly Accessible Penn Dissertations. 2955.