Machine Learning Methods To Identify Hidden Phenotypes In The Electronic Health Record

Beaulieu-Jones, Brett Kreigh

Files

BEAULIEUJONES_upenngdas_0175C_13002.pdf (8.77 MB)

Degree type

Doctor of Philosophy (PhD)

Graduate group

Genomics & Computational Biology

Subject

deep learning
electronic health record
electronic phenotyping
machine learning
semi-supervised learning
Bioinformatics
Genetics

Copyright date

2018-09-27T20:17:00-07:00

Permalink

https://repository.upenn.edu/handle/20.500.14332/29926

Author

Beaulieu-Jones, Brett Kreigh

Abstract

The widespread adoption of Electronic Health Records (EHRs) means an unprecedented amount of patient treatment and outcome data is available to researchers. Research, however, is a tertiary priority in the EHR, behind patient care and billing. Because of this, the data is not standardized or formatted in a manner easily adapted to machine learning approaches. Data may be missing for a wide variety of reasons, ranging from individual input styles to differences in clinical decision making, such as which lab tests to order. Few patients are annotated at research quality, which limits sample size and presents a moving gold standard. Patient progression over time is key to understanding many diseases, but many machine learning algorithms require a snapshot at a single time point to create a usable vector form. In this dissertation, we develop new machine learning methods and computational workflows to extract hidden phenotypes from the EHR. In Part 1, we use a semi-supervised deep learning approach to compensate for the low number of research-quality labels present in the EHR. In Part 2, we examine the large amount of missing data inherent to the EHR and provide recommendations for characterizing and managing it. In Part 3, we present an adversarial approach to generate synthetic data that closely resembles the original data while protecting subject privacy. We also introduce a workflow to enable reproducible research even when data cannot be shared. In Part 4, we introduce a novel strategy to extract sequential data from the EHR and then demonstrate that these sequences can be modeled with deep learning.

Advisor

Jason H. Moore
Casey S. Greene

Date of degree

2017-01-01

Collection

Dissertations and Theses