EXTRACTING INSIGHTS FROM ELECTRONIC HEALTH RECORDS USING OPTIMIZED LARGE LANGUAGE MODELS

Degree type
Doctor of Philosophy (PhD)
Graduate group
Bioengineering
Discipline
Engineering
Bioinformatics
Data Science
Subject
Artificial Intelligence
Clinical Translation
Epilepsy
Informatics
Large Language Models
Natural Language Processing
Copyright date
2024
Author
Xie, Kevin
Abstract

The Electronic Health Record (EHR) contains extensive patient clinical information, including demographic and socioeconomic information; laboratory, imaging, and diagnostic results; treatment plans; and comprehensive records of patient medical histories. This wealth of information makes the EHR especially suitable for retrospective studies: clinicians and researchers can draw new conclusions, potentially at reduced cost, by looking back through information gathered during patient-healthcare interactions. However, the most valuable information is captured in unstructured free-text clinical notes, which precludes simple data mining methods and instead favors time-consuming and expensive manual chart review. To address this gap, I developed a Natural Language Processing (NLP) approach that uses modern techniques to drive large-scale retrospective clinical informatics research through the EHR. I demonstrate these techniques on epilepsy, a neurological disorder with complex phenotypes and heterogeneous patient populations. First, I created an NLP pipeline by fine-tuning Transformer language models to read, understand, and extract critical epilepsy outcome measures (seizure freedom, seizure frequency, and date of last seizure) from unstructured note text; this pipeline was found to rival trained humans in this task. I further tested the generalizability of these models in new clinical contexts. Using these models, I extracted seizure outcomes from the EHR in our health system. I used these data to closely study the long-term seizure dynamics of patients with epilepsy, finding that the majority of them experienced periods of seizure freedom interspersed with epileptic episodes. I also used these data both to investigate demographic biases in Transformer models and to elucidate how seizure outcomes were influenced by demographic factors; I found a lack of evidence of model bias, and that female patients, patients on public insurance, and patients from lower-income ZIP codes fared substantially worse than their counterparts. Finally, I conducted a large-scale retrospective comparative effectiveness trial of anti-seizure medications using a rigorous causal inference and statistical framework. The results of this thesis demonstrate that NLP can unlock the information stored in the EHR to conduct clinical informatics research at scale.

Advisor
Litt, Brian
Date of degree
2024