Applying Language Models To Patient Health Records: Acronym Expansion, Long Document Classification and Explainable Predictions
Subject
Explainable Predictions
Large Language Models
Machine Learning
Natural Language Processing
Abstract
The health industry is experiencing a digital transformation, with Electronic Health Records (EHRs) becoming central repositories for an ever-growing volume of patient data. While EHR clinical notes offer rich, detailed insights into patient conditions, treatments, and outcomes, extracting meaningful information from them remains a significant challenge due to their unstructured nature, the widespread use of acronyms and medical jargon, and varying writing styles. This dissertation addresses three challenges in applying Machine Learning (ML) and Natural Language Processing (NLP) to clinical text: (1) understanding medical acronyms in context; (2) building models that analyze multi-modal patient EHR data (structured and unstructured), including lengthy clinical notes, to study a stigmatized condition, namely opioid prescribing patterns and opioid use disorder (OUD) risk; and (3) developing explainable models that use clinical notes for complex diagnoses such as dementia. The thesis makes three main contributions. The first introduces CLASSE GATOR, a novel system for disambiguating medical acronyms using a distantly supervised approach. Our method leverages medical research papers for contextual learning, eliminating the need for expensive manual annotation, and achieves an average accuracy of 63% on MIMIC-III clinical notes, dramatically reducing the cost of clinical annotation for these acronyms. The second contribution is a novel deep learning architecture that combines structured EHR data (e.g., demographics and billing-code diagnoses) with unstructured EHR data (i.e., clinical notes) to predict a stigmatized health disorder, achieving F1 scores of 0.88 for opioid prescription likelihood and 0.82 for OUD diagnosis.
The model uniquely handles multiple data types and variable-length clinical notes through a combination of feed-forward layers, transformers, and a Hierarchical Attention Model built on ClinicalBERT. Finally, we present a new approach to improving the interpretability of clinical prediction models using Concept Bottleneck Models (CBMs). By leveraging the Oxford Textbook of Medicine and GPT-4, we extract 254 clinically relevant features for dementia and demonstrate superior performance in dementia type prediction (0.72 accuracy) compared to baseline models (0.64 for an n-gram Logistic Regression baseline and 0.48 for a GPT-4 baseline) while maintaining interpretability. Across all contributions, this dissertation emphasizes scalable, interpretable methods that minimize reliance on manual data curation, aiming to create computational tools that are both effective and deployable across diverse healthcare settings.
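To illustrate the Concept Bottleneck idea described above, the following is a minimal sketch (not the dissertation's implementation): the model first scores human-readable clinical concepts, then a simple linear layer maps those concepts to a diagnosis, so every prediction can be traced back through the concept layer. All names, weights, and dimensions here are illustrative assumptions; the actual model uses 254 textbook-derived dementia features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration only (the real model uses 254 concepts).
n_inputs, n_concepts, n_classes = 10, 4, 3

# Hypothetical weights standing in for trained parameters.
W_concept = rng.normal(size=(n_inputs, n_concepts))  # input -> concept scores
W_label = rng.normal(size=(n_concepts, n_classes))   # concepts -> diagnosis

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x):
    # Interpretable bottleneck: each entry is a score for a named concept.
    concepts = sigmoid(x @ W_concept)
    logits = concepts @ W_label
    return concepts, int(np.argmax(logits))

x = rng.normal(size=n_inputs)      # one patient's feature vector
concepts, label = predict(x)
print(concepts.round(2), label)
```

Because the final decision depends only on the concept scores, a clinician can inspect which concepts drove a given prediction, which is the interpretability property the dissertation targets.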
Advisor
Boland, Mary Regina