Applying Language Models To Patient Health Records: Acronym Expansion, Long Document Classification and Explainable Predictions

Degree type
PhD
Graduate group
Computer and Information Science
Discipline
Computer Sciences
Subject
Electronic Health Records
Explainable Predictions
Large Language Models
Machine Learning
Natural Language Processing
Copyright date
01/01/2025
Author
Kashyap, Aditya, M
Abstract

The health industry is experiencing a digital transformation, with Electronic Health Records (EHRs) becoming central repositories for an ever-growing volume of patient data. While EHR clinical notes offer rich, detailed insights into patient conditions, treatments, and outcomes, extracting meaningful information from these notes remains a significant challenge due to their unstructured nature, the widespread use of acronyms and medical jargon, and varying writing styles. This dissertation addresses three challenges in applying Machine Learning (ML) and Natural Language Processing (NLP) to clinical text: (1) understanding medical acronyms in context, (2) building models that can analyze multi-modal EHR data (structured and unstructured), including lengthy clinical notes, to study a stigmatized condition, namely opioid prescribing patterns and opioid use disorder (OUD) risk, and (3) developing explainable models that use clinical notes for complex diagnoses such as dementia. This thesis makes three main contributions. The first contribution introduces CLASSE GATOR, a novel system for disambiguating medical acronyms using a distantly supervised approach. The method leverages medical research papers for contextual learning and eliminates the need for expensive manual annotation, dramatically reducing the cost of clinical annotation for these acronyms, and achieves an average accuracy of 63% on MIMIC-III clinical notes. The second contribution is a novel deep learning architecture that leverages structured EHR data (e.g., demographics, billing code diagnosis data) and unstructured EHR data (i.e., clinical notes) to predict both opioid prescription likelihood and OUD diagnosis, achieving F1 scores of 0.88 and 0.82, respectively. The model handles multiple data types and variable-length clinical notes through a combination of feed-forward layers, transformers, and a Hierarchical Attention Model built on ClinicalBERT. Finally, we present a new approach to improving interpretability in clinical prediction models using Concept Bottleneck Models (CBMs). By leveraging the Oxford Textbook of Medicine and GPT-4, we extract 254 clinically relevant features for dementia and demonstrate superior performance in dementia type prediction (0.72 accuracy) compared to baseline models (0.64 for an n-gram logistic regression baseline and 0.48 for a GPT-4 baseline) while maintaining interpretability. Across all contributions, this dissertation emphasizes scalable, interpretable methods that minimize reliance on manual data curation, aiming to create computational tools that are both effective and deployable across diverse healthcare settings.
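
As a rough illustration of the Concept Bottleneck idea mentioned in the abstract, the sketch below predicts a label only from a small vector of human-readable concept values, so each prediction can be traced back to per-concept contributions. This is a minimal sketch under stated assumptions, not the dissertation's model: the concept names, the labels `type_A` and `type_B`, and all weights are hypothetical, and the concept values are assumed to have been obtained already by some upstream step.

```python
import math

# Hypothetical concepts and weights, for illustration only.
# The dissertation reports 254 extracted features; these three are made up.
CONCEPTS = ["memory_loss", "visual_hallucinations", "gait_disturbance"]

# One weight per concept for each (hypothetical) label.
WEIGHTS = {
    "type_A": [1.5, -0.5, -0.5],
    "type_B": [-0.5, 1.5, 0.5],
}


def softmax(scores):
    """Convert raw label scores into probabilities that sum to 1."""
    top = max(scores.values())
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def predict(concept_values):
    """Predict a label using only the concept layer (the bottleneck).

    Returns the predicted label, the label probabilities, and the
    contribution of each concept to the predicted label's score.
    """
    x = [concept_values[c] for c in CONCEPTS]
    scores = {
        label: sum(w * v for w, v in zip(ws, x))
        for label, ws in WEIGHTS.items()
    }
    probs = softmax(scores)
    label = max(probs, key=probs.get)
    contributions = {
        c: w * v for c, w, v in zip(CONCEPTS, WEIGHTS[label], x)
    }
    return label, probs, contributions


# Concept values are assumed inputs (1.0 = present, 0.0 = absent).
label, probs, contributions = predict(
    {"memory_loss": 1.0, "visual_hallucinations": 0.0, "gait_disturbance": 0.0}
)
```

Because the final layer sees nothing but the concept values, the `contributions` dictionary serves as the explanation for the prediction, which is the property that makes this family of models interpretable.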

Advisor
Callison-Burch, Chris
Boland, Mary Regina
Date of degree
2025