Privacy-Preserving Distributed Regression Algorithms For Analysis Of Multi-Site Real-World Data

Degree type
Doctor of Philosophy (PhD)
Graduate group
Epidemiology & Biostatistics
Discipline
Subject
data privacy
distributed algorithm
electronic health records
Poisson regression
zero-inflation
Biostatistics
Statistics and Probability
Copyright date
2022-09-09T20:21:00-07:00
Author
Edmondson, Mackenzie John
Contributor
Abstract

Real-world data, including electronic health records and administrative claims data, are widely used in modern healthcare research to generate real-world evidence for improving patient care. The widespread availability of observational data from a variety of institutions has prompted many large-scale, multi-site studies in recent years. Studies incorporating data from multiple institutions often attain results more generalizable than those from single-site studies and offer improved power for studying rare outcomes or exposures. Various challenges concerning patient-level data sharing, primarily those related to data privacy, have made distributed data analysis a practical alternative to analyzing centralized data in multi-site studies. Under a distributed data analysis framework, patient-level data are not shared across institutions. Instead, aggregated data are communicated to a coordinating site to obtain analysis results. While methods for performing distributed analyses are increasingly available, analytical methods for binary and count outcomes are limited. In this work, we propose two distributed regression algorithms for modeling count outcomes in multi-site studies. The first algorithm uses distributed quasi-Poisson regression to model counts while accounting for institution-specific heterogeneity in the outcome. The second uses distributed hurdle regression to model counts subject to zero-inflation. Both algorithms are communication-efficient and highly accurate, requiring at most two or three rounds of communication among participating institutions and achieving results close to those obtained using pooled regression of all patient-level data, a method usable only if data are centralized. We evaluate the performance of each method through simulations and applications to real-world clinical research networks.
Finally, we illustrate a novel application of a distributed generalized linear mixed modeling algorithm with binary outcomes to study the effect of admitting hospital on racial disparities in mortality for patients hospitalized with COVID-19 via counterfactual modeling.
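
To make the idea of sharing only aggregated data concrete, the following is a minimal sketch of a generic distributed Poisson regression fitted by Newton's method. It is not the dissertation's algorithm: it ignores overdispersion and zero-inflation, and it needs several communication rounds rather than the two or three reported above. All function names, the simulated data, and the coefficient values are hypothetical and chosen only for illustration.

```python
import numpy as np

def site_summaries(X, y, beta):
    """Run locally at one site; returns only aggregate quantities.

    The gradient (length p) and Hessian (p x p) of the Poisson
    log-likelihood are sums over patients, so no patient-level
    rows leave the site.
    """
    mu = np.exp(X @ beta)                 # fitted means under a log link
    grad = X.T @ (y - mu)                 # score vector
    hess = X.T @ (X * mu[:, None])        # observed information
    return grad, hess

def distributed_poisson_fit(sites, n_features, n_rounds=25, tol=1e-10):
    """Coordinating site: sum the site summaries, take a Newton step."""
    beta = np.zeros(n_features)
    for _ in range(n_rounds):
        # One communication round: broadcast beta, collect summaries.
        summaries = [site_summaries(X, y, beta) for X, y in sites]
        grad = sum(g for g, _ in summaries)
        hess = sum(h for _, h in summaries)
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def make_sites(seed=0, n_sites=3, n_per_site=500):
    """Simulate hypothetical site-level data (illustration only)."""
    rng = np.random.default_rng(seed)
    true_beta = np.array([0.5, -0.3, 0.2])   # assumed values
    sites = []
    for _ in range(n_sites):
        covariates = rng.normal(size=(n_per_site, 2))
        X = np.column_stack([np.ones(n_per_site), covariates])
        y = rng.poisson(np.exp(X @ true_beta))
        sites.append((X, y))
    return sites

sites = make_sites()
beta_distributed = distributed_poisson_fit(sites, n_features=3)

# Pooled fit for comparison: possible here only because the data are simulated.
X_pooled = np.vstack([X for X, _ in sites])
y_pooled = np.concatenate([y for _, y in sites])
beta_pooled = distributed_poisson_fit([(X_pooled, y_pooled)], n_features=3)
```

Because the gradient and Hessian are additive across sites, this iterative scheme reproduces the pooled estimate up to floating-point error, at the cost of one communication round per Newton step. Reducing that cost to a small fixed number of rounds is the kind of communication efficiency the abstract describes.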

Advisor
Yong Chen
Date of degree
2021-01-01