Date of Award


Degree Type


Degree Name

Doctor of Philosophy (PhD)

Graduate Group

Epidemiology & Biostatistics

First Advisor

Yong Chen


Real-world data, including electronic health records and administrative claims data, are widely used in modern healthcare research to generate real-world evidence for improving patient care. The widespread availability of observational data from a variety of institutions has prompted many large-scale, multi-site studies in recent years. Studies incorporating data from multiple institutions often attain results more generalizable than those from single-site studies and offer improved power for studying rare outcomes or exposures. Various challenges concerning patient-level data sharing, primarily those related to data privacy, have made distributed data analysis a practical alternative to analyzing centralized data in multi-site studies. Under a distributed data analysis framework, patient-level data are not shared across institutions. Instead, aggregated data are shared and communicated to a coordinating site to obtain analysis results. While methods for performing distributed analyses are increasingly available, distributed methods for analyzing binary and count outcomes remain limited. In this work, we propose two distributed regression algorithms for modeling count outcomes in multi-site studies. The first algorithm uses distributed quasi-Poisson regression to model counts while accounting for institution-specific heterogeneity in the outcome. The second uses distributed hurdle regression to model counts subject to zero-inflation. Both algorithms are communication efficient and highly accurate, requiring at most two or three rounds of communication among participating institutions and achieving results close to those obtained using pooled regression of all patient-level data, a method usable only if data are centralized. We evaluate the performance of each method through simulations and applications to real-world clinical research networks.
Finally, we illustrate a novel application of a distributed generalized linear mixed modeling algorithm with binary outcomes to study the effect of admitting hospital on racial disparities in mortality for patients hospitalized with COVID-19 via counterfactual modeling.
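
The distributed framework described in the abstract, in which sites share only aggregated quantities with a coordinating site, can be illustrated with a minimal sketch. This is not the communication-efficient algorithm proposed in the dissertation: it is a plain iterative Newton scheme for Poisson regression that needs many rounds of communication, shown only to make the idea of aggregate sharing concrete. The function names, the simulated data, and the three-site setup are all assumptions made for illustration.

```python
import numpy as np

def site_summaries(X, y, beta):
    """Run locally at one site; only these aggregates leave the site."""
    mu = np.exp(X @ beta)                    # fitted Poisson means (log link)
    gradient = X.T @ (y - mu)                # score of the Poisson log-likelihood
    information = X.T @ (X * mu[:, None])    # Fisher information matrix
    return gradient, information

def distributed_poisson(sites, n_iter=25, tol=1e-10):
    """Coordinating site: sum site aggregates and take Newton steps."""
    p = sites[0][0].shape[1]
    beta = np.zeros(p)
    for _ in range(n_iter):
        # One round of communication: each site sends a p-vector and a p x p matrix.
        summaries = [site_summaries(X, y, beta) for X, y in sites]
        gradient = sum(g for g, _ in summaries)
        information = sum(h for _, h in summaries)
        step = np.linalg.solve(information, gradient)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Simulated data for three hypothetical sites (illustrative values only).
rng = np.random.default_rng(0)
true_beta = np.array([0.5, -0.3])
sites = []
for n in (200, 300, 500):
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    y = rng.poisson(np.exp(X @ true_beta))
    sites.append((X, y))

beta_distributed = distributed_poisson(sites)

# Reference fit on the stacked data, possible only if data were centralized.
X_all = np.vstack([X for X, _ in sites])
y_all = np.concatenate([y for _, y in sites])
beta_pooled = distributed_poisson([(X_all, y_all)])
```

Because the Poisson score and information are sums over patients, adding the site-level aggregates reproduces the pooled fit up to floating-point error. The cost is one round of communication per Newton iteration, which is the burden that methods requiring only two or three rounds are designed to avoid.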


Available to all on Friday, August 09, 2024

