Privacy-Preserving Distributed Regression Algorithms For Analysis Of Multi-Site Real-World Data

Mackenzie John Edmondson, University of Pennsylvania


Real-world data, including electronic health records and administrative claims data, are widely used in modern healthcare research to generate real-world evidence for improving patient care. The widespread availability of observational data from a variety of institutions has prompted many large-scale, multi-site studies in recent years. Studies incorporating data from multiple institutions often attain results more generalizable than those from single-site studies and offer improved power for studying rare outcomes or exposures. Various challenges concerning patient-level data sharing, primarily those related to data privacy, have made distributed data analysis a practical alternative to analyzing centralized data in multi-site studies. Under a distributed data analysis framework, patient-level data are not shared across institutions. Instead, aggregated data are communicated to a coordinating site to obtain analysis results. While methods for performing distributed analyses are increasingly available, methods for analyzing binary and count outcomes remain limited. In this work, we propose two distributed regression algorithms for modeling count outcomes in multi-site studies. The first algorithm uses distributed quasi-Poisson regression to model counts while accounting for institution-specific heterogeneity in the outcome. The second uses distributed hurdle regression to model counts subject to zero-inflation. Both algorithms are communication-efficient and highly accurate, requiring at most two or three rounds of communication among participating institutions and achieving results close to those obtained using pooled regression of all patient-level data, a method usable only if data are centralized. We evaluate the performance of each method through simulations and applications to real-world clinical research networks.
Finally, we illustrate a novel application of a distributed generalized linear mixed modeling algorithm with binary outcomes to study the effect of admitting hospital on racial disparities in mortality for patients hospitalized with COVID-19 via counterfactual modeling.