Distributed algorithms and statistical inference for multi-site analyses: unfolding the complexity of heterogeneity in real-world data

Degree type
Doctor of Philosophy (PhD)
Graduate group
Epidemiology and Biostatistics
Discipline
Biology
Subject
Data Heterogeneity
Distributed Algorithms
Multi-site Analyses
Real-world Data
Copyright date
01/01/2024
Author
Tong, Jiayi
Abstract

In the era of expanding real-world data (RWD) availability from distributed research networks (DRNs), leveraging large-scale data has become essential for generating evidence on clinical questions relevant to stakeholders within the healthcare system. Answering questions about hospital choice, treatment options, medications, and related topics still poses practical challenges in analyzing RWD, such as reporting bias, confounders, and rare events. Integrating data from multiple clinical sites within DRNs is particularly challenging because of data privacy concerns, patient heterogeneity (also known as case mix or patient mix), and communication cost. In this work, centered on generating real-world clinical evidence from DRN data, our objective is to develop several distributed learning frameworks. These frameworks are designed to support comparative effectiveness research, health system performance assessment, and evaluation of site-of-care-related racial disparities, while addressing the complexities of patient heterogeneity in real-world data settings.

In our first study, we accounted for the heterogeneity of event rates across sites and proposed a distributed conditional logistic regression (dCLR) algorithm. By using pairwise conditioning to eliminate site-specific parameters, this approach accounts for between-site heterogeneity and yields more robust estimates of the regression coefficients.

Building on this, we aimed to develop an end-to-end framework that supports specific downstream tasks. In our second body of work, we introduced the Distributed Hospital Comparer framework for electronic health record (EHR)-based hospital profiling with distributed data, intended to benefit stakeholders in healthcare systems (e.g., patients, providers, policymakers, and payers). This framework consists of two major modules: a distributed learning module, dGEM (decentralized algorithm for the generalized linear mixed effects model), which avoids the need to share individual patient-level data, and a counterfactual modeling module, which addresses case-mix variation among patients across hospitals. We demonstrated the validity and applicability of the framework using a centralized dataset from the U.S. Organ Procurement and Transplantation Network (OPTN) encompassing 149 centers. We then applied the framework to a global study involving 12 sites across three countries within the Observational Health Data Sciences and Informatics (OHDSI) network to investigate variation in hospital performance, measured by COVID-19 mortality, across two pandemic periods.

In the third part of our work, motivated by racial disparities in kidney transplant access and post-transplant outcomes between Non-Hispanic Black (NHB) and Non-Hispanic White (NHW) patients in the United States, we studied the site of care, a key factor contributing to these disparities. We developed a federated learning framework, dGEM-disparity (decentralized algorithm for generalized linear mixed effect model for disparity evaluation), to assess site-of-care-related racial disparities. This framework consists of two modules: the first provides accurately estimated common effects and calibrated hospital-specific effects while requiring only aggregated data from each center, and the second adopts a counterfactual modeling approach to assess whether graft failure rates would differ if NHB patients were admitted to transplant centers in the same distribution as NHW patients. The framework has been applied to United States Renal Data System (USRDS) data from 39,043 adult patients across 73 transplant centers.
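
The pairwise conditioning idea mentioned for dCLR can be illustrated with a small sketch. It is not the dissertation's implementation: it assumes a logistic model with one intercept per site, uses a hypothetical helper name (`pairwise_loglik`) and toy simulated data, and omits how sites would exchange aggregated quantities. Under that assumed model, conditioning on exactly one event within a same-site pair of patients gives a probability that depends only on the difference of their covariates, so the site intercept drops out.

```python
import numpy as np


def pairwise_loglik(beta, sites):
    """Pairwise conditional log-likelihood, summed over sites.

    Hypothetical helper for illustration only.
    sites: list of (X, y), with X of shape (n, p) and binary y of shape (n,).
    """
    total = 0.0
    for X, y in sites:
        cases = X[y == 1]      # covariates of patients with the event
        controls = X[y == 0]   # covariates of patients without the event
        # Every within-site case/control difference: (n_cases, n_controls, p).
        diff = cases[:, None, :] - controls[None, :, :]
        # Linear predictor of the difference; the site intercept has cancelled.
        eta = diff @ beta
        # log sigmoid(eta) = -log(1 + exp(-eta)), computed stably.
        total += -np.logaddexp(0.0, -eta).sum()
    return total


# Toy data for two sites (assumed values, not from the dissertation).
rng = np.random.default_rng(0)
beta = np.array([0.5, -1.0])
X1, y1 = rng.normal(size=(6, 2)), np.array([1, 0, 0, 1, 0, 0])
X2, y2 = rng.normal(size=(5, 2)), np.array([0, 1, 1, 0, 1])
sites = [(X1, y1), (X2, y2)]

base = pairwise_loglik(beta, sites)

# Shifting every covariate vector within a site by a constant changes that
# site's baseline log-odds but leaves the pairwise likelihood unchanged.
shifted = pairwise_loglik(beta, [(X1 + 3.0, y1), (X2 - 2.0, y2)])
```

Because the objective is a sum of per-site terms, each site could in principle evaluate its own contribution locally and report only that summary, which is the sense in which the sketch relates to a distributed setting.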

Advisor
Chen, Yong
Date of degree
2024