Empowering RCT with Multi-site Multi-source RWD: a Statistical Learning Perspective

Degree type
PhD
Graduate group
Epidemiology and Biostatistics
Discipline
Biology
Copyright date
01/01/2025
Author
Zhang, Dazheng
Abstract

Randomized controlled trials (RCTs) minimize confounding through randomization and therefore support robust causal inference in clinical research. However, the stringent protocols and limited sample sizes typical of RCTs can restrict statistical efficiency and the scope of inference. In contrast, real-world data (RWD), such as electronic health records (EHRs), capture broader and more representative populations but introduce challenges including bias, missing data, and the lack of randomization. Integrating these data sources effectively can enhance both statistical efficiency and external validity, yet it requires overcoming heterogeneity across clinical sites, bias from unmeasured confounding, and model shift between trial and real-world populations. The overarching goal of this dissertation is to develop and validate robust analytical frameworks that address the limitations of RWD, both in federated learning and in its integration with RCTs. This work is organized around three interconnected aims. The first aim develops one-shot distributed algorithms for analyzing competing risks data across distributed research networks, explicitly accounting for data heterogeneity to generate high-quality real-world evidence that informs clinical decision-making. For settings where events are extremely rare, this aim further introduces a distributed Firth-corrected inference method that uses a two-round communication strategy to improve estimation accuracy. The second aim addresses time-varying effects of unmeasured confounding by proposing a novel negative control-calibrated difference-in-differences (NC-DiD) methodology. By integrating negative control outcomes, this approach systematically detects, quantifies, and corrects for bias, thereby enhancing the validity of causal inferences drawn from EHR data. Its application to evaluating racial and ethnic differences in post-acute sequelae of COVID-19 among pediatric populations demonstrates significant improvements in bias correction and estimation reliability compared to conventional methods. The third aim combines RCTs with RWD to enhance statistical efficiency and accelerate clinical trials. This is achieved through the development of the Negative Control-Calibrated Digital Twin framework, which constructs individualized digital twins from RWD, calibrated via negative control outcomes, to mitigate model shift bias between trial and real-world populations. Empirical validation using neuroimaging data from the SPRINT-MIND trial alongside data from the iSTAGING Consortium confirms significant enhancements in statistical efficiency and robustness. Overall, this dissertation contributes novel methodological insights and practical tools that advance the integration of RCTs and RWD, ultimately improving causal inference and the external validity of clinical research.

Advisor
Chen, Yong, YC
Date of degree
2025