Empowering RCT with Multi-site Multi-source RWD: a Statistical Learning Perspective
Abstract
Randomized controlled trials (RCTs) minimize confounding through randomization and therefore support robust causal inference in clinical research. However, the stringent protocols and limited sample sizes typical of RCTs can restrict statistical efficiency and the scope of inference. In contrast, real-world data (RWD), such as electronic health records (EHRs), capture broader and more representative populations but introduce challenges including bias, missing data, and the lack of randomization. Integrating these data sources effectively can enhance both statistical efficiency and external validity, yet it requires overcoming heterogeneity across clinical sites, bias from unmeasured confounding, and model shift between trial and real-world populations. The overarching goal of this dissertation is to develop and validate robust analytical frameworks that address the limitations of using RWD in federated learning and in its integration with RCTs. This work is organized around three interconnected aims. The first aim develops one-shot distributed algorithms for analyzing competing risks data across distributed research networks, explicitly addressing data heterogeneity to generate high-quality real-world evidence that informs clinical decision-making. For settings where events are extremely rare, this aim further introduces a distributed Firth-corrected inference method that uses a two-round communication strategy to improve estimation accuracy. The second aim addresses the time-varying effects of unmeasured confounding by proposing a novel negative control-calibrated difference-in-differences (NC-DiD) methodology. By integrating negative control outcomes, this approach systematically detects, quantifies, and corrects for bias, thereby enhancing the validity of causal inferences drawn from EHR data. Its application to evaluating racial and ethnic differences in post-acute sequelae of COVID-19 among pediatric populations demonstrates significant improvements in bias correction and estimation reliability compared to conventional methods. The third aim combines RCTs with RWD to enhance statistical efficiency and accelerate clinical trials. This is achieved through the Negative Control-Calibrated Digital Twin framework, which constructs individualized digital twins from RWD, calibrated via negative control outcomes, to mitigate model shift bias between trial and real-world populations. Empirical validation using neuroimaging data from the SPRINT-MIND trial alongside data from the iSTAGING Consortium confirms significant enhancements in statistical efficiency and robustness. Overall, this dissertation contributes novel methodological insights and practical tools that advance the integration of RCTs and RWD, ultimately improving causal inference and the external validity of clinical research.