Empowering RCT with Multi-site Multi-source RWD: a Statistical Learning Perspective

Zhang, Dazheng

Files

Zhang_upenngdas_0175C_16959.pdf (4.86 MB)

Degree type

PhD

Graduate group

Epidemiology and Biostatistics

Discipline

Biology

Copyright date

01/01/2025

Permalink

https://repository.upenn.edu/handle/20.500.14332/61317

Author

Zhang, Dazheng

Abstract

Randomized controlled trials (RCTs) are recognized for their ability to minimize confounding through randomization, thereby supporting robust causal inference in clinical research. However, the stringent protocols and limited sample sizes typical of RCTs can restrict statistical efficiency and the scope of inference. In contrast, real-world data (RWD), such as electronic health records (EHRs), capture broader and more representative populations but introduce challenges including bias, missing data, and the lack of randomization. Integrating these data sources effectively can enhance both statistical efficiency and external validity, yet it requires overcoming issues such as heterogeneity across clinical sites, bias from unmeasured confounding, and model shifts between trial and real-world populations. The overarching goal of this dissertation is to develop and validate robust analytical frameworks that address the limitations of using RWD in federated learning and in its integration with RCTs. This work is organized around three interconnected aims. The first aim focuses on the development of one-shot distributed algorithms designed to analyze competing risks data across decentralized distributed research networks, explicitly addressing data heterogeneity to generate high-quality real-world evidence that informs clinical decision-making. For settings where events are extremely rare, this aim further introduces a distributed Firth-corrected inference method that uses a two-round communication strategy to improve estimation accuracy. The second aim addresses the challenge of time-varying effects of unmeasured confounding by proposing a novel negative control-calibrated difference-in-differences (NC-DiD) methodology. By integrating negative control outcomes, this approach systematically detects, quantifies, and corrects for biases, thereby enhancing the validity of causal inferences drawn from EHR data. Its application to evaluating racial and ethnic differences in post-acute sequelae of COVID-19 among pediatric populations demonstrates significant improvements in bias correction and estimation reliability compared to conventional methods. The third aim combines RCTs with RWD to enhance statistical efficiency and accelerate clinical trials. This is achieved through the development of the Negative Control-Calibrated Digital Twin framework, which constructs individualized digital twins from RWD, calibrated via negative control outcomes, to mitigate model shift bias between trial and real-world populations. Empirical validation using neuroimaging data from the SPRINT-MIND trial alongside data from the iSTAGING Consortium confirms significant enhancements in statistical efficiency and robustness. Overall, the contributions of this dissertation provide novel methodological insights and practical tools that advance the integration of RCTs and RWD, ultimately improving causal inference and the external validity of clinical research.

Advisor

Chen, Yong

Date of degree

2025

Collection

Dissertations and Theses