Bayesian Nonparametric Methods For Causal Inference And Prediction

Degree type
Doctor of Philosophy (PhD)
Graduate group
Epidemiology & Biostatistics
Discipline
Subject
Bayesian Additive Regression Trees
Bayesian nonparametrics
Causal Inference
Dirichlet Process
Prediction
Biostatistics
Copyright date
2018-09-28T20:17:00-07:00
Abstract

In this thesis we present novel approaches to regression and causal inference using popular Bayesian nonparametric methods. Bayesian Additive Regression Trees (BART) is a Bayesian machine learning algorithm in which the conditional distribution is modeled as a sum of regression trees. We extend BART into a semiparametric generalized linear model framework so that a portion of the covariates are modeled nonparametrically using BART and a subset of the covariates have parametric form. This presents an attractive option for research in which only a few covariates are of scientific interest but other covariates must be controlled for. Under certain causal assumptions, this model can be used as a structural mean model. We demonstrate this method by examining the effect that initiating certain antiretroviral medications has on mortality among HIV/HCV coinfected subjects. In later chapters, we propose a joint model for a continuous longitudinal outcome and baseline covariates using penalized splines and an enriched Dirichlet process (EDP) prior. This joint model decomposes into local linear mixed models for the outcome given the covariates and marginals for the covariates. The EDP prior, which is placed on the regression parameters and on the parameters of the covariate distributions, induces clustering among subjects determined by similarity in their regression parameters and, nested within those clusters, sub-clusters based on similarity in the covariate space. When the number of covariates is large, we find improved prediction over the same model with Dirichlet process (DP) priors. Because the model clusters on regression parameters, it also serves as a functional clustering algorithm in which the number of clusters does not have to be chosen beforehand. We use the method to estimate incidence rates of diabetes when longitudinal laboratory values from electronic health records are used to augment diagnostic codes for outcome identification.
We later extend this work by applying our EDP model in a causal inference setting via the parametric g-formula. We demonstrate this using electronic health record data on subjects initiating second-generation antipsychotics.
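
For readers unfamiliar with the parametric g-formula mentioned above, the following is a minimal sketch of the idea for a single point treatment. It is not the thesis's implementation: the EDP outcome model is replaced here by an ordinary least-squares regression, and the simulated data, the function name `g_formula_ate`, and all coefficient values are hypothetical choices made only for illustration. The sketch fits an outcome model, predicts each subject's outcome with treatment set to 1 and to 0, and averages the difference over the observed covariate distribution.

```python
import numpy as np


def g_formula_ate(y, a, x):
    """Standardized estimate of E[Y^1] - E[Y^0] using a linear outcome model.

    Illustrative stand-in only: the thesis uses an EDP-based outcome model,
    whereas this sketch uses ordinary least squares.
    """
    n = len(y)
    # Outcome model: intercept, treatment, confounder, and their interaction.
    design = np.column_stack([np.ones(n), a, x, a * x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    # Predict every subject's outcome under treatment (a=1) and control (a=0).
    treated = np.column_stack([np.ones(n), np.ones(n), x, x])
    control = np.column_stack([np.ones(n), np.zeros(n), x, np.zeros(n)])
    # Average over the empirical covariate distribution (standardization).
    return float(np.mean(treated @ coef) - np.mean(control @ coef))


# Hypothetical simulated data: x confounds both treatment and outcome.
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
a = rng.binomial(1, 1.0 / (1.0 + np.exp(-x)))
y = 1.0 + 2.0 * a + 1.5 * x + rng.normal(size=n)  # true effect is 2.0

ate = g_formula_ate(y, a, x)
naive = y[a == 1].mean() - y[a == 0].mean()  # ignores confounding by x
print(f"g-formula estimate: {ate:.2f}, naive difference: {naive:.2f}")
```

In this simulation the naive difference in means is biased upward because treated subjects tend to have larger values of `x`, while the standardized estimate adjusts for it. The estimate is only valid under the usual causal assumptions (no unmeasured confounding, positivity, consistency) and a correctly specified outcome model.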

Advisor
Jason A. Roy
Date of degree
2017-01-01