Date of Award


Degree Type


Degree Name

Doctor of Philosophy (PhD)

Graduate Group

Epidemiology & Biostatistics

First Advisor

Qi Long


Support vector machines (SVMs) are a popular classification method for the analysis of high-dimensional data such as genomic data. Recently, new SVM methods have been developed to achieve variable selection through either frequentist regularization or Bayesian shrinkage. The Bayesian framework provides a probabilistic interpretation of the SVM and allows direct uncertainty quantification. In this dissertation, we develop four knowledge-guided SVM methods for the analysis of high-dimensional data.

In Chapter 1, I first review the theory of SVM and existing methods for incorporating prior knowledge, represented by graphs, into SVM. Second, I review the terminology of variable selection and the limitations of existing methods for SVM variable selection. Last, I introduce some Bayesian variable selection techniques as well as Markov chain Monte Carlo (MCMC) algorithms.

In Chapter 2, we develop a new Bayesian SVM method that enables variable selection guided by structural information among predictors, e.g., biological pathways among genes. This method uses a spike-and-slab prior for feature selection combined with an Ising prior for incorporating structural information. The performance of the proposed method is evaluated against existing SVM methods in terms of prediction and feature selection in extensive simulations. Furthermore, the proposed method is illustrated in an analysis of genomic data from a cancer study, demonstrating its advantage in generating biologically meaningful results and identifying potentially important features.
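As a rough illustration of how a spike-and-slab prior and an Ising prior can work together, the sketch below scores a vector of binary inclusion indicators against a feature graph and then draws coefficients given those indicators. It is a minimal sketch under stated assumptions, not the dissertation's model: the parameterization (a sparsity weight `a`, a smoothness weight `b`, a Gaussian slab with standard deviation `slab_sd`) and the function names are hypothetical choices made for this example.

```python
import numpy as np


def ising_log_prior(indicators, adjacency, a=-2.0, b=0.5):
    """Unnormalized log-probability of binary inclusion indicators.

    Assumed form (illustrative only):
        log p(gamma) = a * sum_j gamma_j + b * sum_{j~k} gamma_j * gamma_k + const
    A negative `a` favors sparsity; a positive `b` rewards selecting
    features that are neighbors in the graph.
    """
    g = np.asarray(indicators, dtype=float)
    A = np.asarray(adjacency, dtype=float)
    # The adjacency matrix is symmetric with a zero diagonal, so the
    # quadratic form counts each edge twice; halve it to count once.
    edge_term = 0.5 * (g @ A @ g)
    return a * g.sum() + b * edge_term


def draw_spike_slab(indicators, slab_sd=1.0, rng=None):
    """Draw coefficients: exactly zero (spike) where the indicator is 0,
    Gaussian (slab) where the indicator is 1."""
    rng = np.random.default_rng() if rng is None else rng
    g = np.asarray(indicators, dtype=float)
    return g * rng.normal(0.0, slab_sd, size=g.shape)


# Hypothetical path graph on four features: 0 - 1 - 2 - 3
A = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
neighbors = ising_log_prior([1, 1, 0, 0], A)   # two connected features
scattered = ising_log_prior([1, 0, 1, 0], A)   # two unconnected features
```

With these assumed weights, selecting two neighboring features receives a higher prior score than selecting two unconnected ones, which is the sense in which the graph guides selection.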

The model developed in Chapter 2 might suffer from the issue of phase transition (Li and Zhang, 2010) when the number of variables becomes extremely large. In Chapter 3, we propose another Bayesian SVM method that assigns an adaptive structured shrinkage prior to the coefficients and incorporates the graph information via hyper-priors imposed on the precision matrix of the log-transformed shrinkage parameters. This method is shown to outperform the method in Chapter 2 in both simulations and real data analysis.

In Chapter 4, to relax the linearity assumption in Chapters 2 and 3, we develop a novel knowledge-guided Bayesian non-linear SVM. The proposed method uses a diagonal matrix, with ones representing selected features and zeros representing unselected features, and combines it with the Ising prior to perform feature selection. The performance of our method is evaluated and compared with several penalized linear SVMs and the standard kernel SVM in terms of prediction and feature selection in extensive simulation settings. In addition, analyses of genomic data from a cancer study show that our method yields a more accurate prediction model for patient survival and reveals biologically more meaningful results than the existing methods.
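One way to picture the diagonal selection matrix is to apply it to the inputs before evaluating a kernel, so that unselected features cannot influence the similarity between two samples. The sketch below does this for a Gaussian (RBF) kernel. How the matrix actually enters the dissertation's kernel is not stated in this abstract, so the kernel choice, the `length_scale` parameter, and the function name are all assumptions made for illustration.

```python
import numpy as np


def masked_rbf_kernel(X, Z, selected, length_scale=1.0):
    """Gaussian kernel computed on selected features only.

    `selected` is a 0/1 vector holding the diagonal of the selection
    matrix. Multiplying each row by it is equivalent to X @ diag(selected).
    Assumed form (illustrative only):
        K(x, z) = exp(-||diag(selected) (x - z)||^2 / (2 * length_scale^2))
    """
    mask = np.asarray(selected, dtype=float)
    Xm = np.asarray(X, dtype=float) * mask
    Zm = np.asarray(Z, dtype=float) * mask
    # Pairwise squared distances via ||x||^2 + ||z||^2 - 2 x.z
    sq_dist = (
        (Xm ** 2).sum(axis=1)[:, None]
        + (Zm ** 2).sum(axis=1)[None, :]
        - 2.0 * Xm @ Zm.T
    )
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negative round-off
    return np.exp(-sq_dist / (2.0 * length_scale ** 2))


# Two samples that differ only in the second feature.
X = np.array([[0.0, 0.0],
              [0.0, 5.0]])
K_first_only = masked_rbf_kernel(X, X, selected=[1, 0])
K_second_only = masked_rbf_kernel(X, X, selected=[0, 1])
```

When only the first feature is selected, the two samples look identical to the kernel; when only the second is selected, they look far apart. A sampler over the 0/1 diagonal, for example under an Ising prior, would therefore change the fitted decision function through the kernel matrix.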

In Chapter 5, we extend the work of Chapter 4 and use a joint model to identify the relevant features and learn the structural information among them simultaneously. This model does not require the structural information among the predictors to be known, which makes it more powerful when the prior knowledge about pathways is limited or inaccurate. We demonstrate in simulations that our method outperforms the method developed in Chapter 4 when the prior knowledge is only partially true or inaccurate, and we illustrate the proposed model with an application to a glioblastoma data set.

In Chapter 6, we propose directions for future work, including extending our methods to more general types of outcomes such as categorical or continuous variables.

Included in

Biostatistics Commons