Date of Award
Doctor of Philosophy (PhD)
Computer and Information Science
Lyle H. Ungar
James C. Gee
Spectral learning algorithms are becoming increasingly popular in data-rich domains, driven in part by recent advances in large-scale randomized SVD and in spectral estimation of Hidden Markov Models. Extensions of these methods lead to statistical estimation algorithms that are not only fast, scalable, and useful on real data sets, but also provably correct.
Following this line of research, we make two contributions. First, we propose a set of spectral algorithms for text analysis and natural language processing. In particular, we propose fast and scalable spectral algorithms for learning word embeddings: low-dimensional real vectors (called Eigenwords) that capture the “meaning” of words from their context. Second, we show how similar spectral methods can be applied to analyzing brain images.
State-of-the-art approaches to learning word embeddings are slow to train or lack theoretical grounding; we propose three spectral algorithms that overcome these limitations. All three algorithms harness the multi-view nature of text data, i.e., the left and right context of each word, and share three characteristics:
1. They are fast to train and are scalable.
2. They have strong theoretical properties.
3. They can induce context-specific embeddings, i.e., different embeddings for the word “bank” in “river bank” and in “Bank of America”.
They also have lower sample complexity and hence higher statistical power for rare words. We provide theory that establishes relationships between these algorithms and optimality criteria for the estimates they provide. We also perform a thorough qualitative and quantitative evaluation of Eigenwords and demonstrate their superior performance over state-of-the-art approaches.
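As an illustration of the multi-view idea described above, the following is a minimal sketch, not the algorithms proposed in the dissertation. It assumes a one-word window on each side, approximates the CCA whitening step by diagonal count scaling, and uses a dense SVD in place of a randomized one; the function name `spectral_embeddings` and the toy sentence are hypothetical.

```python
import numpy as np

def spectral_embeddings(tokens, k=2):
    """Toy spectral word embeddings from left/right context counts.

    Assumption: a simplified, CCA-style sketch with diagonal whitening,
    not the exact Eigenwords procedure.
    """
    vocab = sorted(set(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    v = len(vocab)

    # Word-by-context counts: columns [0, v) hold the left neighbor,
    # columns [v, 2v) hold the right neighbor.
    cooc = np.zeros((v, 2 * v))
    for pos, word in enumerate(tokens):
        i = index[word]
        if pos > 0:
            cooc[i, index[tokens[pos - 1]]] += 1.0
        if pos + 1 < len(tokens):
            cooc[i, v + index[tokens[pos + 1]]] += 1.0

    # Diagonal scaling by word and context frequencies.
    word_counts = np.bincount([index[w] for w in tokens], minlength=v).astype(float)
    ctx_counts = np.maximum(cooc.sum(axis=0), 1.0)  # guard unseen contexts
    scaled = cooc / np.sqrt(word_counts)[:, None] / np.sqrt(ctx_counts)[None, :]

    # The top-k left singular vectors give one k-dimensional vector per word.
    U, s, _ = np.linalg.svd(scaled, full_matrices=False)
    return vocab, U[:, :k], s[:k]

tokens = "the river bank flooded the bank of america opened".split()
vocab, emb, sv = spectral_embeddings(tokens, k=2)
for word, vec in zip(vocab, emb):
    print(word, np.round(vec, 3))
```

On a corpus of realistic size the dense SVD would presumably be replaced by a randomized one, which is the scalability point made in the opening paragraph.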
Next, we turn to the task of using spectral learning methods for brain imaging data.
Methods like Sparse Principal Component Analysis (SPCA), Non-negative Matrix Factorization (NMF), and Independent Component Analysis (ICA) have been used to obtain state-of-the-art accuracies in a variety of machine learning problems. Their use in brain imaging, though increasing, is limited because they are applied as out-of-the-box techniques and are seldom tailored to the domain-specific constraints and knowledge of medical imaging, which makes the results difficult to interpret.
To address these shortcomings, we propose Eigenanatomy (EANAT), a general framework for sparse matrix factorization. Its goal is to statistically learn the boundaries of and connections between brain regions by weighing both the data and prior neuroanatomical knowledge.
Although EANAT incorporates some neuroanatomical prior knowledge in the form of connectedness and smoothness constraints, it can still be difficult for clinicians to interpret the results in specific domains where network-specific hypotheses exist. We thus extend EANAT and present a novel framework for prior-constrained sparse decomposition of matrices derived from brain imaging data, called Prior Based Eigenanatomy (p-Eigen). We formulate our solution as a prior-constrained, ℓ1-penalized (sparse) principal component analysis. Experimental evaluation confirms that p-Eigen extracts biologically relevant, patient-specific functional parcels and that it significantly aids classification of Mild Cognitive Impairment when compared to state-of-the-art competing approaches.
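To give a rough sense of what an ℓ1-penalized principal component guided by a prior can look like, the sketch below runs a soft-thresholded power iteration started from a prior region indicator. It is an assumption-laden illustration, not the p-Eigen optimizer: the prior enters only as the starting point, and the names `prior_sparse_component`, `lam`, and the synthetic data are hypothetical.

```python
import numpy as np

def soft_threshold(x, lam):
    """Shrink entries toward zero by lam (proximal step for an l1 penalty)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prior_sparse_component(X, prior, lam=0.1, n_iter=100):
    """One sparse component of X (subjects x voxels), initialized from a prior.

    Assumption: a thresholded power iteration, used here only as a sketch
    of sparse PCA; the prior serves as the initial direction.
    """
    Xc = X - X.mean(axis=0)              # center each voxel
    v = prior / np.linalg.norm(prior)    # start from the prior region
    for _ in range(n_iter):
        g = Xc.T @ (Xc @ v)              # power step on the covariance
        g_norm = np.linalg.norm(g)
        if g_norm == 0.0:
            break
        v_new = soft_threshold(g / g_norm, lam)
        v_norm = np.linalg.norm(v_new)
        if v_norm == 0.0:                # penalty removed every entry
            break
        v = v_new / v_norm
    return v

# Synthetic example: 5 of 20 "voxels" share a latent signal.
rng = np.random.default_rng(0)
z = rng.normal(size=(200, 1))
X = rng.normal(size=(200, 20))
X[:, :5] += 3.0 * z
prior = np.zeros(20)
prior[:5] = 1.0
v = prior_sparse_component(X, prior, lam=0.1)
print(np.nonzero(v)[0])
```

The nonzero entries of `v` mark the voxels assigned to the component; a larger `lam` yields a sparser, more localized region.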
Dhillon, Paramveer, "Advances in Spectral Learning with Applications to Text Analysis and Brain Imaging" (2014). Publicly Accessible Penn Dissertations. 1257.