Advances in spectral learning with applications to text analysis and brain imaging

Paramveer Singh Dhillon, University of Pennsylvania


Spectral learning algorithms are becoming increasingly popular in data-rich domains, driven in part by recent advances in large scale randomized SVD, and in spectral estimation of Hidden Markov Models. Extensions of these methods lead to statistical estimation algorithms which are not only fast, scalable, and useful on real data sets, but are also provably correct. Following this line of research, we make two contributions. First, we propose a set of spectral algorithms for text analysis and natural language processing. In particular, we propose fast and scalable spectral algorithms for learning word embeddings -- low dimensional real vectors (called Eigenwords) that capture the “meaning” of words from their context. Second, we show how similar spectral methods can be applied to analyzing brain images. State-of-the-art approaches to learning word embeddings are slow to train or lack theoretical grounding; We propose three spectral algorithms that overcome these limitations. All three algorithms harness the multi-view nature of text data i.e. the left and right context of each word, and share three characteristics: 1). They are fast to train and are scalable. 2). They have strong theoretical properties. 3). They can induce context-specific embeddings i.e. different embedding for “river bank” or “Bank of America”. They also have lower sample complexity and hence higher statistical power for rare words. We provide theory which establishes relationships between these algorithms and optimality criteria for the estimates they provide. We also perform thorough qualitative and quantitative evaluation of Eigenwords and demonstrate their superior performance over state-of-the-art approaches. Next, we turn to the task of using spectral learning methods for brain imaging data. Methods like Sparse Principal Component Analysis (SPCA), Non-negative Matrix Factorization (NMF) and Independent Component Analysis (ICA) have been used to obtain state-of-the-art accuracies in a variety of problems in machine learning. However, their usage in brain imaging, though increasing, is limited by the fact that they are used as out-of-the-box techniques and are seldom tailored to the domain specific constraints and knowledge pertaining to medical imaging, which leads to difficulties in interpretation of results. In order to address the above shortcomings, we propose Eigenanatomy (EANAT), a general framework for sparse matrix factorization. Its goal is to statistically learn the boundaries of and connections between brain regions by weighing both the data and prior neuroanatomical knowledge. Although EANAT incorporates some neuroanatomical prior knowledge in the form of connectedness and smoothness constraints, it can still be difficult for clinicians to interpret the results in specific domains where network-specific hypotheses exist. We thus extend EANAT and present a novel framework for prior-constrained sparse decomposition of matrices derived from brain imaging data, called Prior Based Eigenanatomy (p-Eigen). We formulate our solution in terms of a prior-constrained ℓ1 penalized (sparse) principal component analysis. Experimental evaluation confirms that p-Eigen extracts biologically-relevant, patient-specific functional parcels and that it significantly aids classification of Mild Cognitive Impairment when compared to state-of-the-art competing approaches.

Subject Area

Statistics|Medical imaging|Computer science

Recommended Citation

Dhillon, Paramveer Singh, "Advances in spectral learning with applications to text analysis and brain imaging" (2014). Dissertations available from ProQuest. AAI3634131.