Ungar, Lyle H.
Now showing 1 - 10 of 23
Publication: CoRAL: Predicting Non-Coding RNAs from Small RNA-Sequencing Data (2013-08-01)
Leung, Yuk Y; Ryvkin, Paul; Ungar, Lyle H; Gregory, Brian D; Wang, Li-San

The surprising observation that virtually the entire human genome is transcribed means we know little about the function of many emerging classes of RNAs, except their astounding diversity. Traditional RNA function prediction methods rely on sequence or alignment information, which limits their ability to classify the various collections of non-coding RNAs (ncRNAs). To address this, we developed Classification of RNAs by Analysis of Length (CoRAL), a machine learning approach for the classification of RNA molecules. CoRAL uses biologically interpretable features, including fragment length and cleavage specificity, to distinguish between different ncRNA populations. We evaluated CoRAL using genome-wide small RNA sequencing data sets from four human tissue types and were able to classify six different types of RNA with ∼80% cross-validation accuracy. Analysis by CoRAL revealed that microRNAs, small nucleolar RNAs, and transposon-derived RNAs are highly discernible and consistent across all human tissue types assessed, whereas long intergenic ncRNAs, small cytoplasmic RNAs, and small nuclear RNAs show less consistent patterns. The ability to reliably annotate loci across tissue types demonstrates the potential of CoRAL to characterize ncRNAs using small RNA sequencing data in less well-characterized organisms.

Publication: Streamwise Feature Selection (2006-09-01)
Stine, Robert A; Ungar, Lyle H

In streamwise feature selection, new features are sequentially considered for addition to a predictive model. When the space of potential features is large, streamwise feature selection offers many advantages over traditional feature selection methods, which assume that all features are known in advance.
Features can be generated dynamically, focusing the search for new features on promising subspaces, and overfitting can be controlled by dynamically adjusting the threshold for adding features to the model. In contrast to traditional forward feature selection algorithms such as stepwise regression, in which at each step all possible features are evaluated and the best one is selected, streamwise feature selection evaluates each feature only once, when it is generated. We describe information-investing and α-investing, two adaptive complexity penalty methods for streamwise feature selection that dynamically adjust the threshold on the error reduction required for adding a new feature. These two methods give false discovery rate style guarantees against overfitting. They differ from standard penalty methods such as AIC, BIC, and RIC, which always drastically over- or under-fit in the limit of infinite numbers of non-predictive features. Empirical results show that streamwise regression is competitive with (on small data sets) and superior to (on large data sets) much more compute-intensive feature selection methods such as stepwise regression, and allows feature selection on problems with millions of potential features.

Publication: Towards Structural Logistic Regression: Combining Relational and Statistical Learning (2002-07-23)
Ungar, Lyle H; Lawrence, Steve; Pennock, David M

Inductive logic programming (ILP) techniques are useful for analyzing data in multi-table relational databases. Learned rules can potentially discover relationships that are not obvious in "flattened" data. Statistical learners, on the other hand, are generally not constructed to search relational data; they expect to be presented with a single table containing a set of feature candidates. However, statistical learners often yield more accurate models than the logical forms of ILP, and can better handle certain types of data, such as counts.
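As an illustration of the Streamwise Feature Selection abstract above, here is a minimal Python sketch of an α-investing-style selection rule. It is a simplified illustration, not the paper's exact procedure: the wealth schedule, payout value, and the assumption that each arriving feature comes with a p-value are all simplifications made here.

```python
def alpha_investing(p_values, w0=0.5, payout=0.5):
    """Sketch of alpha-investing for streamwise feature selection.

    p_values: p-values for candidate features, in arrival order
    (e.g., from a test of the error reduction a feature gives).
    Returns the indices of accepted features. The wealth updates
    below are simplified relative to the paper.
    """
    wealth = w0
    selected = []
    for i, p in enumerate(p_values, start=1):
        if wealth <= 0:            # out of wealth: stop testing
            break
        alpha = wealth / (2 * i)   # bet a fraction of current wealth
        if p < alpha:              # feature looks predictive: accept it
            selected.append(i - 1)
            wealth += payout       # earn wealth back on each acceptance
        else:
            wealth -= alpha        # pay for the failed test
    return selected
```

Because the threshold shrinks as wealth is spent and grows with each accepted feature, the procedure can keep testing an effectively unbounded stream of candidates while controlling overfitting, which is the property the abstract contrasts with AIC/BIC/RIC-style fixed penalties.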
We propose a new approach which integrates structure navigation from ILP with regression modeling. Our approach propositionalizes the first-order rules at each step of ILP's relational structure search, generating features for potential inclusion in a regression model. Ideally, feature generation by ILP and feature selection by stepwise regression should be integrated into a single loop. Preliminary results for scientific literature classification are presented using a relational form of the data extracted by ResearchIndex (formerly CiteSeer). We use FOIL and logistic regression as our ILP and statistical components (decoupled at this stage). Word counts and citation-based features learned with FOIL are modeled together by logistic regression. The combination often significantly improves performance when high-precision classification is desired.

Publication: PennAspect: Two-Way Aspect Model Implementation (2001-01-01)
Schein, Andrew I; Ungar, Lyle H

The two-way aspect model is a latent class statistical mixture model for performing soft clustering of co-occurrence data observations. It acts on data such as document/word pairs (words occurring in documents) or movie/people pairs (people seeing certain movies) to produce an estimate of their joint distribution. This document describes our software implementation of the aspect model, available under the GNU Public License (included with the distribution). We call this package PennAspect. The distribution is packaged as Java source and class files. The software comes with no guarantees of any kind. We welcome user feedback and comments. To download PennAspect, visit: http://www.cis.upenn.edu/datamining/software_dist/PennAspect/index.html.

Publication: A-Optimality for Active Learning of Logistic Regression Classifiers (2004-01-01)
Schein, Andrew I; Ungar, Lyle

Over the last decade there has been growing interest in pool-based active learning techniques, where instead of receiving an i.i.d.
sample from a pool of unlabeled data, a learner may take an active role in selecting examples from the pool. Queries to an oracle (a human annotator in most applications) provide label information for the selected observations, but at a cost. The challenge is to end up with a model that provides the best possible generalization error at the least cost. Popular methods such as uncertainty sampling often work well, but sometimes fail badly. We take the A-optimality criterion used in optimal experimental design and extend it so that it can be used for pool-based active learning of logistic regression classifiers. A-optimality has attractive theoretical properties, and empirical evaluation confirms that it offers a more robust approach to active learning for logistic regression than the alternatives.

Publication: Multi-View Learning of Word Embeddings via CCA (2011-01-01)
Dhillon, Paramveer S; Ungar, Lyle

Recently, there has been substantial interest in using large amounts of unlabeled data to learn word representations which can then be used as features in supervised classifiers for NLP tasks. However, most current approaches are slow to train, do not model the context of the word, and lack theoretical grounding. In this paper, we present a new learning method, Low Rank Multi-View Learning (LR-MVL), which uses a fast spectral method to estimate low-dimensional, context-specific word representations from unlabeled data. These representation features can then be used with any supervised learner. LR-MVL is extremely fast, gives guaranteed convergence to a global optimum, is theoretically elegant, and achieves state-of-the-art performance on named entity recognition (NER) and chunking problems.
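The spectral core of multi-view learning methods like LR-MVL is canonical correlation analysis (CCA) between two views of the same tokens. The following numpy sketch shows plain rank-k CCA via whitening plus an SVD of the cross-covariance; it is a hedged illustration of the underlying linear algebra, not the LR-MVL algorithm itself (which operates over context windows of a corpus), and the regularization constant is an assumption made here for numerical stability.

```python
import numpy as np

def cca_embed(X, Z, k, reg=1e-3):
    """Rank-k CCA between two views of the same n items.

    X: (n, d1) and Z: (n, d2) feature matrices. Returns the two
    views projected onto their top-k canonical directions.
    """
    X = X - X.mean(0)
    Z = Z - Z.mean(0)
    Cxx = X.T @ X / len(X) + reg * np.eye(X.shape[1])
    Czz = Z.T @ Z / len(Z) + reg * np.eye(Z.shape[1])
    Cxz = X.T @ Z / len(X)
    # whiten each view (Cholesky), then SVD the whitened cross-covariance
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx)).T
    Wz = np.linalg.inv(np.linalg.cholesky(Czz)).T
    U, s, Vt = np.linalg.svd(Wx.T @ Cxz @ Wz)
    A = Wx @ U[:, :k]      # canonical directions for view X
    B = Wz @ Vt.T[:, :k]   # canonical directions for view Z
    return X @ A, Z @ B
```

On two views that share a latent signal, the first projected dimensions of the two views come out highly correlated, which is the sense in which CCA extracts a shared low-dimensional representation.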
Publication: Probabilistic Models for Unified Collaborative and Content-Based Recommendation in Sparse-Data Environments (2001-08-02)
Ungar, Lyle H; Pennock, David M; Lawrence, Steve

Recommender systems leverage product and community information to target products to consumers. Researchers have developed collaborative recommenders, content-based recommenders, and a few hybrid systems. We propose a unified probabilistic framework for merging collaborative and content-based recommendations. We extend Hofmann's (1999) aspect model to incorporate three-way co-occurrence data among users, items, and item content. The relative influence of collaboration data versus content data is not imposed as an exogenous parameter, but rather emerges naturally from the given data sources. However, global probabilistic models coupled with standard EM learning algorithms tend to drastically overfit in the sparse-data situations typical of recommendation applications. We show that secondary content information can often be used to overcome sparsity. Experiments on data from the ResearchIndex library of Computer Science publications show that appropriate mixture models incorporating secondary data produce significantly better-quality recommenders than k-nearest neighbors (k-NN). Global probabilistic models also allow more general inferences than local methods like k-NN.

Publication: Patterns of Sequence Conservation in Presynaptic Neural Genes (2006-11-10)
Hadley, Dexter; Murphy, Tara; Valladares, Otto; Ungar, Lyle H; Kim, Junhyong; Bucan, Maja

Background: The neuronal synapse is a fundamental functional unit in the central nervous system of animals. Because synaptic function is evolutionarily conserved, we reasoned that functional sequences of genes and related genomic elements known to play important roles in neurotransmitter release would also be conserved.
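Both the PennAspect package and the recommendation framework above build on Hofmann's aspect model, fit with EM. As a hedged sketch, the following numpy code fits the two-way version (the recommendation paper extends this to three-way user/item/content co-occurrences); the random initialization, iteration count, and smoothing constants are illustrative assumptions, not the papers' settings.

```python
import numpy as np

def aspect_model_em(N, k, iters=50, seed=0):
    """EM for the two-way aspect model (pLSA-style mixture).

    N: (U, I) matrix of co-occurrence counts. Returns (Pz, Pu_z, Pi_z)
    such that P(u, i) is approximated by sum_z Pz[z] * Pu_z[u, z] * Pi_z[i, z].
    """
    rng = np.random.default_rng(seed)
    U, I = N.shape
    Pz = np.full(k, 1.0 / k)
    Pu_z = rng.random((U, k)); Pu_z /= Pu_z.sum(0)   # P(u | z), columns sum to 1
    Pi_z = rng.random((I, k)); Pi_z /= Pi_z.sum(0)   # P(i | z)
    for _ in range(iters):
        # E-step: responsibilities P(z | u, i), shape (U, I, k)
        R = Pz * Pu_z[:, None, :] * Pi_z[None, :, :]
        R /= R.sum(-1, keepdims=True) + 1e-12
        # M-step: reweight responsibilities by the observed counts
        W = N[:, :, None] * R
        Pz = W.sum((0, 1)); Pz /= Pz.sum()
        Pu_z = W.sum(1); Pu_z /= Pu_z.sum(0) + 1e-12
        Pi_z = W.sum(0); Pi_z /= Pi_z.sum(0) + 1e-12
    return Pz, Pu_z, Pi_z
```

Each EM iteration cannot decrease the log-likelihood of the counts, and the learned joint distribution concentrates mass on user/item blocks that co-occur, which is the "soft clustering of co-occurrence data" the PennAspect abstract describes.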
Results: Evolutionary rate analysis revealed that presynaptic proteins evolve slowly, although some members of large gene families exhibit accelerated evolutionary rates relative to other family members. Comparative sequence analysis of 46 megabases spanning 150 presynaptic genes identified more than 26,000 elements that are highly conserved in eight vertebrate species, as well as a small subset of sequences (6%) that are shared among unrelated presynaptic genes. Analysis of large gene families revealed that upstream and intronic regions of closely related family members are extremely divergent. We also identified 504 exceptionally long conserved elements (≥360 base pairs, ≥80% pairwise identity between human and other mammals) in intergenic and intronic regions of presynaptic genes. Many of these elements form a highly stable stem-loop RNA structure and consequently are candidates for novel regulatory elements, whereas some conserved noncoding elements are shown to correlate with specific gene expression profiles. The SynapseDB online database integrates these findings and other functional genomic resources for synaptic genes.

Conclusion: Highly conserved elements in the non-protein-coding regions of 150 presynaptic genes represent sequences that may be involved in the transcriptional or post-transcriptional regulation of these genes. Furthermore, comparative sequence analysis will facilitate the selection of genes and noncoding sequences for future functional studies and for variation studies in neurodevelopmental and psychiatric disorders.

Publication: Pricing Price Information in E-Commerce (2001-10-14)
Ungar, Lyle H

Shopbots and Internet sites that help users locate the best price for a product are changing the way people shop by providing valuable information on goods and services. This paper presents a first attempt to measure the value of one piece of information: the price charged for goods and services.
We first establish a theoretical limit on the value of price information for the first seller in a market that decides to sell price information to a shopbot, and quantify the revenues that the seller can expect to receive. We then discuss whether, and how much of, this theoretical value can actually be realized in equilibrium settings.

Publication: Unsupervised Distance Metric Learning Using Predictability (2008-06-13)
Gupta, Abhishek A; Foster, Dean P; Ungar, Lyle H

Distance-based learning methods, like clustering and SVMs, depend on good distance metrics. This paper does unsupervised metric learning in the context of clustering. We seek transformations of the data which give clean and well-separated clusters, where clean clusters are those for which membership can be accurately predicted. The transformation (and hence the distance metric) is obtained by minimizing the blur ratio, defined as the within-cluster variance divided by the total data variance in the transformed space. For the minimization we propose an iterative procedure, Clustering Predictions of Cluster Membership (CPCM). CPCM alternately (a) predicts cluster memberships (e.g., using linear regression) and (b) clusters these predictions (e.g., using k-means). With linear regression and k-means, this algorithm is guaranteed to converge to a fixed point. The resulting clusters are invariant to linear transformations of the original features, and tend to eliminate noise features by driving their weights to zero.
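The CPCM alternation described in the last abstract can be sketched in a few lines of numpy. This is a simplified illustration under assumptions of my own, not the authors' implementation: the initialization (k-means on the raw features), iteration counts, and the omission of the blur-ratio bookkeeping and convergence checks are all simplifications.

```python
import numpy as np

def kmeans(P, k, iters=20, seed=0):
    """Plain Lloyd's k-means, used as the clustering step."""
    rng = np.random.default_rng(seed)
    C = P[rng.choice(len(P), k, replace=False)]  # initial centers
    for _ in range(iters):
        labels = np.argmin(((P[:, None] - C) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                C[j] = P[labels == j].mean(0)
    return labels

def cpcm(X, k, iters=10, seed=0):
    """Sketch of Clustering Predictions of Cluster Membership (CPCM).

    Alternates (a) linear regression predicting one-hot cluster
    memberships from the features and (b) k-means on those predictions.
    """
    X = np.asarray(X, float)
    Xb = np.hstack([X, np.ones((len(X), 1))])   # add an intercept column
    labels = kmeans(X, k, seed=seed)            # illustrative initialization
    for _ in range(iters):
        Y = np.eye(k)[labels]                   # one-hot membership targets
        B, *_ = np.linalg.lstsq(Xb, Y, rcond=None)
        labels = kmeans(Xb @ B, k, seed=seed)   # cluster the predictions
    return labels
```

Because the regression step drives the weights of unpredictive (noise) features toward zero before each clustering step, the resulting clusters depend on the predictable structure of the data rather than on the raw feature scaling, matching the invariance property the abstract claims.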