Ungar, Lyle H.

Search Results

  • Publication
    A-Optimality for Active Learning of Logistic Regression Classifiers
    (2004-01-01) Schein, Andrew I; Ungar, Lyle
    Over the last decade there has been growing interest in pool-based active learning techniques, where instead of receiving an i.i.d. sample from a pool of unlabeled data, a learner may take an active role in selecting examples from the pool. Queries to an oracle (a human annotator in most applications) provide label information for the selected observations, but at a cost. The challenge is to end up with a model that provides the best possible generalization error at the least cost. Popular methods such as uncertainty sampling often work well, but sometimes fail badly. We take the A-optimality criterion used in optimal experimental design, and extend it so that it can be used for pool-based active learning of logistic regression classifiers. A-optimality has attractive theoretical properties, and empirical evaluation confirms that it offers a more robust approach to active learning for logistic regression than alternatives.
  • Publication
    Unsupervised Distance Metric Learning Using Predictability
    (2008-06-13) Gupta, Abhishek A.; Foster, Dean P.; Ungar, Lyle H.
    Distance-based learning methods, like clustering and SVMs, are dependent on good distance metrics. This paper performs unsupervised metric learning in the context of clustering. We seek transformations of data which give clean and well-separated clusters, where clean clusters are those for which membership can be accurately predicted. The transformation (hence distance metric) is obtained by minimizing the blur ratio, which is defined as the ratio of the within-cluster variance to the total data variance in the transformed space. For minimization we propose an iterative procedure, Clustering Predictions of Cluster Membership (CPCM). CPCM alternately (a) predicts cluster memberships (e.g., using linear regression) and (b) clusters these predictions (e.g., using k-means). With linear regression and k-means, this algorithm is guaranteed to converge to a fixed point. The resulting clusters are invariant to linear transformations of original features, and tend to eliminate noise features by driving their weights to zero.
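The alternation described in this abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`kmeans`, `blur_ratio`, `cpcm`), the intercept column, the random k-means initialization, and the fixed number of rounds are all choices made for the example, and the sketch does not test for the fixed point mentioned in the abstract.

```python
import numpy as np


def kmeans(points, k, rng, n_iter=50):
    """Plain Lloyd's k-means; returns an integer label for each row."""
    # Start from k distinct rows chosen at random (an assumption of this sketch).
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels


def blur_ratio(points, labels):
    """Within-cluster sum of squares divided by total sum of squares."""
    total = ((points - points.mean(axis=0)) ** 2).sum()
    within = 0.0
    for j in np.unique(labels):
        members = points[labels == j]
        within += ((members - members.mean(axis=0)) ** 2).sum()
    return within / total


def cpcm(X, k, n_rounds=10, seed=0):
    """Alternate (a) regressing memberships on X and (b) clustering the predictions."""
    rng = np.random.default_rng(seed)
    X1 = np.hstack([X, np.ones((len(X), 1))])  # intercept column (assumed)
    labels = kmeans(X, k, rng)                 # initial clustering of the raw data
    for _ in range(n_rounds):
        Z = np.eye(k)[labels]                  # one-hot membership matrix
        W, *_ = np.linalg.lstsq(X1, Z, rcond=None)
        preds = X1 @ W                         # (a) predicted memberships
        labels = kmeans(preds, k, rng)         # (b) cluster the predictions
    return labels, preds, blur_ratio(preds, labels)
```

Calling `cpcm(X, k=2)` on a float array `X` of shape `(n_samples, n_features)` returns the final labels, the transformed points, and the blur ratio measured in the transformed space.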
  • Publication
    Spectral dimensionality reduction for HMMs
    (2012-03-29) Foster, Dean P; Rodu, Jordan; Ungar, Lyle
    Hidden Markov Models (HMMs) can be accurately approximated using co-occurrence frequencies of pairs and triples of observations by using a fast spectral method (Hsu et al., 2009), in contrast to the usual slow methods like EM or Gibbs sampling. We provide a new spectral method which significantly reduces the number of model parameters that need to be estimated, and generates a sample complexity that does not depend on the size of the observation vocabulary. We present an elementary proof giving bounds on the relative accuracy of probability estimates from our model. (Corollaries show our bounds can be weakened to provide either L1 bounds or KL bounds which provide easier direct comparisons to previous work.) Our theorem uses conditions that are checkable from the data, instead of putting conditions on the unobservable Markov transition matrix.
  • Publication
    Towards Structural Logistic Regression: Combining Relational and Statistical Learning
    (2002-07-23) Popescul, Alexandrin; Ungar, Lyle H; Lawrence, Steve; Pennock, David M
    Inductive logic programming (ILP) techniques are useful for analyzing data in multi-table relational databases. Learned rules can potentially discover relationships that are not obvious in "flattened" data. Statistical learners, on the other hand, are generally not constructed to search relational data; they expect to be presented with a single table containing a set of feature candidates. However, statistical learners often yield more accurate models than the logical forms of ILP, and can better handle certain types of data, such as counts. We propose a new approach which integrates structure navigation from ILP with regression modeling. Our approach propositionalizes the first-order rules at each step of ILP's relational structure search, generating features for potential inclusion in a regression model. Ideally, feature generation by ILP and feature selection by stepwise regression should be integrated into a single loop. Preliminary results for scientific literature classification are presented using a relational form of the data extracted by ResearchIndex (formerly CiteSeer). We use FOIL and logistic regression as our ILP and statistical components (decoupled at this stage). Word counts and citation-based features learned with FOIL are modeled together by logistic regression. The combination often significantly improves performance when high precision classification is desired.
  • Publication
    Pricing Price Information in E-Commerce
    (2001-10-14) Markopoulos, Panos M.; Ungar, Lyle H
    Shopbots and Internet sites that help users locate the best price for a product are changing the way people shop by providing valuable information on goods and services. This paper presents a first attempt to measure the value of one piece of information: the price charged for goods and services. We first establish a theoretical limit to the value of price information for the first seller in a market that decides to sell price information to a shopbot and quantify the revenues that the seller can expect to receive. We then proceed to discuss whether and how much of this theoretical value can actually be realized in equilibrium settings.
  • Publication
    Statistical Relational Learning for Document Mining
    (2003-11-19) Popescul, Alexandrin; Ungar, Lyle H; Lawrence, Steve; Pennock, David M.
    A major obstacle to fully integrated deployment of many data mining algorithms is the assumption that data sits in a single table, even though most real-world databases have complex relational structures. We propose an integrated approach to statistical modeling from relational databases. We structure the search space based on "refinement graphs", which are widely used in inductive logic programming for learning logic descriptions. The use of statistics allows us to extend the search space to include a richer set of features, including many which are not boolean. Search and model selection are integrated into a single process, allowing information criteria native to the statistical model, for example logistic regression, to make feature selection decisions in a step-wise manner. We present experimental results for the task of predicting where scientific papers will be published based on relational data taken from CiteSeer. Our approach results in classification accuracies superior to those achieved when using classical "flat" features. The resulting classifier can be used to recommend where to publish articles.
  • Publication
    Patterns of Sequence Conservation in Presynaptic Neural Genes
    (2006-11-10) Hadley, Dexter; Murphy, Tara; Valladares, Otto; Hannenhalli, Sridhar; Ungar, Lyle H.; Kim, Junhyong; Bucan, Maja
    Background: The neuronal synapse is a fundamental functional unit in the central nervous system of animals. Because synaptic function is evolutionarily conserved, we reasoned that functional sequences of genes and related genomic elements known to play important roles in neurotransmitter release would also be conserved. Results: Evolutionary rate analysis revealed that presynaptic proteins evolve slowly, although some members of large gene families exhibit accelerated evolutionary rates relative to other family members. Comparative sequence analysis of 46 megabases spanning 150 presynaptic genes identified more than 26,000 elements that are highly conserved in eight vertebrate species, as well as a small subset of sequences (6%) that are shared among unrelated presynaptic genes. Analysis of large gene families revealed that upstream and intronic regions of closely related family members are extremely divergent. We also identified 504 exceptionally long conserved elements (≥360 base pairs, ≥80% pair-wise identity between human and other mammals) in intergenic and intronic regions of presynaptic genes. Many of these elements form a highly stable stem-loop RNA structure and consequently are candidates for novel regulatory elements, whereas some conserved noncoding elements are shown to correlate with specific gene expression profiles. The SynapseDB online database integrates these findings and other functional genomic resources for synaptic genes. Conclusion: Highly conserved elements in nonprotein coding regions of 150 presynaptic genes represent sequences that may be involved in the transcriptional or post-transcriptional regulation of these genes. Furthermore, comparative sequence analysis will facilitate selection of genes and noncoding sequences for future functional studies and analysis of variation studies in neurodevelopmental and psychiatric disorders.
  • Publication
    PennAspect: Two-Way Aspect Model Implementation
    (2001-01-01) Schein, Andrew I; Popescul, Alexandrin; Ungar, Lyle H
    The two-way aspect model is a latent class statistical mixture model for performing soft clustering of co-occurrence data observations. It acts on data such as document/word pairs (words occurring in documents) or movie/people pairs (people see certain movies) to produce their joint distribution estimate. This document describes our software implementation of the aspect model, available under the GNU Public License (included with the distribution). We call this package PennAspect. The distribution is packaged as Java source and class files. The software comes with no guarantees of any kind. We welcome user feedback and comments. To download PennAspect, visit: http://www.cis.upenn.edu/datamining/software_dist/PennAspect/index.html.
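For readers unfamiliar with the model, the joint distribution it estimates is usually written as a mixture over a latent class. The form below is the commonly cited symmetric parameterization of the two-way aspect model; the symbols d (document or movie), w (word or person), and z (latent class) are notation chosen here, and whether PennAspect uses exactly this parameterization is an assumption.

```latex
P(d, w) \;=\; \sum_{z} P(z)\, P(d \mid z)\, P(w \mid z)
```

Soft clustering follows from the posterior P(z | d, w), which is proportional to P(z) P(d | z) P(w | z).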
  • Publication
    Cluster-based Concept Invention for Statistical Relational Learning
    (2004-08-22) Popescul, Alexandrin; Ungar, Lyle H.
    We use clustering to derive new relations which augment the database schema used in automatic generation of predictive features in statistical relational learning. Entities derived from clusters increase the expressivity of feature spaces by creating new first-class concepts which contribute to the creation of new features. For example, in CiteSeer, papers can be clustered based on words or citations giving "topics", and authors can be clustered based on documents they co-author giving "communities". Such cluster-derived concepts become part of more complex feature expressions. Out of the large number of generated features, those which improve predictive accuracy are kept in the model, as decided by statistical feature selection criteria. We present results demonstrating improved accuracy on two tasks, venue prediction and link prediction, using CiteSeer data.
  • Publication
    Active Learning for Logistic Regression: An Evaluation
    (2007-10-01) Schein, Andrew I; Ungar, Lyle H.
    Which active learning methods can we expect to yield good performance in learning binary and multi-category logistic regression classifiers? Addressing this question is a natural first step in providing robust solutions for active learning across a wide variety of exponential models including maximum entropy, generalized linear, log-linear, and conditional random field models. For the logistic regression model we re-derive the variance reduction method known in experimental design circles as 'A-optimality.' We then run comparisons against different variations of the most widely used heuristic schemes: query by committee and uncertainty sampling, to discover which methods work best for different classes of problems and why. We find that among the strategies tested, the experimental design methods are most likely to match or beat a random sample baseline. The heuristic alternatives produced mixed results, with an uncertainty sampling variant called margin sampling providing the most promising performance at very low computational cost. Computational running times of the experimental design methods were a bottleneck to the evaluations. Meanwhile, evaluation of the heuristic methods led to an accumulation of negative results. Such results demonstrate a need for improved active learning methods that will provide reliable performance at a reasonable computational cost.
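Margin sampling, as it is commonly defined, queries the pool examples whose two most probable classes are closest in predicted probability. The sketch below illustrates that common definition only; the function name `margin_sampling`, its parameters, and the assumption that class probabilities are already available as an array are hypothetical and may differ from the exact variant evaluated in the paper.

```python
import numpy as np


def margin_sampling(probs, n_queries=1):
    """Return indices of the pool examples with the smallest top-two margin.

    probs: array of shape (n_pool, n_classes) holding predicted class
    probabilities, e.g. from a fitted logistic regression model (assumed).
    """
    probs = np.asarray(probs, dtype=float)
    top_two = np.sort(probs, axis=1)[:, -2:]   # two largest probabilities per row
    margins = top_two[:, 1] - top_two[:, 0]    # best minus second best
    return np.argsort(margins)[:n_queries]     # smallest margins first
```

The cost per round is a sort over the pool's predicted probabilities, which is consistent with the low computational cost the abstract attributes to this heuristic.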