Active learning for logistic regression

Andrew Ian Schein, University of Pennsylvania


Which active learning methods can we expect to yield good performance in learning logistic regression classifiers? Addressing this question is a natural first step in providing robust solutions for active learning across a wide variety of exponential models, including maximum entropy, generalized linear, loglinear, and conditional random field models. We extend previous work on active learning using explicit objective functions by developing a framework for implementing a wide class of loss functions for active learning of logistic regression, including variance (A-optimality) and log loss reduction. We then compare these against the most widely used heuristic schemes, query by committee and uncertainty sampling, to discover which methods work best for different classes of problems and why. Our empirical evaluations are the largest effort to date to evaluate explicit objective function methods in active learning. The evaluation used ten data sets from domains such as image recognition and document classification. The data sets vary in number of categories from 2 to 26 and have as many as 6,191 predictors. This work establishes the benefits of these often cited (but rarely used) strategies, and counters the claim that experimental design methods are too computationally complex to run on interesting data sets. The two loss function methods, variance reduction and log loss reduction, were the only methods we tested that always performed at least as well as a randomly selected training set. The same data were used to evaluate several heuristic methods, including variants of uncertainty sampling and query by committee. Uncertainty sampling was tested using two different measures of uncertainty: Shannon entropy and margin size. Margin-based uncertainty sampling was found to be superior; however, both variants performed worse than random sampling at times. We show that these failures to match random sampling can be caused by predictor space regions of varying noise or by model mismatch.
The various heuristics produced mixed results overall in the evaluation, and no single heuristic can be singled out as clearly better than the others when classifier accuracy is the sole criterion for performance. Margin sampling is the favored approach when computational time is considered along with accuracy.
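
The two uncertainty measures named above can be illustrated with a minimal sketch. This is not the dissertation's code: the function names (`entropy_scores`, `margin_scores`, `pick_query`) and the toy probability matrix are assumptions made for illustration, and the class probabilities are taken as given rather than produced by a fitted logistic regression model. Entropy sampling queries the pool example whose predicted class distribution has the highest Shannon entropy; margin sampling queries the example with the smallest gap between its two most probable classes.

```python
import numpy as np

def entropy_scores(probs):
    """Shannon entropy of each row of class probabilities (higher = more uncertain)."""
    p = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return -np.sum(p * np.log(p), axis=1)

def margin_scores(probs):
    """Gap between the two largest class probabilities (smaller = more uncertain)."""
    top2 = np.sort(probs, axis=1)[:, -2:]  # two largest values per row, ascending
    return top2[:, 1] - top2[:, 0]

def pick_query(probs, measure="margin"):
    """Return the index of the pool example to label next (hypothetical helper)."""
    if measure == "entropy":
        return int(np.argmax(entropy_scores(probs)))
    if measure == "margin":
        return int(np.argmin(margin_scores(probs)))
    raise ValueError(f"unknown measure: {measure}")

# Toy predicted probabilities for three unlabeled examples and three classes
# (made-up numbers, assumed to come from some probabilistic classifier).
pool_probs = np.array([
    [0.500, 0.495, 0.005],  # two classes nearly tied, third ruled out
    [0.340, 0.330, 0.330],  # close to uniform over all classes
    [0.900, 0.050, 0.050],  # confident prediction
])

entropy_choice = pick_query(pool_probs, "entropy")
margin_choice = pick_query(pool_probs, "margin")
print(entropy_choice, margin_choice)
```

On this toy pool the two measures select different examples: entropy favors the near-uniform row, while margin favors the row where the top two classes are nearly tied. With only two categories the measures rank examples identically, so they can differ only on multi-category problems.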

Subject Area

Computer science|Artificial intelligence

Recommended Citation

Schein, Andrew Ian, "Active learning for logistic regression" (2005). Dissertations available from ProQuest. AAI3197737.