Stochastic Linear Optimization Under Bandit Feedback

Penn collection
Statistics Papers
Subject
Computer Sciences
Statistics and Probability
Author
Dani, Varsha
Hayes, Thomas P
Kakade, Sham M
Abstract

In the classical stochastic k-armed bandit problem, in each of a sequence of T rounds, a decision maker chooses one of k arms and incurs a cost chosen from an unknown distribution associated with that arm. The goal is to minimize regret, defined as the difference between the cost incurred by the algorithm and the optimal cost. In the linear optimization version of this problem (first considered by Auer [2002]), we view the arms as vectors in R^n and require that the costs be linear functions of the chosen vector. As before, it is assumed that the cost functions are sampled independently from an unknown distribution. In this setting, the goal is to find algorithms whose running time and regret behave well as functions of the number of rounds T and the dimensionality n (rather than the number of arms, k, which may be exponential in n or even infinite). We give a nearly complete characterization of this problem in terms of both upper and lower bounds for the regret. In certain special cases (such as when the decision region is a polytope), the regret is polylog(T). In general, though, the optimal regret is Θ*(√T); our lower bounds rule out the possibility of obtaining polylog(T) rates in general. We present two variants of an algorithm based on the idea of “upper confidence bounds.” The first, due to Auer [2002] but not fully analyzed there, obtains regret whose dependence on both n and T is essentially optimal, but it may be computationally intractable when the decision set is a polytope. The second version can be efficiently implemented when the decision set is a polytope (given as an intersection of half-spaces), but gives up a factor of √n in the regret bound. Our results also extend to the setting where the set of allowed decisions may change over time.
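
As a rough illustration of the “upper confidence bounds” idea described in the abstract, the following minimal Python sketch runs a LinUCB-style loop over a finite set of arms. It is a toy under stated assumptions: the fixed confidence radius beta, the Gaussian noise model, and all names (linucb_sketch, noise_std) are illustrative choices, not the confidence-ball construction or the analysis from the paper itself.

```python
import numpy as np

def linucb_sketch(arms, true_theta, T, noise_std=0.1, beta=2.0, seed=0):
    """Illustrative upper-confidence-bound loop for a finite decision set in R^n.

    arms: (k, n) array of decision vectors.
    true_theta: the unknown cost vector (used only to simulate feedback).
    Costs are minimized, so optimism means playing the arm with the
    smallest lower confidence bound on its expected cost.
    """
    rng = np.random.default_rng(seed)
    n = arms.shape[1]
    A = np.eye(n)                      # regularized Gram (design) matrix
    b = np.zeros(n)                    # running sum of cost-weighted plays
    best = (arms @ true_theta).min()   # benchmark: optimal expected cost
    regret = 0.0
    for _ in range(T):
        theta_hat = np.linalg.solve(A, b)   # ridge estimate of theta
        A_inv = np.linalg.inv(A)
        # per-arm confidence width: sqrt(x^T A^{-1} x)
        widths = np.sqrt(np.einsum('ij,jk,ik->i', arms, A_inv, arms))
        x = arms[np.argmin(arms @ theta_hat - beta * widths)]
        cost = x @ true_theta + noise_std * rng.standard_normal()
        A += np.outer(x, x)                 # update design matrix
        b += cost * x
        regret += x @ true_theta - best     # accumulate expected regret
    return regret

# Example usage: 50 random arms in R^5.
rng = np.random.default_rng(1)
arms = rng.standard_normal((50, 5))
theta = rng.standard_normal(5)
print(linucb_sketch(arms, theta, T=2000))
```

For a finite arm set this costs roughly O(kn² + n³) per round; the intractability the abstract mentions arises when the decision set is given implicitly (e.g., as a polytope), where choosing the optimistic point is no longer a simple argmin over k vectors.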

Date of presentation
2008-01-01