Advances in Neural Information Processing Systems
We consider the policy search approach to reinforcement learning. We show that if a “baseline distribution” is given (indicating roughly how often we expect a good policy to visit each state), then we can derive a policy search algorithm that terminates in a finite number of steps, and for which we can provide non-trivial performance guarantees. We also demonstrate this algorithm on several grid-world POMDPs, a planar biped walking robot, and a double-pole balancing problem.
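The algorithm sketched in the abstract, Policy Search by Dynamic Programming (PSDP), learns a nonstationary policy backward from the horizon, fitting each step's policy under the baseline state distribution. A minimal tabular sketch follows; the function name, array layout, and the use of an exact transition model are illustrative assumptions, not the paper's experimental setup. With an unrestricted tabular policy class the baseline weighting is vacuous and the procedure reduces to exact finite-horizon dynamic programming; the weighting matters when the policy class is restricted.

```python
import numpy as np

def psdp(P, R, T, mu):
    """Illustrative sketch of Policy Search by Dynamic Programming.

    Works backward from the horizon: at each step t, choose the policy
    pi_t that maximizes expected value, with the later policies
    pi_{t+1..T-1} already fixed.

    P:  (S, A, S) transition probabilities P[s, a, s'].
    R:  (S, A) expected immediate rewards.
    mu: (T, S) baseline state distributions (PSDP's extra input; with
        the unrestricted tabular class used below it does not change
        the per-state argmax, so it is carried only for exposition).
    """
    S, A, _ = P.shape
    V_next = np.zeros(S)            # value-to-go under pi_{t+1..T-1}
    policies = [None] * T
    for t in reversed(range(T)):
        Q = R + P @ V_next          # Q_t(s, a) given the later policies
        # A restricted policy class would instead fit pi_t to maximize
        # the mu[t]-weighted value; the argmax is the unrestricted case.
        policies[t] = Q.argmax(axis=1)
        V_next = Q[np.arange(S), policies[t]]
    return policies
```

Each backward pass is a single supervised-style fitting step, which is why the procedure terminates after exactly T iterations, matching the finite-termination guarantee mentioned in the abstract.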
Bagnell, J. A., Kakade, S., Ng, A. Y., & Schneider, J. G. (2003). Policy Search by Dynamic Programming. Advances in Neural Information Processing Systems, 16. Retrieved from https://repository.upenn.edu/statistics_papers/465
Date Posted: 27 November 2017