Off-Policy Temporal Difference Learning For Robotics And Autonomous Systems

Degree type
Doctor of Philosophy (PhD)
Graduate group
Electrical & Systems Engineering
Discipline
Artificial Intelligence and Robotics
Computer Sciences
Robotics
Subject
Bayesian Inference
Deep Reinforcement Learning
Information Acquisition
Machine Learning
Reinforcement Learning
Robotics
Copyright date
2021-08-31
Author
Jeong, Heejin
Abstract

Reinforcement learning (RL) is a rapidly advancing field with applications in autonomous vehicles, medicine, finance, and many other domains. In particular, off-policy temporal difference (TD) learning, a specific class of RL techniques, has been widely used in a variety of autonomous tasks. However, significant challenges remain before it can be applied successfully to real-world problems. In this thesis, we address several major challenges in off-policy TD learning.

In the first part of the thesis, we introduce an efficient method for learning complex stand-up motions of humanoid robots with Q-learning. Standing up after falling is an essential ability for humanoid robots, yet flexible stand-up motions for diverse fallen positions are difficult to learn because of the complexity of the task. We reduce the sample complexity of learning by applying a clustering method and exploiting the bilateral symmetry of humanoid robots. The learned policy is demonstrated both in simulation and on a physical robot.

The greedy update of Q-learning, however, often causes overoptimism and instability. In the second part of the thesis, we propose a novel Bayesian approach to Q-learning, called ADFQ, which mitigates these issues by providing a principled way of updating Q-values based on the uncertainty of Q-belief distributions. The algorithm converges to Q-learning as the uncertainty approaches zero, and its computational efficiency allows it to be extended with a neural network. Both ADFQ and its neural-network extension outperform comparable baseline algorithms, reducing estimation bias and converging faster to optimal Q-values.

In the last part of the thesis, we apply off-policy TD methods to the active information acquisition problem, in which an autonomous agent is tasked with acquiring information about targets of interest. Off-policy TD learning addresses two classical challenges in this problem: dependence on a system model and the difficulty of computing information-theoretic cost functions over a long planning horizon. In particular, we introduce a method for learning a unified policy for in-sight tracking, navigation, and exploration. The policy exhibits robust behavior when tracking agile and anomalous targets with a partially known target model.
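To make the overoptimism issue mentioned in the abstract concrete, the sketch below shows the standard tabular Q-learning update with its greedy (max) bootstrap target. This is a generic Python illustration under assumed names (the function `q_learning_update` and a NumPy table `Q` indexed by state and action are hypothetical); it is not code from the thesis and does not implement ADFQ.

```python
import numpy as np

# Minimal tabular Q-learning sketch (illustrative only; the names and the
# state/action indexing are assumptions, not code from the thesis).
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One Q-learning step on a NumPy table Q of shape (n_states, n_actions)."""
    # Greedy (max) bootstrap target: taking the max over noisy Q-value
    # estimates is the "greedy update" that tends to overestimate values.
    target = r + gamma * np.max(Q[s_next])
    # Move the current estimate toward the target by the learning rate alpha.
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```

A Bayesian treatment such as ADFQ replaces the point estimates in `Q` with belief distributions and weights the update by their uncertainty; as the abstract states, the update reduces to this greedy rule as the uncertainty approaches zero.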

Advisor
Daniel D. Lee
George J. Pappas
Date of degree
2020-01-01