Blackwell online learning for Markov decision processes

Tao Li, Guanze Peng, Quanyan Zhu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

This work provides a novel interpretation of Markov decision processes (MDPs) from the online optimization viewpoint. In this online optimization context, the policy of the MDP is viewed as the decision variable, while the corresponding value function is treated as payoff feedback from the environment. Based on this interpretation, we construct a Blackwell game induced by the MDP, which bridges the gap among regret minimization, Blackwell approachability theory, and learning theory for MDPs. Specifically, based on approachability theory, we propose 1) Blackwell value iteration for offline planning and 2) Blackwell Q-learning for online learning in MDPs, both of which are shown to converge to the optimal solution. Our theoretical guarantees are corroborated by numerical experiments.

Original language: English (US)
Title of host publication: 2021 55th Annual Conference on Information Sciences and Systems, CISS 2021
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781665412681
State: Published - Mar 24 2021
Event: 55th Annual Conference on Information Sciences and Systems, CISS 2021 - Baltimore, United States
Duration: Mar 24 2021 to Mar 26 2021

Publication series

Name: 2021 55th Annual Conference on Information Sciences and Systems, CISS 2021

Conference

Conference: 55th Annual Conference on Information Sciences and Systems, CISS 2021
Country: United States
City: Baltimore
Period: 3/24/21 to 3/26/21

Keywords

  • Blackwell approachability
  • No-regret learning
  • Online optimization
  • Reinforcement learning

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Networks and Communications
  • Computer Science Applications
  • Information Systems
  • Information Systems and Management
