On the model-based stochastic value gradient for continuous reinforcement learning

Brandon Amos, Samuel Stanton, Denis Yarats, Andrew Gordon Wilson

Research output: Contribution to journal › Conference article › peer-review

Abstract

For over a decade, model-based reinforcement learning has been seen as a way to leverage control-based domain knowledge to improve the sample efficiency of reinforcement learning agents. While model-based agents are conceptually appealing, their policies tend to lag behind those of model-free agents in terms of final reward, especially in non-trivial environments. In response, researchers have proposed model-based agents with increasingly complex components, from ensembles of probabilistic dynamics models to heuristics for mitigating model error. In a reversal of this trend, we show that simple model-based agents can be derived from existing ideas that not only match, but outperform state-of-the-art model-free agents in terms of both sample efficiency and final reward. We find that a model-free soft value estimate for policy evaluation paired with a model-based stochastic value gradient for policy improvement is an effective combination, achieving state-of-the-art results on a high-dimensional humanoid control task which most model-based agents are unable to solve. Our findings suggest that model-based policy evaluation deserves closer attention.
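To make the combination described above concrete, the sketch below shows one plausible form of a short-horizon stochastic value gradient objective: a learned dynamics model is rolled forward from real states with reparameterized actions, and the tail of the return is bootstrapped with a model-free soft Q-estimate. This is an illustrative sketch under stated assumptions, not the authors' released implementation; `policy`, `dynamics_model`, `reward_model`, `soft_q`, and `horizon` are hypothetical placeholders, and `policy(s)` and `dynamics_model(s, a)` are assumed to return reparameterizable torch distributions.

```python
# Illustrative sketch (not the authors' code) of a short-horizon stochastic
# value gradient: model-based rollout for policy improvement, model-free soft
# value estimate as the terminal bootstrap.
import torch

def svg_policy_loss(states, policy, dynamics_model, reward_model, soft_q,
                    horizon=3, gamma=0.99, alpha=0.1):
    """Negative H-step soft stochastic value gradient objective for a batch of states."""
    s = states
    total = torch.zeros(states.shape[0], device=states.device)
    discount = 1.0
    for _ in range(horizon):
        dist = policy(s)                      # reparameterized action distribution (assumed)
        a = dist.rsample()                    # pathwise sample enables the value gradient
        log_prob = dist.log_prob(a).sum(-1)
        r = reward_model(s, a).squeeze(-1)
        total = total + discount * (r - alpha * log_prob)   # entropy-regularized ("soft") reward
        s = dynamics_model(s, a).rsample()    # sample next state from the learned model (assumed)
        discount *= gamma
    # Bootstrap the remainder of the return with a model-free soft Q-estimate.
    dist = policy(s)
    a = dist.rsample()
    log_prob = dist.log_prob(a).sum(-1)
    total = total + discount * (soft_q(s, a).squeeze(-1) - alpha * log_prob)
    return -total.mean()                      # minimize the negative objective with any optimizer
```

Differentiating this loss with respect to the policy parameters and taking a gradient step would constitute the policy improvement phase, while the soft Q-function would be fit separately from real transitions in the usual model-free way.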

Original language: English (US)
Pages (from-to): 6-20
Number of pages: 15
Journal: Proceedings of Machine Learning Research
Volume: 144
State: Published - 2021
Event: 3rd Annual Conference on Learning for Dynamics and Control, L4DC 2021 - Virtual, Online, Switzerland
Duration: Jun 7, 2021 - Jun 8, 2021

Keywords

  • Model-based control
  • Reinforcement learning
  • Value gradient

ASJC Scopus subject areas

  • Artificial Intelligence
  • Software
  • Control and Systems Engineering
  • Statistics and Probability
