On-Policy Deep Reinforcement Learning for the Average-Reward Criterion

Yiming Zhang, Keith W. Ross

    Research output: Chapter in Book/Report/Conference proceedingConference contribution


    We develop theory and algorithms for average-reward on-policy Reinforcement Learning (RL). We first consider bounding the difference of the long-term average reward for two policies. We show that previous work based on the discounted return (Schulman et al., 2015; Achiam et al., 2017) results in a non-meaningful bound in the average-reward setting. By addressing the average-reward criterion directly, we then derive a novel bound which depends on the average divergence between the two policies and Kemeny's constant. Based on this bound, we develop an iterative procedure which produces a sequence of monotonically improved policies for the average reward criterion. This iterative procedure can then be combined with classic DRL (Deep Reinforcement Learning) methods, resulting in practical DRL algorithms that target the long-run average reward criterion. In particular, we demonstrate that Average-Reward TRPO (ATRPO), which adapts the on-policy TRPO algorithm to the average-reward criterion, significantly outperforms TRPO in the most challenging MuJuCo environments.

    Original languageEnglish (US)
    Title of host publicationProceedings of the 38th International Conference on Machine Learning, ICML 2021
    PublisherML Research Press
    Number of pages11
    ISBN (Electronic)9781713845065
    StatePublished - 2021
    Event38th International Conference on Machine Learning, ICML 2021 - Virtual, Online
    Duration: Jul 18 2021Jul 24 2021

    Publication series

    NameProceedings of Machine Learning Research
    ISSN (Electronic)2640-3498


    Conference38th International Conference on Machine Learning, ICML 2021
    CityVirtual, Online

    ASJC Scopus subject areas

    • Artificial Intelligence
    • Software
    • Control and Systems Engineering
    • Statistics and Probability


    Dive into the research topics of 'On-Policy Deep Reinforcement Learning for the Average-Reward Criterion'. Together they form a unique fingerprint.

    Cite this