TY - GEN
T1 - Striving for Simplicity and Performance in Off-Policy DRL: Output Normalization and Non-Uniform Sampling
T2 - 37th International Conference on Machine Learning, ICML 2020
AU - Wang, Che
AU - Wu, Yanqiu
AU - Vuong, Quan
AU - Ross, Keith
N1 - Publisher Copyright:
© 2020 by the Authors.
PY - 2020
Y1 - 2020
N2 - We aim to develop off-policy DRL algorithms that not only exceed state-of-the-art performance but are also simple and minimalistic. For standard continuous control benchmarks, Soft Actor-Critic (SAC), which employs entropy maximization, currently provides state-of-the-art performance. We first demonstrate that the entropy term in SAC addresses action saturation due to the bounded nature of the action spaces. With this insight, we propose a streamlined algorithm with a simple normalization scheme or with inverted gradients, and show that both approaches can match SAC's sample-efficiency performance without the need for entropy maximization. We then propose a simple non-uniform sampling method for selecting transitions from the replay buffer during training. Extensive experimental results demonstrate that our proposed sampling scheme leads to state-of-the-art sample efficiency on challenging continuous control tasks. We combine all of our findings into one simple algorithm, which we call Streamlined Off-Policy with Emphasizing Recent Experience, for which we provide robust public-domain code.
AB - We aim to develop off-policy DRL algorithms that not only exceed state-of-the-art performance but are also simple and minimalistic. For standard continuous control benchmarks, Soft Actor-Critic (SAC), which employs entropy maximization, currently provides state-of-the-art performance. We first demonstrate that the entropy term in SAC addresses action saturation due to the bounded nature of the action spaces. With this insight, we propose a streamlined algorithm with a simple normalization scheme or with inverted gradients, and show that both approaches can match SAC's sample-efficiency performance without the need for entropy maximization. We then propose a simple non-uniform sampling method for selecting transitions from the replay buffer during training. Extensive experimental results demonstrate that our proposed sampling scheme leads to state-of-the-art sample efficiency on challenging continuous control tasks. We combine all of our findings into one simple algorithm, which we call Streamlined Off-Policy with Emphasizing Recent Experience, for which we provide robust public-domain code.
UR - http://www.scopus.com/inward/record.url?scp=85098410145&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85098410145&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85098410145
T3 - 37th International Conference on Machine Learning, ICML 2020
SP - 10012
EP - 10022
BT - 37th International Conference on Machine Learning, ICML 2020
A2 - Daumé III, Hal
A2 - Singh, Aarti
PB - International Machine Learning Society (IMLS)
Y2 - 13 July 2020 through 18 July 2020
ER -