Addressing Optimism Bias in Sequence Modeling for Reinforcement Learning

Adam Villaflor, Zhe Huang, Swapnil Pande, John M. Dolan, and Jeff Schneider

Conference Paper, Proceedings of the International Conference on Machine Learning (ICML), pp. 22270-22283, July 2022

Abstract

Impressive results in natural language processing (NLP) based on the Transformer neural network architecture have inspired researchers to explore viewing offline reinforcement learning (RL) as a generic sequence modeling problem. Recent works based on this paradigm have achieved state-of-the-art results in several of the mostly deterministic offline Atari and D4RL benchmarks. However, because these methods jointly model the states and actions as a single sequencing problem, they struggle to disentangle the effects of the policy and world dynamics on the return. Thus, in adversarial or stochastic environments, these methods lead to overly optimistic behavior that can be dangerous in safety-critical systems like autonomous driving. In this work, we propose a method that addresses this optimism bias by explicitly disentangling the policy and world models, which allows us at test time to search for policies that are robust to multiple possible futures in the environment. We demonstrate our method's superior performance on a variety of autonomous driving tasks in simulation.
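As a rough illustration of the optimism bias described in the abstract, the toy sketch below compares three ways of picking an action in a one-step stochastic setting. It is not the authors' method: the `WORLD_MODEL` table, the action names, and all probabilities and returns are hypothetical values chosen only to make the effect visible. The "optimistic" planner scores each action by its best possible future, as if the agent also controlled the environment's outcome, while the other two planners treat the environment's outcome as outside the agent's control.

```python
# Toy one-step "world model": each action maps to (probability, return) pairs.
# Every name and number here is hypothetical and exists only for illustration.
WORLD_MODEL = {
    "merge_now": [(0.8, 10.0), (0.2, -100.0)],  # usually fine, sometimes a crash
    "wait": [(1.0, 5.0)],                       # always a modest return
}


def optimistic_choice(model):
    """Score each action by its best-case return (outcomes treated as controllable)."""
    return max(model, key=lambda action: max(ret for _, ret in model[action]))


def expected_choice(model):
    """Score each action by its expected return over the modeled futures."""
    return max(model, key=lambda action: sum(p * ret for p, ret in model[action]))


def worst_case_choice(model):
    """Score each action by its worst-case return over the modeled futures."""
    return max(model, key=lambda action: min(ret for _, ret in model[action]))


if __name__ == "__main__":
    print("optimistic:", optimistic_choice(WORLD_MODEL))
    print("expected:  ", expected_choice(WORLD_MODEL))
    print("worst-case:", worst_case_choice(WORLD_MODEL))
```

With these assumed numbers, the optimistic planner prefers `merge_now` because its best outcome (10) beats the best outcome of `wait` (5), even though its expected return is 0.8 × 10 + 0.2 × (−100) = −12. Scoring actions against several possible futures, by expectation or by worst case, selects `wait` instead.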

BibTeX

@conference{Villaflor-2022-134789,
author = {Adam Villaflor and Zhe Huang and Swapnil Pande and John M. Dolan and Jeff Schneider},
title = {Addressing Optimism Bias in Sequence Modeling for Reinforcement Learning},
booktitle = {Proceedings of (ICML) International Conference on Machine Learning},
year = {2022},
month = {July},
pages = {22270--22283},
keywords = {offline reinforcement learning, optimism bias},
}
