Fine-Tuning Offline Reinforcement Learning with Model-Based Policy Optimization - Robotics Institute Carnegie Mellon University


Workshop Paper, NeurIPS '20 Offline Reinforcement Learning Workshop, December 2020

Abstract

In offline reinforcement learning (RL), we attempt to learn a control policy from a fixed dataset of environment interactions. This setting has the potential benefit of allowing us to learn effective policies without collecting additional interactive data, which can be expensive or dangerous in real-world systems. However, traditional off-policy RL methods tend to perform poorly in this setting due to the distributional shift between the fixed dataset and the learned policy. In particular, they tend to extrapolate optimistically and overestimate the action-values of state-actions outside the dataset distribution. Recently, two major avenues have been explored to address this issue: behavior-regularized methods, which penalize actions that deviate from the demonstrated action distribution, and uncertainty-aware model-based (MB) methods, which discourage visiting state-actions where the dynamics are uncertain. In this work, we propose to unify these two approaches into a single two-stage algorithmic framework. In the first stage, we train a policy with behavior-regularized model-free RL on the offline dataset. In the second stage, we fine-tune the policy using our Model-Based Behavior-Regularized Policy Optimization (MB2PO) algorithm. We demonstrate that for certain tasks and dataset distributions, our conservative model-based fine-tuning can greatly increase performance and allow the agent to generalize beyond and outperform the demonstrated behavior. We evaluate our method on a variety of the Gym-MuJoCo tasks in the D4RL benchmark and show that it is competitive with, and in some cases superior to, the state of the art on most of the evaluated tasks.
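The abstract's model-based ingredient, discouraging state-actions where the learned dynamics are uncertain, is commonly implemented by penalizing rewards on model rollouts with an ensemble-disagreement term. Below is a minimal sketch of that idea under stated assumptions: the toy linear ensemble, the function names, and the disagreement measure (standard deviation of ensemble predictions) are all illustrative choices, not the paper's actual MB2PO penalty.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_model():
    # Hypothetical toy dynamics model: a linear map with a randomly
    # perturbed weight, standing in for one member of a learned ensemble.
    W = rng.normal(1.0, 0.1)
    return lambda s, a: W * (s + a)

# Ensemble of slightly different dynamics models.
models = [make_model() for _ in range(5)]

def ensemble_disagreement(models, s, a):
    # Uncertainty proxy: spread of next-state predictions across the ensemble.
    preds = np.stack([m(s, a) for m in models])
    return float(preds.std())

def penalized_reward(r, models, s, a, lam=1.0):
    # Conservative reward for model rollouts: subtract an uncertainty
    # penalty so the policy avoids regions where the models disagree.
    return r - lam * ensemble_disagreement(models, s, a)

print(penalized_reward(1.0, models, s=0.5, a=0.2))
```

Because the penalty is nonnegative, the penalized reward never exceeds the raw reward, and the gap grows exactly where the ensemble disagrees, which is the conservative behavior the abstract attributes to uncertainty-aware MB methods.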

BibTeX

@inproceedings{Villaflor-2020-129559,
author = {Adam Villaflor and John M. Dolan and Jeff Schneider},
title = {Fine-Tuning Offline Reinforcement Learning with Model-Based Policy Optimization},
booktitle = {Proceedings of NeurIPS '20 Offline Reinforcement Learning Workshop},
year = {2020},
month = {December},
keywords = {offline reinforcement learning, model-based policy optimization, off-policy reinforcement learning},
}