Learning Off-Policy with Online Planning
Abstract
We propose Learning Off-Policy with Online Planning (LOOP), which combines techniques from model-based and model-free reinforcement learning. The agent learns a model of the environment and then uses trajectory optimization with the learned model to select actions. To sidestep the myopia of fixed-horizon trajectory optimization, a value function is attached to the end of the planning horizon. This value function is learned through off-policy reinforcement learning, using trajectory optimization as its behavior policy. Furthermore, we introduce "actor-guided" trajectory optimization to mitigate the actor-divergence issue in the proposed method. We benchmark our method on continuous control tasks and demonstrate that it offers a significant improvement over the underlying model-based and model-free algorithms.
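To make the action-selection idea concrete, below is a minimal, hedged sketch of LOOP-style planning: H-step trajectory optimization (here, a simple cross-entropy method) through a learned dynamics model, with a learned value function added at the end of the horizon so the planner is not myopic. This is not the authors' implementation; `dynamics_model`, `reward_fn`, `value_fn`, and the CEM hyperparameters are illustrative placeholders.

```python
import numpy as np

def plan_action(state, dynamics_model, reward_fn, value_fn, action_dim,
                horizon=10, n_candidates=500, n_elites=50, n_iters=5, gamma=0.99):
    """Pick an action by optimizing H-step model return plus a terminal value (CEM sketch)."""
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        # Sample candidate action sequences from the current sampling distribution.
        actions = mean + std * np.random.randn(n_candidates, horizon, action_dim)
        returns = np.zeros(n_candidates)
        for i in range(n_candidates):
            s, discount = state, 1.0
            for t in range(horizon):
                returns[i] += discount * reward_fn(s, actions[i, t])
                s = dynamics_model(s, actions[i, t])  # rollout through the learned model
                discount *= gamma
            # Terminal value function counters the myopia of a fixed planning horizon.
            returns[i] += discount * value_fn(s)
        # Refit the sampling distribution to the elite action sequences.
        elites = actions[np.argsort(returns)[-n_elites:]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean[0]  # execute only the first action, MPC-style
```

In the paper's framing, the value function used at the horizon is trained with off-policy RL on data gathered by this planner, and "actor-guided" trajectory optimization would additionally bias the candidate sampling toward an actor's proposals; the sketch above omits both training loops for brevity.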
BibTeX
@workshop{Sikchi-2020-125595,
author = {Harshit Sikchi and Wenxuan Zhou and David Held},
title = {Learning Off-Policy with Online Planning},
booktitle = {Proceedings of ICML '20 Inductive Biases, Invariances and Generalization in Reinforcement Learning Workshop},
year = {2020},
month = {July},
keywords = {Online Planning, Model-based Reinforcement Learning, Trajectory Optimization, Reinforcement Learning},
}