Carnegie Mellon University
Abstract:
Offline reinforcement learning (RL) holds promise as a means to learn high-value policies from a static dataset, without the need for further environment interactions. However, a key challenge in offline RL lies in effectively stitching together portions of suboptimal trajectories from the static dataset while avoiding extrapolation errors that arise from a lack of support in the dataset. Existing approaches use conservative objectives that favor pessimistic value functions, or rely on generative modelling with noisy Monte Carlo return-to-go samples for reward conditioning. A central difficulty is therefore identifying the behavioral primitives (a.k.a. skills) that exist in the offline dataset and chaining these behaviors together to produce high-value policies.
In this thesis, we investigate latent variable models with different levels of expressiveness to model skills as compressed latent vectors, and then compose these skills to solve a specific task. We first describe a Variational Autoencoder (VAE)-based method that learns a temporally abstract world model predicting the state outcome of executing a skill, and then uses this model to do Online Planning with Offline Skill Models (OPOSM). We then extend this method to instead work with a Vector-Quantized VAE to learn a bank of discrete latent skills (VQSkills). Finally, we investigate using latent diffusion models to learn a multimodal skill prior, and then use this prior to perform batch-constrained Q-learning. We call this algorithm Latent Diffusion Constrained Q-Learning (LDCQ). We empirically demonstrate the effectiveness of these algorithms at learning high-value policies on the D4RL benchmark suite.
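To make the skill-selection idea behind LDCQ concrete, below is a minimal illustrative sketch, not the thesis implementation: a placeholder diffusion skill prior proposes latent skills conditioned on the current state, a latent-space Q-function scores them, and only prior-sampled skills are ever executed (the batch-constrained step). All module names, shapes, and the Gaussian stand-in for the diffusion sampler are assumptions for illustration.

```python
# Illustrative sketch only: placeholder modules standing in for a pretrained
# latent diffusion skill prior, low-level skill decoder, and latent Q-function.
import torch
import torch.nn as nn

STATE_DIM, SKILL_DIM, ACTION_DIM = 17, 8, 6  # hypothetical dimensions

class DiffusionSkillPrior(nn.Module):
    """Stand-in for a latent skill prior p(z | s); a real diffusion model
    would iteratively denoise, here we just perturb a learned mean."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, SKILL_DIM))

    def sample(self, state, num_samples):
        mean = self.net(state).expand(num_samples, SKILL_DIM)
        return mean + 0.1 * torch.randn_like(mean)

class SkillCritic(nn.Module):
    """Q(s, z): estimated value of executing latent skill z from state s."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM + SKILL_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, state, z):
        x = torch.cat([state.expand(z.shape[0], -1), z], dim=-1)
        return self.net(x).squeeze(-1)

class SkillDecoder(nn.Module):
    """pi(a | s, z): low-level policy that maps (state, skill) to an action;
    in practice it is unrolled for a fixed skill horizon."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM + SKILL_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, ACTION_DIM))

    def forward(self, state, z):
        return self.net(torch.cat([state, z], dim=-1))

@torch.no_grad()
def select_skill(state, prior, critic, num_candidates=64):
    """Batch-constrained selection: only skills sampled from the prior are
    scored, keeping the chosen skill within the support of the offline data."""
    candidates = prior.sample(state, num_candidates)   # (N, SKILL_DIM)
    q_values = critic(state, candidates)               # (N,)
    return candidates[q_values.argmax()]

if __name__ == "__main__":
    prior, critic, decoder = DiffusionSkillPrior(), SkillCritic(), SkillDecoder()
    state = torch.randn(1, STATE_DIM)
    z = select_skill(state, prior, critic)
    action = decoder(state.squeeze(0), z)              # first low-level action of the chosen skill
    print(action.shape)
```

The same candidate-then-score pattern applies to the VAE and VQ-VAE variants, with the prior swapped for the corresponding skill model.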
Committee:
Prof. Jeff Schneider (advisor)
Prof. David Held
Lili Chen