
Abstract:
Generative robot policies have shown remarkable potential in learning complex, multimodal behaviors from demonstrations. However, at runtime, they still exhibit diverse failures ranging from task incompletion (e.g., toppling or dropping objects) to misaligned behaviors (e.g., placing the gripper inside a cup of water). Instead of constantly re-training the policies with new data, we seek to leverage the potential of imperfect generative policies by letting them propose candidate action plans and executing only those that are verified to lead to desirable outcomes. We formalize this underlying problem of policy steering as a stochastic model predictive control (MPC) problem. This formulation reveals two key capabilities: (1) predicting the outcomes of candidate action plans and (2) verifying their alignment with the task context and user intent. To unlock runtime policy steering in open-world environments, we propose leveraging world models for dynamics prediction and vision-language models (VLMs) as verifiers, harnessing their respective strengths in dynamics modeling and common-sense reasoning. However, because world models and VLMs operate over fundamentally different representations (latent states versus text tokens), we introduce a latent-alignment finetuning strategy that enables VLMs to translate nuanced motion details from predicted latent states into behavior narrations, allowing the VLM to reason over these details across diverse task contexts for optimal plan selection. Our hardware results across three robotic manipulation tasks show that our fully autonomous policy steering framework improves a base generative imitation-based policy by over 30%, even for novel task descriptions, and that our latent-aligned VLM approach outperforms (by ∼40%) alternative VLM approaches that do not decouple outcome prediction from verification.
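As a rough illustration (with all notation assumed here for exposition rather than taken from the thesis), the policy steering problem above can be sketched as choosing, at each replanning step, the candidate plan with the highest expected verifier score under predicted rollouts:

\[
a^{*}_{t:t+H} \;=\; \arg\max_{a^{(k)}_{t:t+H} \in \mathcal{A}_t} \; \mathbb{E}\!\left[\, V\!\big(\hat{s}^{(k)}_{t+1:t+H},\, \ell\big) \,\right],
\qquad
\mathcal{A}_t = \big\{ a^{(k)}_{t:t+H} \sim \pi(\cdot \mid s_t) \big\}_{k=1}^{K},
\qquad
\hat{s}^{(k)}_{t+1:t+H} \sim f_\theta(\cdot \mid s_t, a^{(k)}_{t:t+H}),
\]

where, in this assumed notation, \(\pi\) is the base generative policy proposing K candidate action plans, \(f_\theta\) is the world model that rolls out predicted (latent) states, \(V\) is the VLM-based verifier scoring alignment with the task description \(\ell\), and the expectation is over the stochastic predicted rollouts.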
Committee:
Andrea Bajcsy (Chair)
Oliver Kroemer
Yonatan Bisk
Michelle Zhao