PhD Thesis Proposal
Vidhi Jain
PhD Student
Robotics Institute,
Carnegie Mellon University
Thursday, August 8
12:00 pm to 1:30 pm
NSH 4305
Multimodal Representations for Adaptable Robot Policies in Human-Inhabited Spaces
Abstract:
Human beings sense and express themselves through multiple modalities. To capture these multimodal ways of communicating, I want to build adaptable robot policies that infer task pragmatics from video and language prompts, reason about sound and other sensory signals, take actions, and learn the mannerisms of interacting with people and objects. Existing robot policies rely on visual observations of the environment and structured language goals. These assumptions, however, limit both the robot's sensory view of the environment and the expressivity with which a user can specify the desired task. In this thesis, I present learning approaches for adaptable robot policies in which different modalities explicitly and implicitly convey task constraints.
The thesis proposal is organized into two parts: (1) completed and ongoing work on the video, language, and audio modalities, and (2) proposed work combining multiple modalities for fast adaptation of robotics foundation models such as OpenVLA. First, I present how we can train robot policies to infer the underlying tasks and preferences from visual demonstrations. I show how cross-attention transformers can infer the task implicit in a demonstration and perform simulated dish-loading tasks. I then apply the same philosophy to train policies that infer the underlying task semantics from the raw pixels of a prompt video and execute it in the robot's own environment. Second, I challenge the strong assumptions behind language-based goal conditioning in robot policies. In one of the main works, I present sample-efficient robot policies that hierarchically decompose language into a sequence of interaction points and their relative waypoints.
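To give a flavor of this kind of video conditioning, the sketch below shows a policy that cross-attends from the robot's current observation to the frames of a prompt video. It is a minimal illustration under assumed module names, encoder, and dimensions, not the architecture developed in the thesis:

import torch
import torch.nn as nn

class VideoConditionedPolicy(nn.Module):
    """Illustrative sketch: a policy that cross-attends from the robot's
    current observation to a prompt video to infer the demonstrated task.
    All module names and dimensions here are hypothetical."""

    def __init__(self, embed_dim: int = 256, n_heads: int = 8, action_dim: int = 7):
        super().__init__()
        # Shared visual encoder for prompt-video frames and the live observation.
        self.frame_encoder = nn.Sequential(
            nn.Flatten(), nn.Linear(3 * 64 * 64, embed_dim), nn.ReLU()
        )
        # Cross-attention: the observation embedding queries the video tokens.
        self.cross_attn = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)
        self.action_head = nn.Linear(embed_dim, action_dim)

    def forward(self, prompt_video: torch.Tensor, obs: torch.Tensor) -> torch.Tensor:
        # prompt_video: (B, T, 3, 64, 64); obs: (B, 3, 64, 64)
        B, T = prompt_video.shape[:2]
        video_tokens = self.frame_encoder(prompt_video.flatten(0, 1)).view(B, T, -1)
        query = self.frame_encoder(obs).unsqueeze(1)           # (B, 1, D)
        task_context, _ = self.cross_attn(query, video_tokens, video_tokens)
        return self.action_head(task_context.squeeze(1))       # (B, action_dim)

policy = VideoConditionedPolicy()
actions = policy(torch.randn(2, 16, 3, 64, 64), torch.randn(2, 3, 64, 64))
print(actions.shape)  # torch.Size([2, 7])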
I am currently developing a learning algorithm that lets a robot efficiently predict how loud its action noise will be at a listener's location and plan its actions accordingly. Humans inherently understand how their actions shape the acoustic environment around them, and home robots need this ability too. We train our model to visually predict how loud the robot's noise will sound to a listener at different indoor locations. Through these works, I study each modality individually, in particular how it can broaden the range of tasks a robot performs and make robots at home easier to use. Moving forward, I propose to enhance state perception and task specification for more rapid adaptation and versatile robot control. Having examined how robots can explicitly and implicitly understand a specified task through different sensing modalities, my aim is to develop fast adaptation algorithms that connect foundation models by adapting them to visual cues and pragmatic task instructions.
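As a minimal sketch of what such visual loudness prediction could look like (every component below, from the encoders to the dB regression head, is an assumption for illustration rather than the thesis model), one could regress perceived loudness from a scene image, a planned action, and the listener's relative position:

import torch
import torch.nn as nn

class LoudnessPredictor(nn.Module):
    """Illustrative sketch: regress how loud a robot action will sound (in dB)
    at a listener's location, given a scene image and the listener's relative
    position. Architecture and inputs are hypothetical."""

    def __init__(self, embed_dim: int = 128, n_actions: int = 10):
        super().__init__()
        self.scene_encoder = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, embed_dim),
        )
        self.action_embed = nn.Embedding(n_actions, embed_dim)
        self.listener_mlp = nn.Linear(3, embed_dim)  # (x, y, z) offset to listener
        self.head = nn.Linear(3 * embed_dim, 1)      # predicted loudness in dB

    def forward(self, image, action_id, listener_pos):
        feats = torch.cat([
            self.scene_encoder(image),
            self.action_embed(action_id),
            self.listener_mlp(listener_pos),
        ], dim=-1)
        return self.head(feats)

# A planner could penalize candidate actions whose predicted loudness at the
# listener's location exceeds a comfort threshold.
model = LoudnessPredictor()
db = model(torch.randn(1, 3, 96, 96), torch.tensor([3]), torch.randn(1, 3))
print(db.shape)  # torch.Size([1, 1])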
Thesis Committee Members:
Yonatan Bisk, Chair
Oliver Kroemer
Henny Admoni
Dieter Fox, University of Washington and NVIDIA