Abstract:
Unlike most machine learning applications, robotics involves physical constraints that make off-the-shelf learning challenging. Difficulties in large-scale data collection and training present a major roadblock to applying today’s data-intensive algorithms. Robot learning faces a further obstacle in evaluation: every physical space is different, making results inconsistent across labs.
Two common assumptions of the robot learning paradigm limit data efficiency. First, an agent typically operates in an isolated environment with no prior knowledge or experience: learning is done tabula rasa. Second, agents typically receive only image observations as input, relying on vision alone to learn tasks. In the real world, however, humans learn with many senses across many environments and bring prior experience to new tasks. Lifting these assumptions is not only natural but also crucial for feasibility in real robotics, where it is cost-prohibitive to collect many samples from deployed physical systems.
In this thesis, I present work that lifts these two assumptions, improving the data efficiency of robot learning by leveraging multimodality and pretraining. First, I show how multimodal sensing, such as sight and sound, can provide rich self-supervision. Second, I introduce a framework for pretraining and evaluating self-supervised exploration via environment transfer. I then apply these ideas to real-world manipulation, combining the benefits of large-scale pretraining and multimodality through audio-visual pretraining for contact microphones. Finally, I introduce a real-robot benchmark for evaluating the generalization of both visual and policy learning methods via shared data and hardware.
Thesis Committee Members:
Abhinav Gupta, Chair
David Held
Shubham Tulsiani
Rob Fergus, New York University
Chelsea Finn, Stanford University