Abstract:
The longstanding dream of many roboticists is to see robots perform diverse tasks in diverse environments. To build a robot that can operate anywhere, many methods train on robotic interaction data. While these approaches have led to significant advances, they rely on heavily engineered setups or large amounts of supervision, neither of which is scalable. How can we move towards training robots that operate autonomously, in the wild? Unlike computer vision and natural language processing, in which a staggering amount of data is available on the internet, robotics faces a chicken-and-egg problem: to train robots to work in diverse scenarios, we need a large amount of robot data from diverse environments, but to collect this kind of data, we need robots to be deployed widely – which is feasible only if they are already proficient. How can we break this deadlock?
The proposed solution, and the goal of my thesis, is to use an omnipresent source of rich interaction data — humans. Fortunately, there are plenty of real-world human interaction videos on the internet, which can help bootstrap robot learning by sidestepping the expensive aspects of the data-collection-and-training loop. To this end, we aim to learn manipulation by watching humans perform various tasks. We circumvent the embodiment gap by imitating the effect the human has on the environment rather than the exact actions: we obtain interaction priors from human videos and subsequently practice directly in the real world to improve. To move beyond explicit human supervision, the second work in the thesis aims to predict robot-centric visual affordances: where to interact and how to move post-interaction, directly from offline human video datasets. We show that this model can be seamlessly integrated into any robot learning paradigm. However, visual affordances may struggle to capture complex action spaces, especially for high-degree-of-freedom robots such as dexterous hands. Thus, in the third and fourth works of the thesis, we explore how to learn more explicit, physically grounded action priors from human videos, mainly in the context of dexterous manipulation. Finally, the proposed work focuses on learning general-purpose, actionable representations and on predicting affordances that are both physically grounded, i.e., contain 3D knowledge and explicit high-dimensional actions, and conducive to functional manipulation.
Thesis Committee Members:
Deepak Pathak, Co-chair
Abhinav Gupta, Co-chair
Katerina Fragkiadaki
Russ Tedrake, MIT
Shuran Song, Stanford