
Abstract:
To solve general manipulation tasks in real-world environments, robots must be able to perceive and condition their manipulation policies on the 3D world. These agents will need to understand various common-sense spatial/geometric concepts about manipulation tasks: that local geometry can suggest potential manipulation strategies; that changes in observation viewpoint shouldn't affect the interpretation of the environment; and that policies should adapt when object configurations change. This thesis explores learning algorithms and visual representations that can imbue agents with generalizable geometric reasoning capabilities while learning from only a small number of demonstrations or examples.
We first explore how agents can learn generalizable 3D affordance representations for articulated objects such as doors and drawers. We propose a family of 3D visual representations which describe the motion constraints for every point on an articulated object. We demonstrate that when trained on a small dataset of simulated articulated objects, our family of 3D affordance representations generalizes zero-shot to novel instances of seen object categories, entirely unseen object categories, objects perceived with real-world sensors, and objects with fundamental ambiguities or uncertainties.
Next, we explore how agents can learn task-critical geometric relationships for object rearrangement tasks from a small number of demonstrations. We design a family of dense 3D representations which can learn correspondence relationships across rigid and non-rigid objects, precisely extract desired rigid-body transformations using novel reasoning layers, and exhibit desirable invariance/equivariance properties under scene transformations. We also explore how these representations can be leveraged to solve sequential rearrangement tasks by integrating behavior cloning and planning.
Finally, we will explore how agents can learn new geometric skills by watching human demonstrations and reasoning explicitly about the geometry of the task. We propose a hierarchical policy-learning framework which factors skill learning into a geometric learning step and a low-level policy learning step. This hierarchy is designed to enable a shared geometric representation space when learning from both human demonstrations and robot experiences. We provide initial experiments on a real-world system, and describe ongoing work to understand whether this approach can improve the sample efficiency of on-robot learning while preserving generality.
Thesis Committee Members:
David Held (Chair)
Shubham Tulsiani
Oliver Kroemer
Yuke Zhu (UT Austin)
Jon Scholz (Google DeepMind)