Carnegie Mellon University
Abstract:
Currently, robot manipulation is a special-purpose tool, restricted to isolated environments with a fixed set of objects. To make robot manipulation more general, robots need to perceive and interact with a large number of objects in cluttered scenes. Traditionally, object pose has been used as a representation to facilitate these interactions. While object pose has many benefits, several limitations become apparent when we investigate how to train an object pose estimator. Training a pose estimator typically requires collecting a large dataset of annotated object images for supervision. This data collection can be costly, and pose estimators trained on such datasets do not generalize to novel objects outside the training dataset. Further, the pose representation itself does not capture task-specific object interactions.
In this thesis, we explore different methods of alleviating these limitations of training object pose estimators. First, to estimate the pose of objects that were unknown at training time, we introduce a novel method for zero-shot object pose estimation in clutter that combines classical pose hypothesis generation with a learned scoring function. Second, we evaluate the convergence properties of learning pose estimation from relative pose annotations using gradient-based optimization methods. We find that naively using such supervision can lead to poor convergence. Building on this analysis, we develop a method that better leverages relative annotations when training pose estimators with gradient-based optimization. Finally, we develop a method to model the object-to-object relationships required for completing a task. Rather than separately estimating the pose of each object, we show how to learn to estimate a task-specific relative pose from a small number of demonstrations, in a way that generalizes to novel objects. We find that such a formulation is naturally translationally equivariant and is able to focus on the components of each object that are key to completing the given task.
Thesis Committee Members:
David Held, Co-chair
Martial Hebert, Co-chair
Oliver Kroemer
Silvio Savarese, Stanford University / Salesforce