Title: 6D Object Pose Estimation for Manipulation via Weak Supervision
Abstract:
6D object pose estimation is essential for robotic manipulation tasks. Existing learning-based pose estimators are typically trained on labeled absolute poses in fixed object canonical frames, which (1) requires datasets with absolute-pose annotations that are resource-intensive to collect, and (2) generalizes poorly to novel configurations and unseen objects. Instead, we propose to investigate relative poses of two kinds: (a) the relative pose of a single object across different configurations; (b) the relative pose between pairs of interacting objects in manipulation tasks. In this thesis, we show that using relative poses as weak supervision yields better label efficiency and better generalizability to novel object configurations and unseen objects.
In the first work, we investigate the problem of learning an image-based object pose estimator self-supervised by relative object poses. However, the underlying local rotation averaging problems are difficult to converge during training due to the wrap-around nature of the rotation group SO(3). To tackle this, we propose a new algorithm that uses Modified Rodrigues Parameters to stereographically project 3D rotations from the closed manifold SO(3) to the open manifold R^3, allowing optimization to proceed on an open manifold where convergence is more likely. Empirically, we show that the proposed algorithm converges to a consistent relative orientation frame much faster than algorithms that operate purely in SO(3), in turn enabling the training of pose estimators self-supervised by relative poses.
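To make the projection idea concrete, the minimal Python sketch below uses SciPy's Modified Rodrigues Parameter conversions to map rotations into R^3, averages them with a plain Euclidean mean, and maps the result back to SO(3). The helper name average_rotations_mrp and the naive mean are illustrative assumptions, not the thesis algorithm.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def average_rotations_mrp(rotations):
    """Average rotations by projecting to MRP space (illustrative sketch).

    Modified Rodrigues Parameters p = axis * tan(theta / 4) stereographically
    project rotations onto R^3, so a plain Euclidean mean can stand in for a
    local rotation average. Assumes the inputs are clustered together, away
    from the MRP singularity near 360 degrees.
    """
    # Map each rotation to its MRP vector in R^3.
    mrps = np.stack([r.as_mrp() for r in rotations])
    # Average on the open manifold R^3, then map back to SO(3).
    return R.from_mrp(mrps.mean(axis=0))

# Usage: average small perturbations around a common orientation.
base = R.from_euler("xyz", [0.3, -0.2, 0.1])
noisy = [base * R.from_rotvec(0.05 * np.random.randn(3)) for _ in range(10)]
mean_rot = average_rotations_mrp(noisy)
print(mean_rot.as_euler("xyz"))
```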
In the second work, we study the problem of learning task-specific relative poses between interacting objects to solve manipulation tasks. For example, hanging a mug on a rack requires reasoning about the relative pose between the mug and the rack. We conjecture that the relative pose between objects is a generalizable notion of a manipulation task that can transfer to new objects in the same category. We call this notion "cross-pose" and propose a vision-based method that learns to estimate the cross-pose between objects, which then guides a downstream motion planner. Finally, we empirically show that our system generalizes to unseen objects in both simulation and the real world from very few demonstrations.
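As a toy illustration of the cross-pose notion (the thesis method predicts this quantity from vision rather than from known object frames), the sketch below treats the cross-pose as a relative SE(3) transform between 4x4 object poses and reuses it to produce a goal pose for a motion planner. All names and numbers here are hypothetical.

```python
import numpy as np

def relative_pose(T_a, T_b):
    """Relative (cross-)pose of object b expressed in object a's frame.

    T_a, T_b are 4x4 homogeneous transforms of the two objects in a common
    world frame. (Illustrative definition only.)
    """
    return np.linalg.inv(T_a) @ T_b

def target_pose(T_a, T_cross):
    """World pose object b should reach to realize the given cross-pose."""
    return T_a @ T_cross

# Usage: recover the cross-pose from one demonstration, then reuse it to
# place the mug on a differently posed rack.
T_rack_demo, T_mug_demo = np.eye(4), np.eye(4)
T_mug_demo[:3, 3] = [0.1, 0.0, 0.2]            # hypothetical demo poses
T_cross = relative_pose(T_rack_demo, T_mug_demo)

T_rack_new = np.eye(4)
T_rack_new[:3, 3] = [0.5, -0.3, 0.0]           # rack in a new configuration
T_mug_goal = target_pose(T_rack_new, T_cross)  # goal for the motion planner
```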
Thesis Committee:
David Held, Chair
Shubham Tulsiani
Mohit Sharma