Abstract:
In this talk, I will describe a data-driven method for inferring camera poses given a sparse collection of images of an arbitrary object. This task is a core component of classic geometric pipelines such as structure-from-motion (SfM), and also serves as a vital pre-processing step for contemporary neural approaches (e.g. NeRF) to object reconstruction. In contrast to existing correspondence-driven methods, which do not perform well given sparse views, we propose a top-down, prediction-driven approach for estimating camera poses. Our key technical insight is the use of an energy-based formulation for representing distributions over relative camera transformations, which allows us to explicitly represent the multiple modes arising from object symmetries and visually ambiguous views. Leveraging these relative predictions, we jointly estimate a consistent set of camera poses from multiple images. We show that, given sparse images, our approach outperforms state-of-the-art SfM and SLAM methods as well as direct pose regression, on both seen and unseen object categories. Our system can serve as a stepping stone toward in-the-wild reconstruction from multi-view datasets.
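To make the energy-based formulation concrete, here is a minimal sketch, not the talk's actual model: a network assigns an energy to each (image pair, candidate relative rotation) tuple, and a softmax over negative energies yields a distribution over rotations that can retain multiple modes (e.g. under object symmetry). The class name, feature dimension, MLP architecture, and candidate-sampling scheme are all hypothetical choices for illustration.

```python
# Minimal sketch (assumed architecture, not the authors' implementation) of an
# energy-based model over relative rotations between two views of an object.
import torch
import torch.nn as nn
from scipy.spatial.transform import Rotation

FEAT_DIM = 128  # hypothetical per-image feature size


class RelativeRotationEBM(nn.Module):
    def __init__(self):
        super().__init__()
        # Energy head: pair features (2 * FEAT_DIM) + a flattened 3x3 rotation.
        self.energy = nn.Sequential(
            nn.Linear(2 * FEAT_DIM + 9, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, feat_i, feat_j, rotations):
        # rotations: (K, 3, 3) candidate relative rotations.
        k = rotations.shape[0]
        pair = torch.cat([feat_i, feat_j]).expand(k, -1)
        x = torch.cat([pair, rotations.reshape(k, 9)], dim=1)
        energies = self.energy(x).squeeze(-1)   # (K,) unnormalized energies
        return torch.softmax(-energies, dim=0)  # p(R | I_i, I_j) over candidates


# Usage: score K randomly sampled candidate rotations for one image pair.
model = RelativeRotationEBM()
feat_i, feat_j = torch.randn(FEAT_DIM), torch.randn(FEAT_DIM)
candidates = torch.tensor(Rotation.random(500).as_matrix(), dtype=torch.float32)
probs = model(feat_i, feat_j, candidates)  # can be multi-modal over SO(3)
```

Because the output is a full distribution over candidate rotations rather than a single regressed pose, a symmetric object simply produces several high-probability candidates; a downstream joint-estimation step can then pick the mode per pair that is globally consistent across all images.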
Committee:
Deva Ramanan
Abhinav Gupta
David Held
Brian Okorn