Abstract:
Reconstructing 3D scenes and objects from images alone has been a long-standing goal in computer vision. Recent years have seen tremendous progress, with methods now capable of producing near photo-realistic renderings from arbitrary viewpoints. However, existing approaches generally rely on a large number of input images (typically 50-100) to compute camera poses and ensure view consistency. This constraint limits the applicability of these methods, as capturing 100 high-quality images free of motion blur can be burdensome for end users. To enable 3D reconstruction in unconstrained settings, this thesis proposes techniques for sparse-view 3D: automatically estimating camera poses and reconstructing 3D objects in the wild from fewer than 10 images.
We start by exploring how implicit surfaces can be used to regularize 3D representations learned from sparse views. Our proposed representation captures the geometry of the scene as a watertight surface and models view-dependent appearance by factoring it into diffuse color (albedo) and specular lighting. We demonstrate that this representation can robustly reconstruct 3D from as few as 4-8 images associated with noisy camera poses.
However, acquiring this camera pose initialization in the first place is challenging. To address this, we propose an energy-based framework that predicts probability distributions over relative camera rotations. Given a sparse set of images, these distributions are then composed into coherent sets of camera rotations. We then show that scaling our energy-based representation with a transformer-based architecture allows it to make effective use of more images; we find that this additional image context resolves ambiguities that arise when only two images are available. To predict full 6D poses, we also propose a new coordinate system that disentangles predicted camera translations from rotations. Our method generalizes effectively to new object categories and in-the-wild images.
While top-down energy-based pose estimation can handle pose ambiguity effectively, sampling poses from it can be slow, and it does not make use of low-level features that may provide useful cues for correspondence matching and geometric consistency. To address these issues, we propose to represent a camera as a bundle of rays in 3D, passing from the camera center through the center of each image patch. We then train a diffusion-based denoising network to predict this representation. We find that this generic camera representation significantly improves pose accuracy.
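As an illustration of the ray-bundle idea, the sketch below shows one way a pinhole camera could be converted into one ray per image patch. This is a minimal sketch under stated assumptions, not the thesis's exact formulation: the function name, the Plucker (direction, moment) parameterization, and the world-to-camera convention x_cam = R @ x_world + t are all assumptions introduced here for clarity.

import numpy as np

def camera_to_ray_bundle(K, R, t, image_size, patch_size):
    """Convert a pinhole camera (K, R, t) into a bundle of rays, one per image
    patch. Each ray is returned as a 6D vector: its unit direction and its
    Plucker moment (camera_center x direction), both in world coordinates.

    Assumes the world-to-camera convention x_cam = R @ x_world + t.
    """
    H, W = image_size
    # Pixel coordinates of the patch centers on a regular grid.
    us = np.arange(patch_size / 2, W, patch_size)
    vs = np.arange(patch_size / 2, H, patch_size)
    uu, vv = np.meshgrid(us, vs)
    pix = np.stack([uu, vv, np.ones_like(uu)], axis=-1).reshape(-1, 3)

    # Unproject patch centers to viewing directions, first in the camera
    # frame (K^{-1} p), then rotated into the world frame (R^T d).
    dirs_cam = pix @ np.linalg.inv(K).T
    dirs_world = dirs_cam @ R
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)

    # Camera center in world coordinates and per-ray Plucker moments.
    center = -R.T @ t
    moments = np.cross(center, dirs_world)

    return np.concatenate([dirs_world, moments], axis=-1)

Under this kind of parameterization, a denoising network can be trained to map noisy per-patch rays (conditioned on image features) back to clean rays, from which a conventional camera pose can be recovered.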
Thesis Committee Members:
Deva Ramanan, Co-chair
Shubham Tulsiani, Co-chair
Martial Hebert
William Freeman, Massachusetts Institute of Technology
Noah Snavely, Cornell University