Sparse-view 3D in the Wild

PhD Thesis, Tech. Report CMU-RI-TR-24-09, May 2024

Abstract

Reconstructing 3D scenes and objects from images alone has been a long-standing goal in computer vision. Recent years have seen tremendous progress, with methods now capable of producing near-photorealistic renderings from any viewpoint. However, existing approaches generally rely on a large number of input images (typically 50-100) to compute camera poses and ensure view consistency. This constraint limits their applicability, as capturing 100 high-quality images without motion blur can be burdensome for end users. To enable 3D reconstruction in unconstrained scenes, this thesis proposes techniques for sparse-view 3D that automatically estimate camera poses and reconstruct 3D objects in the wild from fewer than 10 images.

We start by exploring how implicit surfaces can be used to regularize 3D representations learned from sparse views. We demonstrate that our representation, which factors view-dependent specular effects from view-independent diffuse appearance, can robustly reconstruct 3D from as few as 4-8 images with noisy camera poses. However, acquiring this camera pose initialization in the first place is challenging. To address this, we propose an energy-based framework that predicts probability distributions over relative camera rotations; these distributions are then composed into coherent sets of camera rotations for sparse image sets. We then show that scaling our energy-based representation with a transformer-based architecture makes effective use of additional images: the added image context allows our method to resolve ambiguities that arise from just two views. While top-down energy-based pose estimation can effectively handle pose ambiguity, sampling poses from it can be slow, and it does not exploit low-level features that may provide useful cues for correspondence matching and geometric consistency. To address these issues, we propose to represent a camera as a bundle of rays in 3D, each passing from the camera center through the center of an image patch, and we train a diffusion-based denoising network to predict this representation. We find that this generic camera representation significantly improves pose accuracy.
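To make the composition step concrete, the sketch below picks absolute rotations from a discrete SO(3) hypothesis grid by greedy coordinate ascent against pairwise relative-rotation energies. This is a minimal sketch only: the pairwise energy here is a synthetic geodesic distance to hidden ground-truth rotations, standing in for the learned image-conditioned network, and the grid size, sweep count, and function names are illustrative assumptions rather than the thesis's implementation.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def total_score(assign, energy):
    """Summed pairwise log-compatibility (negative energy) of a full
    assignment of absolute rotations."""
    n = len(assign)
    return -sum(energy(i, j, assign[j] @ assign[i].T)
                for i in range(n) for j in range(i + 1, n))

def compose_rotations(n, grid, energy, sweeps=3):
    """Greedy coordinate ascent over a discrete SO(3) grid: re-pick each
    absolute rotation to maximize the total score while holding the others
    fixed. Image 0 stays at identity to fix the global gauge ambiguity."""
    assign = [np.eye(3) for _ in range(n)]
    for _ in range(sweeps):
        for k in range(1, n):
            assign[k] = max(
                grid,
                key=lambda c: total_score(assign[:k] + [c] + assign[k + 1:], energy),
            )
    return assign

# Synthetic stand-in for the learned pairwise energy: geodesic distance
# between a candidate relative rotation and a hidden ground-truth one.
true = [np.eye(3)] + [Rotation.random(random_state=s).as_matrix() for s in range(1, 4)]
energy = lambda i, j, rel: Rotation.from_matrix(true[j] @ true[i].T @ rel.T).magnitude()

# Discrete hypothesis grid; the true rotations are included so that exact
# recovery is possible in this toy setup.
grid = list(Rotation.random(512, random_state=7).as_matrix()) + true[1:]
estimates = compose_rotations(4, grid, energy)
```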
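The ray-bundle camera representation itself is straightforward to compute in the forward direction. Below is a minimal sketch, assuming a pinhole camera with intrinsics K and world-to-camera extrinsics (R, t), that converts a camera into world-space rays through a regular grid of patch centers; the function name and conventions are assumptions for illustration, not taken from the thesis.

```python
import numpy as np

def camera_to_ray_bundle(K, R, t, image_size, patches_per_side):
    """Convert a pinhole camera into a bundle of world-space rays, one
    through the center of each image patch.

    Assumes the world-to-camera convention x_cam = R @ x_world + t, so the
    camera center in world coordinates is c = -R.T @ t.
    """
    H, W = image_size
    # Pixel coordinates of the patch centers on a regular grid.
    us = (np.arange(patches_per_side) + 0.5) * (W / patches_per_side)
    vs = (np.arange(patches_per_side) + 0.5) * (H / patches_per_side)
    uu, vv = np.meshgrid(us, vs)
    pix = np.stack([uu.ravel(), vv.ravel(), np.ones(uu.size)], axis=0)  # (3, P)

    # Back-project patch centers into camera-frame directions, rotate to world.
    dirs = R.T @ (np.linalg.inv(K) @ pix)
    dirs /= np.linalg.norm(dirs, axis=0)           # unit directions, (3, P)

    center = -R.T @ t                              # shared ray origin
    origins = np.tile(center[:, None], (1, pix.shape[1]))
    return origins.T, dirs.T                       # each (P, 3)

# Example: an identity camera with 16x16 = 256 patch rays.
K = np.array([[500.0, 0.0, 128.0],
              [0.0, 500.0, 128.0],
              [0.0, 0.0, 1.0]])
origins, dirs = camera_to_ray_bundle(K, np.eye(3), np.zeros(3), (256, 256), 16)
assert origins.shape == (256, 3) and dirs.shape == (256, 3)
```

In the thesis's setting, the denoising diffusion network is trained to predict such per-patch rays directly from images; recovering an explicit camera from a predicted bundle then reduces to fitting the pose whose rays best agree with the prediction.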

BibTeX

@phdthesis{Zhang-2024-140557,
author = {Jason Y. Zhang},
title = {Sparse-view 3D in the Wild},
year = {2024},
month = {May},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-24-09},
keywords = {3D Reconstruction, Pose Estimation, 3D Computer Vision},
}