Learning 3D Registration and Reconstruction from the Visual World

PhD Thesis, Tech. Report CMU-RI-TR-21-13, Robotics Institute, Carnegie Mellon University, June 2021

Abstract

Humans develop a strong sense of 3D geometry by looking around in the visual world. Through visual perception alone, not only can we recover a mental 3D representation of what we are looking at, but we can also recognize where we are observing the scene from. Recovering the 3D scene representation from RGB images (i.e., 3D reconstruction) and localizing the camera frames (i.e., registration) are two long-standing problems in computer vision. Solving both tasks simultaneously makes for an even more challenging chicken-and-egg problem: recovering the 3D structure requires observations with accurate camera poses, while localizing the cameras requires reliable correspondences from the reconstruction.
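As a toy illustration of this coupling (our own sketch, not an algorithm from the thesis), consider recovering a 2D point set observed under unknown per-view rotations. Estimating each rotation requires a structure estimate, and estimating the structure requires the rotations, so the two are solved in alternation:

import numpy as np

rng = np.random.default_rng(0)
structure_gt = rng.normal(size=(50, 2))            # ground-truth 2D "scene" points
angles_gt = rng.uniform(-np.pi, np.pi, size=5)     # ground-truth per-view rotations

def rot(a):
    return np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])

# Each view observes the structure under its own rotation, plus noise.
views = [structure_gt @ rot(a).T + 0.01 * rng.normal(size=(50, 2)) for a in angles_gt]

Rs = [np.eye(2) for _ in views]
structure = views[0].copy()    # anchors the gauge to view 0's frame

for _ in range(20):
    # Registration step: best orthogonal alignment of the current structure
    # to each view (orthogonal Procrustes; reflections ignored for simplicity).
    for i, obs in enumerate(views):
        U, _, Vt = np.linalg.svd(structure.T @ obs)
        Rs[i] = (U @ Vt).T                         # obs ~= structure @ Rs[i].T
    # Reconstruction step: undo each pose and average the observations.
    structure = np.mean([views[i] @ Rs[i] for i in range(len(views))], axis=0)

The recovered structure and rotations are only defined up to a global rotation, which the initialization anchors; the same gauge ambiguity appears in full structure from motion.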

In this thesis, we explore the problem of learning geometric alignment and dense 3D reconstruction from images and videos using self-supervised learning techniques. Toward this end, we discuss the general importance of factorizing geometric information from visual data. First, we build the theoretical foundations with learning-based planar registration for images. We show that incorporating geometric priors into learning models increases learning efficacy for alignment algorithms, and we demonstrate both discriminative and generative applications of image registration with modern deep neural networks. Second, we explore more complex 3D shape priors parametrized by neural networks, which we train from images without 3D supervision by utilizing differentiable rendering techniques. We develop methods for learning from multi-view depth observations and even from single-view supervision with static RGB images. Finally, we investigate the challenging problem of jointly optimizing 3D registration and reconstruction. Given a video sequence, we demonstrate how one can exploit pretrained 3D shape priors to register and refine the shape reconstruction against the video, as well as a more generic rendering prior for learning neural 3D scene representations from unknown camera poses.
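To make the first ingredient concrete, here is a minimal sketch (an assumed setup, not the thesis code) of direct planar registration: gradient descent on the parameters of a 2D affine warp that photometrically aligns a source image to a target. This is the classical direct-alignment objective that learning-based registration methods build on.

import torch
import torch.nn.functional as F

def warp(img, p):
    # Warp img (1, C, H, W) by the 2x3 affine matrix p via bilinear sampling.
    grid = F.affine_grid(p.unsqueeze(0), img.shape, align_corners=False)
    return F.grid_sample(img, grid, align_corners=False)

# A smooth synthetic image keeps the photometric loss well-behaved.
xs = torch.linspace(-1, 1, 64)
yy, xx = torch.meshgrid(xs, xs, indexing="ij")
source = torch.exp(-(xx**2 + yy**2) / 0.1).reshape(1, 1, 64, 64)
target = warp(source, torch.tensor([[1.0, 0.0, 0.1], [0.0, 1.0, -0.05]]))

# Start from the identity warp and minimize the photometric error.
p = torch.tensor([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], requires_grad=True)
opt = torch.optim.Adam([p], lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss = F.mse_loss(warp(source, p), target)
    loss.backward()
    opt.step()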

Images and videos contain rich and detailed information about the 3D world. Baking suitable geometric priors into learning models allows them to effectively recover both the dense 3D scene structure and the corresponding camera poses, using image synthesis as the proxy objective. We believe this is an essential ingredient toward scalable, in-the-wild learning of spatial 3D understanding for future AI systems.
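The following sketch illustrates what "image synthesis as the proxy objective" means in the simplest possible setting; the 2D setup, names, and shapes are illustrative assumptions, not the thesis's models. A small coordinate MLP (standing in for a neural scene representation) and per-frame 2D offsets (standing in for camera poses) are optimized jointly to re-synthesize the observed frames:

import torch
import torch.nn as nn

class Scene(nn.Module):
    # Maps a 2D coordinate to a grayscale intensity.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 1))
    def forward(self, xy):
        return self.net(xy)

H = W = 32
yy, xx = torch.meshgrid(torch.linspace(-1, 1, H),
                        torch.linspace(-1, 1, W), indexing="ij")
coords = torch.stack([xx, yy], dim=-1).reshape(-1, 2)    # (H*W, 2) pixel grid

# Synthetic "video": the same pattern observed under unknown 2D shifts.
true_shifts = torch.tensor([[0.0, 0.0], [0.1, -0.05], [-0.08, 0.12]])
pattern = lambda xy: torch.sin(4 * xy[:, :1]) * torch.cos(4 * xy[:, 1:])
frames = [pattern(coords + s) for s in true_shifts]

scene = Scene()
shifts = nn.Parameter(torch.zeros(len(frames), 2))       # unknown "poses"
opt = torch.optim.Adam(list(scene.parameters()) + [shifts], lr=1e-2)
for _ in range(500):
    opt.zero_grad()
    # Photometric loss: the synthesized frames must match the observed ones.
    loss = sum(((scene(coords + shifts[i]) - frames[i]) ** 2).mean()
               for i in range(len(frames)))
    loss.backward()
    opt.step()

As in structure from motion, the scene and the offsets are only recovered up to a global shift; anchoring one frame's pose would fix this gauge freedom.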

BibTeX

@phdthesis{Lin-2021-127981,
author = {Chen-Hsuan Lin},
title = {Learning 3D Registration and Reconstruction from the Visual World},
year = {2021},
month = {June},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-21-13},
keywords = {registration, image alignment, dense 3D reconstruction, self-supervised learning, structure from motion, multi-view geometry, photometric optimization, neural rendering, neural scene representations},
}