Building 4D Models of Objects and Scenes from Monocular Videos
Abstract
This thesis studies how to infer the time-varying 3D structure of generic deformable objects and dynamic scenes from monocular videos. In a casual setup without sufficient sensor observations or rich 3D supervision, one needs to tackle the challenges of registration, scale ambiguity, and limited views. Inspired by analysis-by-synthesis, we set up an inverse graphics problem and solve it with generic data-driven priors. Inverse graphics models approximate the true generation process of a video with differentiable operations (e.g., differentiable rendering and physics simulation), allowing one to inject prior knowledge about the physical world. Generic data-driven priors (e.g., motion correspondence, pixel descriptors, viewpoints) provide guidance for registering pixels to a canonical 3D space, which allows one to fuse observations over time and across similar instances. Building upon these ideas, we develop methods to capture 4D models of deformable objects and dynamic scenes from in-the-wild video footage.
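To make the analysis-by-synthesis loop concrete, here is a minimal sketch in PyTorch: a canonical 3D point set and per-frame deformations are optimized so that their differentiable projections match observed 2D point tracks, which stand in for the data-driven motion-correspondence priors mentioned above. Every name, shape, and loss term is an illustrative assumption, not the thesis's actual pipeline.

```python
import torch

T, N = 8, 200          # number of frames and canonical points (toy sizes)
focal = 500.0          # assumed pinhole focal length in pixels

# Learnable canonical shape and per-frame deformation offsets
canonical = torch.randn(N, 3, requires_grad=True)
deform = torch.zeros(T, N, 3, requires_grad=True)

# "Observations": 2D point tracks, e.g. from an off-the-shelf tracker (random here)
observed_2d = torch.rand(T, N, 2) * 512

def render_points(points_3d):
    """Differentiable pinhole projection (a toy stand-in for a renderer)."""
    z = points_3d[..., 2:3].clamp(min=1.0)   # keep points in front of the camera
    return focal * points_3d[..., :2] / z

opt = torch.optim.Adam([canonical, deform], lr=1e-2)
for step in range(500):
    posed = canonical + deform               # warp canonical shape to each frame
    reprojection = render_points(posed)      # "synthesize" the observations
    loss = (reprojection - observed_2d).abs().mean()   # compare against measured tracks
    loss = loss + 1e-3 * deform.pow(2).mean()          # prefer small deformations
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because every step of the synthesis is differentiable, gradients from the 2D discrepancy flow back to the canonical geometry and the per-frame motion, which is the core mechanism that lets priors about the physical world (here, a simple small-deformation regularizer) be injected into the reconstruction.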
BibTeX
@phdthesis{Yang-2023-137118,
author = {Gengshan Yang},
title = {Building 4D Models of Objects and Scenes from Monocular Videos},
year = {2023},
month = {July},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-23-54},
keywords = {4D Reconstruction from Videos; Inverse Graphics; Dynamic Scene Perception},
}