Abstract:
We explore how to infer the time-varying 3D structures of generic deformable objects and dynamic scenes from monocular videos. A solution to this problem is essential for virtual reality and robotics applications. However, inferring 4D structures from 2D observations is challenging because the problem is under-constrained. In a casual setup where there are neither complete sensor measurements nor rich 3D supervision, one needs to tackle three challenges: (1) Registration: how to find pixel correspondences and track the camera frame over time? (2) Scale ambiguity: how to lift 2D observations to 3D? (3) Occlusion: how to infer structures that are not observable due to self-occlusion or occlusion by other objects?
We first study the 4D reconstruction problem in a single-video setup and then extend it to multiple videos, different instances, and scenes. Inspired by analysis-by-synthesis, we set up an inverse graphics problem and solve it with generic data-driven priors. Inverse graphics models (e.g., differentiable rendering, differentiable physics simulation) approximate the true generation process of a video with differentiable operations, allowing one to inject prior knowledge about the physical world and compute gradients to update the model parameters. Generic data-driven priors (e.g., optical flow, pixel features, viewpoint) provide guidance to register pixels to a canonical 3D space, which allows us to fuse observations over time and across similar instances. Building upon these observations, we develop methods to capture 4D models of deformable objects and dynamic scenes from in-the-wild video footage. Finally, we show that offline-optimized 4D models can be distilled into efficient neural architectures, enabling real-time reconstruction.
Thesis Committee Members:
Deva Ramanan, Chair
Shubham Tulsiani
Jessica Hodgins
Yaser Sheikh
Angjoo Kanazawa, UC Berkeley