Towards Reconstructing Non-rigidity from Single Camera
Abstract
In this document, we study how to infer 3D from images captured by a single camera, without assuming that the target scenes or objects are static. The non-static setting makes our problem ill-posed and challenging to solve, but it is vital in practical applications where the target of interest is non-static. To solve ill-posed problems, the current trend in the field is to learn inference models, e.g., neural networks, on datasets with labeled ground truth. Instead, we attempt a data-less approach that does not require datasets with 3D annotations. This poor man's approach is beneficial for tasks that lack well-annotated datasets.
Our work is grouped into two parts.
(i) We first introduce our series of works on non-rigid structure from motion (NR-SfM) and its application to learning 3D landmark detectors from only 2D landmark annotations. Our general framework is a two-stage approach: we design a novel NR-SfM module to reconstruct shapes and camera poses from input 2D landmarks, and these reconstructions are then used to teach a neural network to detect 3D landmarks from image inputs. We propose techniques to make the NR-SfM module scalable to large datasets and robust to missing data. We also propose a new loss that lets the 3D landmark detector learn more efficiently from the NR-SfM module.
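As a conceptual illustration only (not the thesis implementation), the sketch below shows the two-stage idea: stage 1 fits a classic low-rank shape model with orthographic cameras to 2D landmarks, and stage 2 distills the recovered shapes into an image-based detector. The names `LowRankNRSfM`, the 6D rotation parameterization, and the stage-2 comments are illustrative assumptions.

import torch
import torch.nn as nn

class LowRankNRSfM(nn.Module):
    """Stage 1 (sketch): low-rank shape model, S_f = sum_k c_fk * B_k, with per-frame
    orthographic cameras, optimized against 2D landmark reprojection."""
    def __init__(self, n_frames, n_points, n_basis=8):
        super().__init__()
        self.basis = nn.Parameter(0.01 * torch.randn(n_basis, 3, n_points))   # shape basis B_k
        self.coeff = nn.Parameter(0.01 * torch.randn(n_frames, n_basis))      # coefficients c_fk
        self.rot6d = nn.Parameter(0.01 * torch.randn(n_frames, 6))            # camera rotations

    def rotations(self):
        # 6D parameterization -> rotation matrices via Gram-Schmidt
        a, b = self.rot6d[:, :3], self.rot6d[:, 3:]
        r1 = nn.functional.normalize(a, dim=-1)
        r2 = nn.functional.normalize(b - (r1 * b).sum(-1, keepdim=True) * r1, dim=-1)
        r3 = torch.cross(r1, r2, dim=-1)
        return torch.stack([r1, r2, r3], dim=1)                               # (F, 3, 3)

    def forward(self):
        shapes = torch.einsum('fk,kdp->fdp', self.coeff, self.basis)          # (F, 3, P)
        proj = torch.einsum('fij,fjp->fip', self.rotations()[:, :2], shapes)  # (F, 2, P)
        return shapes, proj

def reprojection_loss(proj, landmarks_2d, visible):
    """Robust (L1) reprojection error; `visible` masks missing 2D annotations."""
    err = (proj - landmarks_2d).abs().sum(dim=1)                              # (F, P)
    return (err * visible).sum() / visible.sum().clamp(min=1)

# Stage 2 (sketch): the optimized `shapes` act as pseudo-ground-truth to train an
# image-based 3D landmark detector, e.g.
#   pred = detector(images)                        # hypothetical CNN
#   distill_loss = (pred - shapes.detach()).abs().mean()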
(ii) We then present works on reconstructing dense dynamic scenes. Dense reconstruction is challenging for NR-SfM algorithms, mainly because it is difficult to obtain reliable long-term correspondences for every pixel. On the other hand, reconstructing every pixel of the scene is necessary for applications such as novel view synthesis. We therefore investigate solutions that do not require long-term correspondences. As a preliminary exploration, we first take a data-driven approach: we collect videos from the Internet and train a depth estimation network. Despite its simplicity, this approach lacks geometric reasoning and is consequently limited in its generalizability. We then explore an analysis-by-synthesis approach, in which we leverage recent advances in differentiable neural rendering and represent dynamic scenes with deformable neural radiance fields (D-NeRF). Prior D-NeRF-based methods use only a photometric loss for optimization, which we find is insufficient to recover rapid object motions. We present a new method for D-NeRFs that can use optical flow directly as supervision, overcoming the major challenge of the computational inefficiency of enforcing flow constraints on the deformation field used by D-NeRFs. We present results on novel view synthesis with rapid object motion and demonstrate significant improvements over baselines without flow supervision.
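To make the flow-supervision idea concrete, the minimal sketch below supervises a toy deformation field by comparing the 2D motion it induces between two time steps against observed optical flow. `DeformMLP`, the canonical-to-observed warp direction, and the pinhole projection are simplifying assumptions for illustration, not the thesis formulation (which in particular addresses the efficiency issue noted above).

import torch
import torch.nn as nn

class DeformMLP(nn.Module):
    """Toy deformation field: maps a canonical-space point to its location at
    time t (stand-in for the deformation field of a deformable NeRF)."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, x_canon, t):                  # x_canon: (N, 3), t: (N, 1)
        return x_canon + self.net(torch.cat([x_canon, t], dim=-1))

def project(points, K):
    """Pinhole projection of (N, 3) camera-space points to (N, 2) pixels."""
    uvw = points @ K.T
    return uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)

def flow_loss(deform, x_canon, t0, t1, K, observed_flow):
    """Compare the 2D motion induced by the deformation field between times
    t0 and t1 with the optical flow observed at the corresponding pixels."""
    p0 = project(deform(x_canon, t0), K)
    p1 = project(deform(x_canon, t1), K)
    return ((p1 - p0) - observed_flow).abs().mean()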
BibTeX
@phdthesis{Wang-2023-136153,
author = {Chaoyang Wang},
title = {Towards Reconstructing Non-rigidity from Single Camera},
year = {2023},
month = {May},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-23-25},
keywords = {NR-SfM, depth estimation, dynamic novel view synthesis, dynamic NeRF, deformable NeRF, 2D-3D lifting, unsupervised pose estimation},
}