Dynamic 3D Reconstruction from the Crowd - Robotics Institute Carnegie Mellon University

PhD Thesis Proposal

Minh Phuoc Vo
Monday, May 1
4:00 pm to 5:00 pm
NSH 1109
Dynamic 3D Reconstruction from the Crowd

Abstract:

With the advent of affordable, high-quality smartphone cameras, any significant event, such as a wedding ceremony, a surprise birthday party, or a concert, can be easily captured from multiple cameras. Automatically organizing such large-scale visual data and creating a comprehensive 3D scene model for event browsing remains an unsolved problem. State-of-the-art 3D reconstruction algorithms are mostly applicable to static scenes, mainly due to the lack of triangulation constraints for dynamic objects and the difficulty of finding dense correspondences across cameras.

This thesis proposes a framework for dense shape reconstruction of dynamic scenes captured by multiple unsynchronized video cameras in unconstrained, crowd-captured settings. The key is to exploit additional cues for shape estimation from scene semantics, the physics of motion dynamics, and the physics of image formation. These cues let us solve the problem hierarchically in three stages: coarse human reconstruction using semantic priors, sparse but accurate trajectory reconstruction of salient features using a motion prior, and dense photorealistic reconstruction of object instances using shape and appearance priors.

More specifically, we first introduce a simple but effective multiview person association algorithm that is used not only for rough temporal alignment of the video sequences but also to jointly recover coarse camera poses and 3D human skeletons for crowded dynamic events. The key to our algorithm is a pose-insensitive human appearance feature embedding, learned by a Convolutional Neural Network (CNN) on a combination of surveillance image datasets. The model is fine-tuned on our test videos by harnessing labels automatically generated from mutual exclusion constraints, multiview constraints, and motion information.
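To make the association step concrete, here is a minimal sketch of how pose-insensitive embeddings could be matched across two views. The embeddings, the Euclidean cost, and the function name are illustrative assumptions, not the proposal's actual implementation; the one-to-one assignment stands in for the mutual exclusion constraint.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate_people(emb_a, emb_b):
    """Match person detections between two views by embedding distance.

    emb_a: (M, D) appearance embeddings of detections in camera A
    emb_b: (N, D) appearance embeddings of detections in camera B
    Returns a list of (i, j) index pairs, one per matched person.
    The one-to-one assignment enforces mutual exclusion: each
    person is matched at most once per view.
    """
    # Pairwise Euclidean distances between all embeddings (hypothetical cost).
    cost = np.linalg.norm(emb_a[:, None, :] - emb_b[None, :, :], axis=2)
    # Minimum-cost bipartite matching (Hungarian algorithm).
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols))
```

In a full system, matches that survive across many views would then feed the joint recovery of camera poses and 3D skeletons.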

Second, we present a spatiotemporal bundle adjustment algorithm that accurately estimates a sparse set of 3D trajectories of dynamic objects from multiple unsynchronized handheld video cameras. The lack of a triangulation constraint on dynamic points is addressed by carefully integrating a physics-based motion prior describing how points move over time. Our algorithm jointly optimizes camera intrinsics and extrinsics, 3D positions of static points, sub-frame temporal alignment between cameras, and 3D trajectories of dynamic points.
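A minimal sketch of the residuals such a spatiotemporal objective might combine for one dynamic point: reprojection error plus a constant-velocity motion prior. The function, its signature, and the constant-velocity choice are assumptions for illustration; the proposal's actual motion prior and parameterization may differ.

```python
import numpy as np

def spatiotemporal_residuals(points_3d, times, observations, project, lam=1.0):
    """Residual vector for one dynamic-point trajectory (hypothetical sketch).

    points_3d:    (T, 3) estimated 3D positions of the point over time
    times:        (T,) sub-frame capture times of the positions
    observations: list of (time_index, cam_index, (u, v)) image measurements
    project:      user-supplied function (cam_index, xyz) -> (u, v)
    lam:          weight of the motion prior vs. the reprojection term
    """
    res = []
    # Reprojection term: each 3D position should project onto its observation.
    for t, cam, uv in observations:
        res.extend(np.asarray(project(cam, points_3d[t])) - np.asarray(uv))
    # Constant-velocity motion prior: penalize velocity changes between
    # samples, supplying the constraint triangulation alone cannot provide.
    for t in range(1, len(points_3d) - 1):
        v0 = (points_3d[t] - points_3d[t - 1]) / (times[t] - times[t - 1])
        v1 = (points_3d[t + 1] - points_3d[t]) / (times[t + 1] - times[t])
        res.extend(lam * (v1 - v0))
    return np.array(res)
```

A nonlinear least-squares solver would minimize these residuals jointly over the trajectory, camera parameters, and the per-camera time offsets.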

Third, we propose a shading-aware spatiotemporal shape refinement algorithm to recover temporally consistent, detailed human body shapes. We implicitly solve for dense multiview correspondences by explicitly modeling the image formation: the image intensity is a function of the object surface normal, its reflectance, and the scene illumination. Guided by a sparse set of dynamic feature correspondences and by object shape and appearance priors, we jointly solve for the most likely shape, reflectance, and illumination that exactly reproduce the video frames.
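The image-formation model described above can be sketched, under a simplifying Lambertian assumption with a single directional light, as a forward renderer whose output the refinement would drive toward the observed frame. The function name and the Lambertian simplification are assumptions; the proposal's illumination model may be richer.

```python
import numpy as np

def render_intensity(normals, albedo, light):
    """Simplified Lambertian image-formation model (illustrative sketch).

    normals: (H, W, 3) unit surface normals per pixel
    albedo:  (H, W) diffuse reflectance per pixel
    light:   (3,) directional light (direction scaled by intensity)
    Returns the (H, W) predicted image intensity. A shading-aware
    refinement would adjust normals, albedo, and light so that this
    prediction matches the captured video frame.
    """
    # Per-pixel n . l, clamped so back-facing surfaces receive no light.
    shading = np.clip(normals @ light, 0.0, None)
    return albedo * shading
```

Because intensity depends on the surface normal, matching rendered and observed intensities constrains fine surface detail even where cross-view feature correspondences are unavailable.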

We assemble these three innovations into a pipeline for event virtualization. The proposed system will be validated on a collection of real-world social events captured from the crowd in unconstrained settings, offering an immersive 4D browsing experience of past joyful events.


Thesis Committee Members:
Srinivasa Narasimhan, Chair
Yaser Sheikh
Michael Kaess
Marc Pollefeys, ETH/Microsoft