Multi-Human 3D Reconstruction from Monocular Videos - Robotics Institute Carnegie Mellon University

PhD Thesis Defense

Rawal Khirodkar
PhD Student, Robotics Institute, Carnegie Mellon University
Tuesday, September 5
3:00 pm to 5:00 pm
NSH 4305
Multi-Human 3D Reconstruction from Monocular Videos

Abstract:
We study the problem of multi-human 3D reconstruction from videos captured in the wild. Human movements are dynamic, and accurately reconstructing them in various settings is crucial for developing immersive social telepresence, assistive humanoid robots, and augmented reality systems. However, creating such a system requires addressing fundamental issues with previous works regarding the data and model architectures. In this thesis, we develop two things: several large-scale 3D benchmarks designed to evaluate multi-human reconstruction under demanding conditions, and top-down algorithms that are robust to occlusion and crowded environments.

Data: Obtaining 3D supervision at scale for deep learning models is crucial for achieving real-world generalization. However, unlike large-scale 2D datasets, 3D datasets are significantly limited in diversity, primarily because manual annotation in 3D space is impractical. Consequently, most 3D benchmarks are limited to indoor environments or, at most, two human subjects outdoors, with stationary or slow camera movements and minimal occlusion. To address this gap, we explore using 3D synthetic data and construct two real multi-human 3D datasets that incorporate dynamic human activities, rapid camera movements, and human-human contact, all of which are largely neglected in previous benchmarks. These datasets are designed to highlight the critical limitations of existing methods.

Methodology: A general multi-human 3D reconstruction method should be robust to scale variations and occlusions and incorporate absolute depth understanding. We introduce algorithms with these traits in 2D and 3D settings, which enable reasoning about multiple humans in dynamic environments and crowded scenes. Our top-down approach exploits spatial-contextual information to reason about severely occluded humans in the 3D scene.

Building upon these two components, we develop general 3D methods that reconstruct multiple humans in dynamic scenes from in-the-wild videos.

Thesis Committee Members:
Kris Kitani, Chair
Deva Ramanan
Shubham Tulsiani
Angjoo Kanazawa, UC Berkeley
