
PhD Thesis Proposal

Emily Kim, PhD Student, Robotics Institute, Carnegie Mellon University
Thursday, December 5
3:00 pm to 4:30 pm
NSH 3305
Efficient Synthetic Data Generation and Utilization for Action Recognition and Universal Avatar Generation

Abstract:
Human-centered computer vision technology relies heavily on large, diverse datasets, but collecting data from human subjects is time-consuming and labor-intensive, and it raises privacy concerns. To address these challenges, researchers are increasingly using synthetic data to augment real-world datasets.

This thesis explores efficient methods for generating and utilizing synthetic data to train human-centered computer vision models. Synthetic data plays two roles: it narrows a model's focus to a specific domain, and it expands large-scale datasets so that models become more robust for general tasks. We apply these methods to two applications: activity recognition (classifying human actions from sequences of frames) and 3D avatar generation (creating 3D avatars from a few images of a subject).

In the first part of the thesis, we introduce REMAG, a dataset suite that includes both real and synthetic data across eleven activity classes, captured from ground and drone cameras (Chapter 1). The synthetic data is generated with four methods that combine traditional computer graphics (CG) or neural rendering with marker-based motion capture or 2D video-tracked motions. Our experiments show that fine-tuning a model pre-trained on large-scale data in two steps, first on high-quality synthetic data and then on a small amount of real data, can achieve performance comparable to or better than training on a larger real dataset. Building on this work, in Section 4.1 we propose a two-step fine-tuning pipeline for activity recognition that requires only a small real training set, as sketched below. In the first step, we regenerate synthetic data from the small real training set at each training epoch, diversifying the visual domain while preserving motion integrity; in the second step, we fine-tune the model on the original small real set to optimize performance in the target domain. With these experiments, we hope to demonstrate that fine-tuning with only a small amount of real training data can substantially improve model performance.
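To make the two-step schedule concrete, the following is a minimal PyTorch sketch. It is an illustration under stated assumptions, not the thesis implementation: the helper generate_synthetic (standing in for the synthetic-data generator described above) and all hyperparameters are hypothetical.

import torch
from torch.utils.data import DataLoader

def two_step_finetune(model, small_real_set, generate_synthetic,
                      epochs_syn=10, epochs_real=5, lr=1e-4, device="cuda"):
    # Illustrative sketch only; generate_synthetic and all hyperparameters
    # are hypothetical stand-ins for the components named in the abstract.
    model = model.to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()

    # Step 1: fine-tune on synthetic clips regenerated from the small real
    # set at each epoch, diversifying appearance while preserving motion.
    for _ in range(epochs_syn):
        syn_set = generate_synthetic(small_real_set)  # new visual domain each epoch
        for clips, labels in DataLoader(syn_set, batch_size=8, shuffle=True):
            opt.zero_grad()
            loss = loss_fn(model(clips.to(device)), labels.to(device))
            loss.backward()
            opt.step()

    # Step 2: fine-tune on the original small real set to adapt the model
    # to the target domain.
    for _ in range(epochs_real):
        for clips, labels in DataLoader(small_real_set, batch_size=8, shuffle=True):
            opt.zero_grad()
            loss = loss_fn(model(clips.to(device)), labels.to(device))
            loss.backward()
            opt.step()
    return model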

In the second part of the thesis, we focus on training a 3D avatar generation model called Universal Avatars (UA) on large-scale multi-view datasets of people, such as Ava-256 and Goliath-4 (Chapter 2). While these datasets allow the model to generalize across identities and expressions without distorting the original identity, we find that more diverse identity appearances are needed before the model can generate an avatar for an arbitrary identity. To address this, Section 4.2 introduces a method for expanding existing real datasets by generating controllable, geometrically consistent, multi-view, realistic synthetic human portraits. The method combines StyleGAN, ControlNet, and a reference image encoder through a simple convolutional mapping network, sketched below. If successful, it will enable the generation of large, diverse datasets, advancing the UA model's capacity for universal representation.
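The abstract names the components but not their wiring; the following minimal PyTorch sketch shows one plausible reading, assuming the convolutional mapping network projects reference-encoder features into a 512-dimensional StyleGAN-style latent code. All layer sizes and names are assumptions for illustration, not the thesis design.

import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    # Hypothetical sketch: projects reference-image encoder features
    # of shape (B, C, H, W) to a StyleGAN-style latent code (B, latent_dim).
    def __init__(self, in_channels=256, latent_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(256, 512, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1),  # global pooling to one spatial cell
        )
        self.fc = nn.Linear(512, latent_dim)

    def forward(self, ref_features):
        x = self.conv(ref_features).flatten(1)  # (B, 512)
        return self.fc(x)                       # (B, latent_dim)

In such an arrangement, the latent would drive a StyleGAN generator while ControlNet-style conditioning (e.g., pose or depth maps) constrains geometry across views; the exact wiring is not specified in the abstract.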

Thesis Committee Members:
Jessica Hodgins, Chair
Fernando de la Torre
Jun-Yan Zhu
Julieta Martinez, Meta Reality Labs
