Abstract:
From infancy, humans acquire motor skills, behavioral priors, and objectives by learning from their caregivers. Similarly, as we create humanoids in our own image, we aspire for them to learn from us and develop universal physical and cognitive capabilities comparable to, or even surpassing, our own. In this thesis, we explore how to equip humanoids with the mobility, dexterity, and environmental awareness necessary to perform meaningful tasks. Unlike previous efforts that focus on learning a narrow set of tasks—such as traversing terrains, imitating a few human motion clips, or playing a single game—we emphasize scaling humanoid control tasks by leveraging large-scale human data (e.g., motion, videos). We show that scaling brings numerous benefits, gradually moving us closer to achieving a truly “universal” capability.
We begin by scaling the reinforcement learning-based motion imitation framework, enabling humanoids (without hands) to imitate large-scale human motion data rather than small, curated datasets. This motion imitator forms the foundation for developing motor skills: given a kinematic reference motion, the learned imitator can control the humanoid to perform both daily activities and more complex motions, such as acrobatics and martial arts. In essence, the imitator provides diverse motion control on demand. We verify this framework both in simulation and on a real-world, full-sized humanoid.
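To make the imitation objective concrete, the sketch below shows the kind of exponential pose-tracking reward commonly used in reinforcement learning-based motion imitation (in the style of DeepMimic and its successors). The function and weight values are illustrative assumptions, not the exact formulation used in this thesis.

```python
# Minimal sketch of a DeepMimic-style imitation reward: the simulated
# humanoid is rewarded for matching the kinematic reference motion.
# Names and weights are illustrative, not the thesis implementation.
import numpy as np

def imitation_reward(sim_joint_pos, ref_joint_pos,
                     sim_joint_vel, ref_joint_vel,
                     w_pos=0.6, w_vel=0.4, k_pos=2.0, k_vel=0.1):
    """Exponential tracking reward: close to 1 when the simulated joints
    match the reference pose and velocity, decaying toward 0 otherwise."""
    pos_err = np.sum((sim_joint_pos - ref_joint_pos) ** 2)
    vel_err = np.sum((sim_joint_vel - ref_joint_vel) ** 2)
    return w_pos * np.exp(-k_pos * pos_err) + w_vel * np.exp(-k_vel * vel_err)

# Example: a hypothetical 23-joint humanoid tracking one reference frame.
rng = np.random.default_rng(0)
sim_q, ref_q = rng.normal(size=23), rng.normal(size=23)
sim_dq, ref_dq = rng.normal(size=23), rng.normal(size=23)
print(imitation_reward(sim_q, ref_q, sim_dq, ref_dq))
```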
While the imitator provides low-level motor skills, completing meaningful tasks requires higher-level guidance to generate kinematic motion goals. In the second part of this thesis, we explore how to equip humanoids with sensing and behavioral priors. One approach is to freeze the imitator as a low-level controller and train a high-level ‘kinematic policy’ to compute poses as input for the imitator. We propose using this ‘kinematic policy’ for first-person and third-person pose estimation from video input. Additionally, motor skills can be distilled into a new policy for direct end-to-end control; for example, we demonstrate controlling a simulated avatar using sensor input from a head-mounted device. The imitator can also be distilled into a universally applicable latent motion representation, which serves as a prior for downstream tasks (e.g., pedestrian animation, VR controller tracking, and motion generation), providing both motor-skill and behavioral guidance. With this latent motion representation, we can equip the humanoid with dexterity, enabling it to grasp diverse objects and follow complex trajectories.
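The following sketch illustrates the hierarchical setup described above: a frozen low-level imitator consumes kinematic pose targets, while a trainable high-level ‘kinematic policy’ maps task observations (e.g., encoded video frames or headset sensor readings) to those targets. All class and variable names are illustrative assumptions; the real system uses neural networks trained in simulation.

```python
# Minimal sketch of hierarchical control with a frozen low-level imitator
# and a trainable high-level kinematic policy. Names are illustrative.
import numpy as np

class FrozenImitator:
    """Stand-in for the pretrained low-level motion imitator.
    In practice this is a neural network whose weights are frozen."""
    def __init__(self, n_joints=23, n_actions=23, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(n_actions, 2 * n_joints))

    def act(self, proprioception, ref_pose):
        # Map (current joint state, kinematic goal) -> joint-level action
        # (e.g., PD targets for the humanoid's actuators).
        x = np.concatenate([proprioception, ref_pose])
        return np.tanh(self.W @ x)

class KinematicPolicy:
    """Trainable high-level policy that outputs kinematic pose goals
    from task observations (video features, headset sensors, ...)."""
    def __init__(self, obs_dim=64, n_joints=23, seed=1):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(n_joints, obs_dim))

    def predict_pose(self, task_obs):
        return self.W @ task_obs

# One control step of the hierarchy.
imitator, kin_policy = FrozenImitator(), KinematicPolicy()
task_obs = np.zeros(64)                        # e.g., encoded video frame
proprio = np.zeros(23)                         # current joint state
ref_pose = kin_policy.predict_pose(task_obs)   # high level: where to be
action = imitator.act(proprio, ref_pose)       # low level: how to move
```

Distillation replaces this two-stage pipeline with a single end-to-end policy (or a latent motion representation) trained to reproduce the behavior of the frozen imitator, which is what enables direct control from headset input and reuse across downstream tasks.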
We propose two new directions. First, we aim to combine mobility, dexterity, and perception to learn visual-dexterous control policies for humanoids. Second, we plan to transfer these capabilities to real humanoids.
Thesis Committee Members:
Kris Kitani, Chair
Guanya Shi
Shubham Tulsiani
Xue Bin Peng, Simon Fraser University, NVIDIA
Karen Liu, Stanford University