
12:00 pm to 1:30 pm
Rashid Auditorium 4401
Title: Learning World Simulators from Data
Abstract:
Modern foundation models have achieved superhuman performance on many logical and mathematical reasoning tasks by learning to think step by step. However, their ability to understand videos, and consequently to control embodied agents, lags behind. They often misrecognize simple activities and hallucinate when generating videos. This raises a fundamental question: what is the equivalent of thinking step by step for visual recognition and prediction?
In this talk, we argue that step-by-step visual reasoning has much in common with inverting a physics simulator: mapping raw video pixels back to a structured, 3D-like neural representation of the world. This involves inferring 3D neural representations of objects and parts, along with their 3D motion and appearance trajectories, and estimating camera motion, 3D scene structure, and physical properties. We will discuss methods that automatically extract such 3D neural representations from images and videos using generative model priors and end-to-end feed-forward models. We will also present methods that inject this knowledge of camera motion and 3D scene structure into modern VLMs, and show that it improves their ability to ground language and to control robot manipulators.
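To make the idea of simulator inversion concrete, here is a minimal sketch of a feed-forward network that maps a short video clip to a structured scene state (per-object 3D poses and per-frame camera motion). The architecture, the output parameterization, and the name SimulatorInversionNet are illustrative assumptions for exposition, not the speaker's actual models.

```python
import torch
import torch.nn as nn

class SimulatorInversionNet(nn.Module):
    """Hypothetical feed-forward 'inverse simulator': maps a video clip to a
    structured scene state. All design choices here are illustrative."""
    def __init__(self, num_objects=8, feat_dim=256):
        super().__init__()
        self.num_objects = num_objects
        # Per-frame visual encoder (a stand-in for a real video backbone).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Heads that decode structured state from per-frame features.
        self.object_poses = nn.Linear(feat_dim, num_objects * 7)  # xyz + quaternion per object
        self.camera_motion = nn.Linear(feat_dim, 6)               # se(3) twist per frame

    def forward(self, video):                       # video: (B, T, 3, H, W)
        B, T = video.shape[:2]
        feats = self.encoder(video.flatten(0, 1))   # (B*T, feat_dim)
        poses = self.object_poses(feats).view(B, T, self.num_objects, 7)
        camera = self.camera_motion(feats).view(B, T, 6)
        return {"object_poses": poses, "camera_motion": camera}

model = SimulatorInversionNet()
clip = torch.randn(2, 4, 3, 64, 64)                 # two clips of four 64x64 frames
state = model(clip)
print(state["object_poses"].shape, state["camera_motion"].shape)
```

The point of the sketch is only the input/output contract: raw pixels in, an explicit, simulator-like state out, which downstream modules (VLMs, policies) can consume.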
How can we scale up annotations for such simulator inversion? We will discuss methods that use generative models of language and vision to automate the development of 3D simulations in physics engines, as sketched below, as well as our efforts to develop faster and more general physics engines. Integrating physics engines with generative models aims to automate the replication of real physical environments inside the simulator, yielding more accurate and scalable world-simulation data for sim-to-real learning of 3D perception and action. We believe such real-to-sim and sim-to-real learning paradigms are promising for developing robots that can see and think accurately, step by step.
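As a rough illustration of the real-to-sim idea, the sketch below shows a generative model proposing a structured scene specification that a physics engine could then instantiate. The function names propose_scene_spec and build_simulation and the spec schema are hypothetical stand-ins, not an API described in the talk; a real pipeline would query a VLM/LLM and load the result into an actual simulator (e.g. via a URDF/MJCF builder).

```python
import json

def propose_scene_spec(scene_description: str) -> dict:
    """Stand-in for a VLM/LLM call that turns an observed scene into a
    structured simulation spec (assets, poses, physical parameters)."""
    return {
        "objects": [
            {"asset": "table", "position": [0.0, 0.0, 0.0], "mass_kg": 20.0},
            {"asset": "mug", "position": [0.1, 0.2, 0.75], "mass_kg": 0.3,
             "friction": 0.6},
        ],
        "gravity": [0.0, 0.0, -9.81],
    }

def build_simulation(spec: dict) -> list[str]:
    """Stand-in for instantiating the spec in a physics engine; here it
    only reports what would be created."""
    steps = [f"set gravity to {spec['gravity']}"]
    for obj in spec["objects"]:
        steps.append(f"spawn {obj['asset']} at {obj['position']} "
                     f"(mass {obj['mass_kg']} kg)")
    return steps

spec = propose_scene_spec("a mug on a wooden table")
print(json.dumps(spec, indent=2))
for step in build_simulation(spec):
    print(step)
```

Once such specs can be generated at scale, the simulator provides ground-truth 3D annotations (poses, camera trajectories, physics parameters) for free, which is the source of training data for the inversion models above.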

Bio:
Katerina Fragkiadaki is the JPMorgan Chase Associate Professor in the Machine Learning Department at Carnegie Mellon University. She received her undergraduate diploma in Electrical and Computer Engineering from the National Technical University of Athens, earned her Ph.D. from the University of Pennsylvania, and was subsequently a postdoctoral fellow at UC Berkeley and Google Research. Her work focuses on combining forms of common-sense reasoning, such as spatial understanding and 3D scene understanding, with deep visuomotor learning. The goal of her work is to enable few-shot learning and continual learning for perception, action, and language grounding. Her group develops computer vision methods for mobile agents: 2D and 3D visual parsing, 2D-to-3D perception, vision-language grounding, and learning of object dynamics and of navigation and manipulation policies. Pioneering innovations of her group's research include 2D-to-3D geometry-aware neural networks for 3D understanding from 2D video streams, analogy-forming networks for memory-augmented few-shot visual parsing, and language grounding in 2D and 3D scenes with bottom-up and top-down attention. Her work has been recognized with a best Ph.D. thesis award, an NSF CAREER award, an AFOSR Young Investigator award, a DARPA Young Investigator award, and faculty research awards from Google, TRI, Amazon, NVIDIA, UPMC, and Sony. She was a program chair for ICLR 2024.
About the Lecture: The Yata Memorial Lecture in Robotics is part of the School of Computer Science Distinguished Lecture Series. Teruko Yata was a postdoctoral fellow in the Robotics Institute from 2000 until her untimely death in 2002. After graduating from the University of Tsukuba, where she worked under the guidance of Prof. Yuta, she came to the United States. At Carnegie Mellon, she served as a postdoctoral fellow in the Robotics Institute for three years, under Chuck Thorpe. Teruko's accomplishments in the field of ultrasonic sensing were highly regarded and won her the Best Student Paper Award at the International Conference on Robotics and Automation in 1999. It was frequently noted, and we always remember, that "the quality of her work was exceeded only by her kindness and thoughtfulness as a friend." Join us in paying tribute to our extraordinary colleague and friend through this unique and always exciting lecture.