
PhD Thesis Defense

Rohit Girdhar
Robotics Institute, Carnegie Mellon University

Friday, June 7
12:00 pm to 1:00 pm
NSH 3305
Spatiotemporal Understanding of People Using Scenes, Objects, and Poses

Abstract:
Humans are arguably among the most important entities that AI systems need to understand in order to be useful and ubiquitous. From autonomous cars observing pedestrians to assistive robots helping the elderly, a large part of this understanding centers on recognizing human actions and, potentially, intentions. Humans themselves are quite good at this task: we can look at a person and explain in great detail every action they are doing. Moreover, we can reason about those actions over time, and even predict what actions they may intend to do in the future. Computer vision algorithms, on the other hand, have lagged far behind on this task.

In this thesis, we explore techniques to improve human action understanding from a visual input. Our key insight is that human actions are dependent on their own state (parameterized by their pose), as well as the state of their environment (parameterized by the scene, objects and other people in it). We exploit this dependence in three key ways: (1) Predicting a prior on human actions using affordances of the scenes and objects they interact with; (2) Attending to the person and their surroundings when classifying their actions; and (3) Building systems capable of learning from or aggregating this contextual knowledge over space and time to recognize actions. We further extend these methods to recognize actions in complex multi-person videos, where multiple people are performing multiple different actions at any given time.
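
To make idea (2) above concrete, the following is a minimal sketch of one way person-to-context attention could be wired up: a person's feature acts as a query over spatiotemporal scene features, and the attended context is combined with the person's own state for classification. The module name, dimensions, and feature-extraction assumptions here are illustrative, not the architecture proposed in the thesis.

```python
import torch
import torch.nn as nn

class PersonContextAttention(nn.Module):
    """Illustrative sketch: classify a person's action by attending from
    their feature (e.g. pose- or box-derived) to features of the
    surrounding scene. Names and shapes are assumptions for this sketch."""

    def __init__(self, feat_dim=256, num_actions=80):
        super().__init__()
        self.query = nn.Linear(feat_dim, feat_dim)  # person feature -> query
        self.key = nn.Linear(feat_dim, feat_dim)    # context features -> keys
        self.value = nn.Linear(feat_dim, feat_dim)  # context features -> values
        self.classifier = nn.Linear(2 * feat_dim, num_actions)

    def forward(self, person_feat, context_feats):
        # person_feat:   (B, D)    feature pooled over one person's box
        # context_feats: (B, N, D) flattened features over space and time
        q = self.query(person_feat).unsqueeze(1)              # (B, 1, D)
        k = self.key(context_feats)                           # (B, N, D)
        v = self.value(context_feats)                         # (B, N, D)
        # Scaled dot-product attention of the person over the scene.
        attn = torch.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
        context = (attn @ v).squeeze(1)                       # (B, D)
        # Classify from the person's own state plus attended surroundings.
        return self.classifier(torch.cat([person_feat, context], dim=-1))
```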

However, these methods still mostly operate at short time scales, whereas recognizing human intentions would require reasoning over long temporal horizons. One reason for the limited progress in this direction is the lack of vision benchmarks that actually require such reasoning: most video action classification problems are solved fairly well by the methods above from just a few frames. To remedy this, we propose a new benchmark dataset and tasks that, by design, require reasoning over time to be solved. We believe this is a first step towards building truly intelligent video understanding systems.

Thesis Committee Members:
Deva Ramanan, Chair
Abhinav Gupta
Martial Hebert
Andrew Zisserman, University of Oxford
Jitendra Malik, University of California, Berkeley