Audio-Visual State-Aware Representation Learning from Interaction-Rich, Egocentric Videos
Abstract
We propose a self-supervised algorithm to learn representations from egocentric video data using the video and audio modalities. In robotics and augmented reality, the input to the agent is a long stream of video from the first-person, or egocentric, point of view. Accordingly, there have recently been significant efforts to capture humans from their first-person/egocentric view as they interact with their own environments and go about their daily activities. As a result, several large-scale egocentric, interaction-rich, multi-modal datasets have emerged. However, learning representations from such videos can be quite challenging.
First, given the uncurated nature of long, untrimmed, continuous videos, learning effective representations requires focusing on the moments in time when interactions take place; a real-world video contains many non-activity segments that are not conducive to learning. Second, visual representations of daily activities should be sensitive to changes in the state of objects and the environment. In other words, the representations should be state-aware. However, current successful multi-modal learning frameworks encourage representations that are invariant to time and object states.
To address these challenges, we leverage audio signals to identify moments of likely interaction, which are conducive to better learning. Motivated by the observation that interactions are often accompanied by a sharp audio signal, we also propose a novel self-supervised objective that learns from audible state changes caused by interactions. We validate these contributions extensively on two large-scale egocentric datasets, EPIC-Kitchens-100 and Ego4D, and show improvements on several downstream tasks, including action recognition, long-term action anticipation, object state change classification, and point-of-no-return temporal localization.
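To make the idea of audio-driven moment selection concrete, below is a minimal, hedged sketch of one way such a step could look: frames where the short-time audio energy jumps sharply are treated as candidate interaction timestamps around which video clips could be sampled for training. This is not the thesis's actual method; the function names, window/hop sizes, and threshold are all illustrative assumptions.

```python
# Illustrative sketch only (not the thesis implementation): pick likely
# interaction moments from the audio track by thresholding sharp rises in
# short-time audio energy. All names and parameters here are hypothetical.
import numpy as np

def frame_energy(audio: np.ndarray, window: int = 1024, hop: int = 512) -> np.ndarray:
    """Short-time log-energy of a mono waveform."""
    n_frames = 1 + max(0, (len(audio) - window) // hop)
    frames = np.stack([audio[i * hop: i * hop + window] for i in range(n_frames)])
    return np.log(np.sum(frames ** 2, axis=1) + 1e-8)

def interaction_peaks(energy: np.ndarray, hop: int, sr: int, thresh: float = 2.0) -> np.ndarray:
    """Timestamps (seconds) where the frame-to-frame energy jump exceeds `thresh` std. deviations."""
    flux = np.diff(energy, prepend=energy[0])            # frame-to-frame energy change
    score = (flux - flux.mean()) / (flux.std() + 1e-8)   # standardize the change signal
    peak_frames = np.where(score > thresh)[0]            # sharp rises = candidate interactions
    return peak_frames * hop / sr                        # frame index -> seconds

# Usage: given a 16 kHz mono waveform, sample short video clips centered on the peaks.
sr = 16000
audio = np.random.randn(sr * 60)          # stand-in for one minute of audio
timestamps = interaction_peaks(frame_energy(audio), hop=512, sr=sr)
# Each timestamp t would then index a clip around t (e.g., [t - 0.5 s, t + 0.5 s]) for learning.
```

A state-change objective in the same spirit could then contrast visual features sampled before and after such an audible change, though the specific loss used in the thesis is not described in this abstract.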
BibTeX
@mastersthesis{Mittal-2023-135788,
  author   = {Himangi Mittal},
  title    = {Audio-Visual State-Aware Representation Learning from Interaction-Rich, Egocentric Videos},
  year     = {2023},
  month    = {April},
  school   = {Carnegie Mellon University},
  address  = {Pittsburgh, PA},
  number   = {CMU-RI-TR-23-09},
  keywords = {Video representation learning, self-supervised learning, contrastive learning, audio-visual learning, multi-modal machine learning, egocentric videos, Ego4D, EPIC-Kitchens},
}