
Discovering Audiovisual Ontology

Master's Thesis, Tech. Report CMU-RI-TR-21-52, Robotics Institute, Carnegie Mellon University, August 2021

Abstract

The wail of an ambulance siren and its flashing lights, the hum of an accelerating car:
important events often come to us simultaneously through sight and sound. In this work,
we consider the problem of identifying these events from raw, unlabeled audio-visual data
of autonomous agents interacting with urban environments.
Our goal is to discover a taxonomy of multimodal events of which autonomous agents should be aware.
Our underlying thesis is that multimodal events such as emergency-vehicle sirens, honks from interacting actors,
and backup beepers from large trucks should all be added to current perception ontologies
(which tend to be dominated by visually driven categories rather than multimodal ones).
We show that this discovery task can be formulated as a multimodal self-supervised learning problem, in which we train a network to predict
whether paired visual and audio streams are in correspondence.
We demonstrate our technique on a dataset containing hundreds of hours of in-the-wild
urban walking videos. In comparisons with baseline methods, we show that the
resulting model discovers a significantly larger number of "actionable" events that affect behavior,
such as nearby ambulances with sirens blaring and lights flashing.
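
The correspondence objective described above follows the general audio-visual correspondence (AVC) recipe: a binary classifier over (frame, audio) pairs, with temporally aligned pairs as positives and mismatched pairs as negatives. Below is a minimal sketch of that objective; the encoder architectures, input shapes (64x64 RGB frames and 64x64 log-mel spectrograms), and the within-batch shuffling used to create negatives are illustrative assumptions, not the exact setup used in the thesis.

# Minimal sketch of audio-visual correspondence (AVC) self-supervision.
# Architectures, input shapes, and negative sampling are illustrative assumptions.
import torch
import torch.nn as nn

class VisualEncoder(nn.Module):
    """Tiny CNN over RGB frames (assumed 3x64x64)."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )

    def forward(self, x):
        return self.net(x)

class AudioEncoder(nn.Module):
    """Tiny CNN over log-mel spectrograms (assumed 1x64x64)."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )

    def forward(self, x):
        return self.net(x)

class AVCorrespondence(nn.Module):
    """Predicts whether a (frame, audio) pair comes from the same moment."""
    def __init__(self, dim=128):
        super().__init__()
        self.visual = VisualEncoder(dim)
        self.audio = AudioEncoder(dim)
        self.head = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, frames, spectrograms):
        z = torch.cat([self.visual(frames), self.audio(spectrograms)], dim=1)
        return self.head(z).squeeze(1)  # one correspondence logit per pair

def avc_step(model, frames, spectrograms):
    """One training step: aligned pairs are positives; audio shuffled
    within the batch provides mismatched negatives."""
    neg_audio = spectrograms[torch.randperm(spectrograms.size(0))]
    logits = torch.cat([model(frames, spectrograms), model(frames, neg_audio)])
    labels = torch.cat([torch.ones(frames.size(0)), torch.zeros(frames.size(0))])
    return nn.functional.binary_cross_entropy_with_logits(logits, labels)

if __name__ == "__main__":
    model = AVCorrespondence()
    frames = torch.randn(8, 3, 64, 64)        # batch of video frames
    spectrograms = torch.randn(8, 1, 64, 64)  # temporally aligned audio clips
    loss = avc_step(model, frames, spectrograms)
    loss.backward()
    print(f"correspondence loss: {loss.item():.3f}")

Once trained, a correspondence model of this kind can be used to surface candidate multimodal events, for example by ranking or clustering clips whose audio and video agree strongly; the thesis's actual discovery procedure may differ in its details.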

BibTeX

@mastersthesis{Wang-2021-129239,
author = {Haochen Wang},
title = {Discovering Audiovisual Ontology},
year = {2021},
month = {August},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-21-52},
}