Telling Left from Right: Learning Spatial Correspondence Between Sight and Sound - Robotics Institute Carnegie Mellon University

VASC Seminar

Bryan Russell, Senior Research Scientist, Adobe Research
Monday, June 22
11:00 am to 12:00 pm
Telling Left from Right: Learning Spatial Correspondence Between Sight and Sound

Virtual VASC Seminar: https://cmu.zoom.us/j/92741882813?pwd=R1R0eGRaeXFHTEF2VWNwY2VIZmU5Zz09

Abstract: Self-supervised audio-visual learning aims to capture useful representations of video by leveraging correspondences between visual and audio inputs. Existing approaches have focused primarily on matching semantic information between the sensory streams. In my talk, I'll describe a novel self-supervised task that leverages an orthogonal principle: matching spatial information in the audio stream to the positions of sound sources in the visual stream. Our approach is simple yet effective. We train a model to determine whether the left and right audio channels have been flipped, forcing it to reason about spatial localization across the visual and audio streams. To train and evaluate our model, we introduce a large-scale video dataset with spatial audio, YouTube-ASMR-300K, comprising over 900 hours of footage. We demonstrate that understanding spatial correspondence enables models to perform better on three audio-visual tasks, achieving quantitative gains over supervised and self-supervised baselines that do not leverage spatial audio cues. We also show how to extend our self-supervised approach to 360-degree videos with ambisonic audio. This work is in collaboration with Karren Yang (MIT) and Justin Salamon (Adobe). Note: to fully appreciate the results, please wear wired headphones during the talk.
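
To make the pretext task concrete, here is a minimal PyTorch sketch of the idea described above: swap the left and right audio channels for a random half of each batch and train a small audio-visual network to predict which clips were flipped. This is an illustrative sketch, not the authors' implementation; the tiny convolutional streams, the tensor shapes, and the make_flip_batch helper are assumptions made for the example.

```python
import torch
import torch.nn as nn


class FlipPretextModel(nn.Module):
    """Toy two-stream network: fuses a visual feature and a stereo-audio
    feature and predicts whether the left/right channels were swapped.
    (Placeholder architecture, not the one used in the paper.)"""

    def __init__(self, feat_dim=128):
        super().__init__()
        # Visual stream: video clip (B, 3, T, H, W) -> global feature.
        self.visual = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(32, feat_dim),
        )
        # Audio stream: stereo waveform (B, 2, S) -> global feature.
        self.audio = nn.Sequential(
            nn.Conv1d(2, 32, kernel_size=9, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, feat_dim),
        )
        # Binary classifier over the fused features: flipped or not.
        self.classifier = nn.Linear(2 * feat_dim, 1)

    def forward(self, frames, waveform):
        fused = torch.cat([self.visual(frames), self.audio(waveform)], dim=1)
        return self.classifier(fused).squeeze(1)  # logits


def make_flip_batch(frames, waveform):
    """Create self-supervised labels: swap the two audio channels for a
    random half of the batch and record which clips were flipped."""
    flipped = torch.rand(waveform.size(0)) < 0.5
    waveform = waveform.clone()
    waveform[flipped] = waveform[flipped].flip(dims=[1])  # swap L and R
    return frames, waveform, flipped.float()


if __name__ == "__main__":
    model = FlipPretextModel()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.BCEWithLogitsLoss()

    # Random tensors stand in for real video clips with spatial audio.
    frames = torch.randn(4, 3, 8, 64, 64)   # four 8-frame RGB clips
    waveform = torch.randn(4, 2, 16000)     # 1 s of stereo audio at 16 kHz

    frames, waveform, labels = make_flip_batch(frames, waveform)
    optimizer.zero_grad()
    loss = loss_fn(model(frames, waveform), labels)
    loss.backward()
    optimizer.step()
    print(f"pretext loss: {loss.item():.3f}")
```

The labels come for free from the flipping step itself, which is what makes the task self-supervised; no human annotation of sound-source positions is required.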

Bio: Bryan Russell is a Senior Research Scientist at Adobe Research, where he focuses on problems in video and 3D understanding.

Homepage: http://bryanrussell.org/

Sponsored in part by: Facebook Reality Labs Pittsburgh