AlignNet: A Unifying Approach to Audio-Visual Alignment

Zhaoyuan Fang, Jianren Wang, and Hang Zhao
Conference Paper, Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV '20), pp. 3298–3306, March 2020

Abstract

We present AlignNet, a model that synchronizes videos with reference audios under non-uniform and irregular misalignments. AlignNet learns end-to-end the dense correspondence between each frame of a video and an audio. Our method is designed according to simple and well-established principles: attention, pyramidal processing, warping, and affinity function. Together with the model, we release a dancing dataset, Dance50, for training and evaluation. Qualitative, quantitative and subjective evaluation results on dance-music alignment and speech-lip alignment demonstrate that our method far outperforms the state-of-the-art methods. Code, dataset and sample videos are available at our project page.
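The abstract names four building blocks: attention, pyramidal processing, warping, and an affinity function. The minimal sketch below illustrates only the last two on toy tensors: a cosine affinity between per-frame video and audio embeddings, a soft-argmax dense correspondence, and a linear warp of the audio onto the video timeline. All function names, tensor shapes, and the soft-argmax step are illustrative assumptions made here for exposition, not the authors' released implementation (see the project page for that).

# Hedged sketch of the affinity-and-warping idea from the abstract (PyTorch).
# Shapes, names, and the soft-argmax correspondence are assumptions, not
# AlignNet's actual code.
import torch
import torch.nn.functional as F

def affinity(video_feats, audio_feats):
    # Cosine-similarity affinity between each video frame and each audio frame.
    # video_feats: (T_v, D) per-frame video embeddings
    # audio_feats: (T_a, D) per-frame audio embeddings
    # returns:     (T_v, T_a) affinity matrix
    v = F.normalize(video_feats, dim=-1)
    a = F.normalize(audio_feats, dim=-1)
    return v @ a.t()

def dense_correspondence(aff, temperature=0.1):
    # Soft-argmax over audio frames: expected (fractional) audio index
    # for each video frame.
    weights = F.softmax(aff / temperature, dim=-1)    # (T_v, T_a)
    idx = torch.arange(aff.size(1), dtype=aff.dtype)  # (T_a,)
    return weights @ idx                              # (T_v,)

def warp_audio(audio_feats, correspondence):
    # Linearly interpolate audio features at the predicted fractional indices,
    # producing one audio feature per video frame.
    t_a = audio_feats.size(0)
    lo = correspondence.floor().long().clamp(0, t_a - 1)
    hi = (lo + 1).clamp(0, t_a - 1)
    frac = (correspondence - lo.float()).unsqueeze(-1)
    return (1 - frac) * audio_feats[lo] + frac * audio_feats[hi]

# Toy usage: 30 video frames, 40 audio frames, 128-dim features.
video = torch.randn(30, 128)
audio = torch.randn(40, 128)
corr = dense_correspondence(affinity(video, audio))
aligned_audio = warp_audio(audio, corr)  # (30, 128)

In the pyramidal (coarse-to-fine) setting the paper describes, one would presumably estimate such a correspondence at a low temporal resolution first and refine it at progressively higher resolutions; that refinement loop is omitted here.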

BibTeX

@conference{Fang-2020-126838,
author = {Zhaoyuan Fang and Jianren Wang and Hang Zhao},
title = {AlignNet: A Unifying Approach to Audio-Visual Alignment},
booktitle = {Proceedings of IEEE Winter Conference on Applications of Computer Vision (WACV '20)},
year = {2020},
month = {March},
pages = {3298--3306},
}