AlignNet: A Unifying Approach to Audio-Visual Alignment
Conference Paper, Proceedings of IEEE Winter Conference on Applications of Computer Vision (WACV '20), pp. 3298-3306, March 2020
Abstract
We present AlignNet, a model that synchronizes videos with reference audio under non-uniform and irregular misalignments. AlignNet learns an end-to-end dense correspondence between each frame of a video and the audio. Our method is designed according to simple and well-established principles: attention, pyramidal processing, warping, and an affinity function. Together with the model, we release a dancing dataset, Dance50, for training and evaluation. Qualitative, quantitative, and subjective evaluation results on dance-music alignment and speech-lip alignment demonstrate that our method far outperforms the state-of-the-art methods. Code, dataset, and sample videos are available at our project page.
BibTeX
@conference{Fang-2020-126838,
author = {Zhaoyuan Fang and Jianren Wang and Hang Zhao},
title = {AlignNet: A Unifying Approach to Audio-Visual Alignment},
booktitle = {Proceedings of IEEE Winter Conference on Applications of Computer Vision (WACV '20)},
year = {2020},
month = {March},
pages = {3298-3306},
}
Copyright notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.