Knowing Who to Listen to in Speech Recognition: Visually Guided Beamforming - Robotics Institute Carnegie Mellon University

U. Bub, M. Hunke, and Alex Waibel
Conference Paper, Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP '95), pp. 848–851, May 1995

Abstract

As speech recognition systems steadily improve in performance, freedom from head-sets and push-buttons to activate the recognizer has become one of the most important requirements for user acceptance. Microphone arrays and beamforming can deliver signals in which undesired jamming sources are suppressed, but they rely on knowing where the desired signal originates in space. This knowledge is usually obtained by identifying the loudest sound source. Knowing who is speaking, to whom, and from where should, however, depend not on loudness but on the communicative purpose. In this paper, we present acoustic and visual modules that track the face of a speaker of interest for sound source localization and apply beamforming for signal extraction. We show that in noisy environments, visual tracking delivers more accurate spatial localization than acoustic localization. Given a reliable location finder, beamforming substantially improves recognition accuracy.
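The paper does not give implementation details here, but the core idea — steering a microphone array toward a location supplied by an external (e.g. visual) localizer — can be illustrated with a generic far-field delay-and-sum beamformer. The sketch below is an assumption-laden illustration, not the authors' method: all function names, the linear array geometry, and the integer-sample delay approximation are choices made for clarity.

```python
# Illustrative sketch (not from the paper): far-field delay-and-sum
# beamforming for a linear microphone array, steered toward a direction
# that an external localizer (e.g. a face tracker) has provided.
import numpy as np

def delay_and_sum(signals, mic_positions, direction_deg, fs, c=343.0):
    """Align and average microphone signals for a far-field source.

    signals:       (n_mics, n_samples) array of time-domain signals
    mic_positions: (n_mics,) mic x-coordinates in meters (linear array)
    direction_deg: source angle in degrees (0 = broadside)
    fs:            sampling rate in Hz
    c:             speed of sound in m/s
    """
    theta = np.deg2rad(direction_deg)
    # Per-mic propagation delay relative to the array origin (seconds).
    delays = mic_positions * np.sin(theta) / c
    # Integer-sample shifts; fractional delays are ignored in this sketch.
    shifts = np.round(delays * fs).astype(int)
    shifts -= shifts.min()  # make all shifts non-negative
    n_mics, n_samples = signals.shape
    out = np.zeros(n_samples)
    for m in range(n_mics):
        # Advance each channel so the desired wavefront lines up, then sum.
        out += np.roll(signals[m], -shifts[m])
    return out / n_mics

# Toy check: a source at broadside (0 deg) reaches all mics at the same
# time, so steering to 0 deg should reproduce the original signal.
fs = 16000
t = np.arange(1024) / fs
sig = np.sin(2 * np.pi * 440 * t)
array = np.stack([sig] * 4)               # 4 identical mic channels
mics = np.array([0.0, 0.05, 0.10, 0.15])  # 5 cm spacing
enhanced = delay_and_sum(array, mics, 0.0, fs)
```

In this scheme the visual module only has to supply `direction_deg`; signals arriving from other directions are summed out of phase and attenuated, which is the suppression of "jamming" sources the abstract refers to.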

BibTeX

@conference{Bub-1995-16164,
author = {U. Bub and M. Hunke and Alex Waibel},
title = {Knowing Who to Listen to in Speech Recognition: Visually Guided Beamforming},
booktitle = {Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP '95)},
year = {1995},
month = {May},
pages = {848--851},
}