Knowing Who to Listen to in Speech Recognition: Visually Guided Beamforming - Robotics Institute Carnegie Mellon University

U. Bub, M. Hunke, and Alex Waibel
Conference Paper, Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP '95), pp. 848–851, May 1995

Abstract

As speech recognition systems steadily improve in performance, freedom from head-sets and push-buttons to activate the recognizer has become one of the most important requirements for user acceptance. Microphone arrays and beamforming can deliver signals in which undesired jamming sources are suppressed, but they rely on knowing where the desired signal originates in space. This knowledge is usually obtained by identifying the loudest sound source. Knowing who is speaking, to whom, and from where should, however, depend not on loudness but on the communicative purpose. In this paper, we present acoustic and visual modules that track the face of a speaker of interest for sound source localization and apply beamforming for signal extraction. We show that in noisy environments, visual tracking delivers more accurate spatial localization than acoustic localization. Given a reliable location finder, beamforming substantially improves recognition accuracy.
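The paper does not give implementation details here, but the core idea — steering a microphone array toward a location supplied by an external (e.g. visual) localizer — can be illustrated with a generic far-field delay-and-sum beamformer. The sketch below is an assumption-laden illustration, not the authors' method: all function names, the linear array geometry, and the integer-sample delay approximation are choices made for clarity.

```python
# Illustrative sketch (not from the paper): far-field delay-and-sum
# beamforming for a linear microphone array, steered toward a direction
# that an external localizer (e.g. a face tracker) has provided.
import numpy as np

def delay_and_sum(signals, mic_positions, direction_deg, fs, c=343.0):
    """Align and average microphone signals for a far-field source.

    signals:       (n_mics, n_samples) array of time-domain signals
    mic_positions: (n_mics,) mic x-coordinates in meters (linear array)
    direction_deg: source angle in degrees (0 = broadside)
    fs:            sampling rate in Hz
    c:             speed of sound in m/s
    """
    theta = np.deg2rad(direction_deg)
    # Per-mic propagation delay relative to the array origin (seconds).
    delays = mic_positions * np.sin(theta) / c
    # Integer-sample shifts; fractional delays are ignored in this sketch.
    shifts = np.round(delays * fs).astype(int)
    shifts -= shifts.min()  # make all shifts non-negative
    n_mics, n_samples = signals.shape
    out = np.zeros(n_samples)
    for m in range(n_mics):
        # Advance each channel so the desired wavefront lines up, then sum.
        out += np.roll(signals[m], -shifts[m])
    return out / n_mics

# Toy check: a source at broadside (0 deg) reaches all mics at the same
# time, so steering to 0 deg should reproduce the original signal.
fs = 16000
t = np.arange(1024) / fs
sig = np.sin(2 * np.pi * 440 * t)
array = np.stack([sig] * 4)               # 4 identical mic channels
mics = np.array([0.0, 0.05, 0.10, 0.15])  # 5 cm spacing
enhanced = delay_and_sum(array, mics, 0.0, fs)
```

In this scheme the visual module only has to supply `direction_deg`; signals arriving from other directions are summed out of phase and attenuated, which is the suppression of "jamming" sources the abstract refers to.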

BibTeX

@conference{Bub-1995-16164,
author = {U. Bub and M. Hunke and Alex Waibel},
title = {Knowing Who to Listen to in Speech Recognition: Visually Guided Beamforming},
booktitle = {Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP '95)},
year = {1995},
month = {May},
pages = {848--851},
}