Transcribing Multilingual Broadcast News Using Hypothesis Driven Lexical Adaptation
This paper describes first results of our DARPA-sponsored efforts toward recognizing and browsing foreign-language broadcast news, specifically Serbo-Croatian. For Serbo-Croatian, as for many languages other than the most common, well-studied ones, broadcast-quality recognition is complicated by (1) the lack of available acoustic and language data and (2) the excessive vocabulary growth of heavily inflected languages, which leads to unacceptable out-of-vocabulary (OOV) rates. We present a Serbo-Croatian large-vocabulary system that achieves a 74% recognition rate despite limited training data. The system achieves this rate through a multipass strategy that dynamically adapts the recognition dictionary to the speech segment being recognized by generating morphological variations (Hypothesis Driven Lexical Adaptation, HDLA). We outline the bootstrapping and training process of the broadcast news recognition engine, which is based on the Janus Recognition Toolkit (JanusRTk): data collection, segmentation and labeling of the data according to different acoustic conditions, dictionary design, language modeling, and training. The HDLA approach has been tested on both Serbo-Croatian and German news data and achieved considerable recognition improvements. OOV rates were reduced by 35-45%; on the Serbo-Croatian broadcast news data, the OOV rate fell from 8.7% to 4.8%, decreasing the word error rate from 29.5% to 26%.
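The Serbo-Croatian figures quoted in the abstract can be related to the stated 35-45% range with a short calculation. The sketch below is only an arithmetic illustration; the helper name `relative_reduction` is hypothetical and not part of the paper or of JanusRTk.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


# Figures taken from the abstract (Serbo-Croatian broadcast news data).
oov = relative_reduction(8.7, 4.8)    # OOV rate: 8.7% -> 4.8%
wer = relative_reduction(29.5, 26.0)  # word error rate: 29.5% -> 26%

print(f"OOV rate: {oov:.1f}% relative reduction")
print(f"Word error rate: {wer:.1f}% relative reduction")
```

Under this reading, the Serbo-Croatian OOV reduction of roughly 44.8% sits at the upper end of the reported 35-45% range, while the word error rate improves by roughly 11.9% relative.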
@workshop{Geutner-1998-16545,author = {Petra Geutner and Peter Scheytt and Alex Waibel and Howard Wactlar},
title = {Transcribing Multilingual Broadcast News Using Hypothesis Driven Lexical Adaptation},
booktitle = {Proceedings of DARPA Broadcast News Transcription and Understanding Workshop (BNTUW '98)},
year = {1998},
month = {February},