Characteristic Prosody and Meaning in Speech Synthesis - Robotics Institute Carnegie Mellon University

PhD Thesis Proposal

Kevin A. Lenzo Carnegie Mellon University
Thursday, January 16
10:00 am to 12:00 pm
Characteristic Prosody and Meaning in Speech Synthesis

Event Location: NSH 3305

Abstract: A new trainable model and method for generating prosody in speech synthesis is proposed, in which the relationship between pragmatics, intonational phonology, and performance is made explicit in a language-neutral manner. The literature on pragmatics and intonation is broad, but it produces only coarse descriptions and few actionable models (Steedman; Pierrehumbert and Hirschberg), and it accounts for neither speaker characteristics nor fine-grained performance. The literature of laboratory phonology demonstrates tones and types but does not provide a generative relationship from meaning (e.g. ToBI; Pierrehumbert and Beckman), and it points more to variation of individual events in performance than to production. Performance models, such as Tilt and RFC (Taylor), the Fujisaki model (Fujisaki), and Jinton (van Santen), provide no relationship back to meaning.

The novel approach here is a trainable, two-stage model that relates the performance of a speaker both to the intonational events and to their assignment via a set of pragmatic tags or rhetorical functions. Function approximators are trained to model ToBI-like events and their realization in fundamental frequency (F0, or intonation), energy, and duration; these are then related to a second set of operators, which assign constellations of events from the pragmatic mark-up. The effects modeled at the pragmatic level have different realizations in different languages but use the same high-level tags. The benefit is that authors can mark up text for meaning, rather than resorting to low-level commands, to create expressive synthesis in the style of a given speaker.
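The two-stage structure can be illustrated with a minimal Python sketch. It is not the proposed system: the tag names, the tag-to-event table, the `Speaker` parameters, and the linear F0 mapping are all hypothetical placeholders standing in for the trained components, chosen only to show how a pragmatic tag might pass through intonational events to F0 targets.

```python
from dataclasses import dataclass


@dataclass
class Event:
    label: str       # ToBI-like label, e.g. "H*" or "L-L%"
    position: float  # relative position in the phrase, 0.0 to 1.0


@dataclass
class Speaker:
    base_f0: float   # hypothetical speaker baseline, in Hz
    f0_range: float  # hypothetical span from low to high targets, in Hz


# Stage 1 (assumed lookup table, standing in for trained operators):
# a pragmatic tag is assigned a constellation of intonational events.
TAG_TO_EVENTS = {
    "focus": [Event("L+H*", 0.5)],
    "list_nonfinal": [Event("H*", 0.4), Event("L-H%", 1.0)],
    "list_final": [Event("H*", 0.4), Event("L-L%", 1.0)],
}


def assign_events(tag):
    """Map a high-level pragmatic tag to its events."""
    return TAG_TO_EVENTS[tag]


def tone_level(label):
    """Return 1.0 if the last tone letter in the label is H, else 0.0."""
    for ch in reversed(label):
        if ch in "HL":
            return 1.0 if ch == "H" else 0.0
    raise ValueError(f"no tone in label: {label}")


def realize(events, speaker):
    """Stage 2 (assumed linear mapping, standing in for a trained
    function approximator): turn events into (position, F0) targets."""
    return [
        (e.position, speaker.base_f0 + tone_level(e.label) * speaker.f0_range)
        for e in events
    ]


speaker = Speaker(base_f0=100.0, f0_range=60.0)
targets = realize(assign_events("list_final"), speaker)
```

Under these assumptions, swapping in a different `Speaker` changes the realization while the pragmatic tag stays the same, which is the separation the two-stage design aims for. Energy and duration are omitted here for brevity.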

While the uses of prosody for meaning are numerous and varied, the approach is to create corpora with more context, so that the talker expresses one known meaning structure for each prompt rather than reading prompts in isolation; this requires specially crafted corpora and instructions. The effects to be modeled include attachment (bracketing), lists (exhaustive and inexhaustive), uncertainty/deferral, and focus/kontrast (as in Vallduví and Vilkuna).

Committee: Alan W. Black, Chair

Jack Mostow

Alexander Rudnicky

Julia Hirschberg, Columbia University