Prosodic analysis and modelling of conversational elements for speech synthesis

  1. Adell Mercado, Jordi
Supervised by:
  1. Antonio Bonafonte Cávez Director
  2. David Escudero Mancebo Co-director

Defence university: Universitat Politècnica de Catalunya (UPC)

Defence date: 22 July 2009

  1. María Asunción Moreno Bilbao Chair
  2. Ignacio Iriondo Sanz Secretary
  3. Juan María Garrido Almiñana Committee member
  4. Francisco Campillo Díaz Committee member
  5. Inmaculada Hernáez Rioja Committee member

Type: Thesis

Currently, speech synthesisers sound natural when reading text aloud. However, there is increasing interest in more expressive speech synthesisers for many new applications. Giving the synthesiser the ability to produce speech in a variety of styles is one way to gain this expressiveness. Speech in conversations is the most widely used type of speech. Therefore, more than other aspects, Text-to-Speech systems need to deal with the elements that differentiate conversation from reading. This thesis focuses on describing the characteristic elements of conversational speech and proposing models to synthesise them. Conversational speech differs significantly from reading aloud due to the inclusion of a variety of prosodic resources that affect the rhythm of the utterances. These resources are, in fact, strategies to achieve communicative tasks between the speaker and the addressee. One type of strategy is disfluency. Disfluencies are very frequent in normal speech and, contrary to what one might expect, they carry information and are useful for human communication. Other characteristic elements of conversational speech are closed defined phrases that are repeated throughout speech and that carry affective rather than propositional information. I present an analysis of three different corpora. One is very similar to the most widely used corpora in the literature for the study of disfluencies. It consists of a set of conversations. Another one was recorded for machine translation and consists of a set of speeches. The third one was recorded from ill-formed sentences, that is, sentences with incorrect syntactic structures generated by automatic systems. The corpus with the highest number of disfluencies is the conversational one. This supports the claim that disfluencies are not related to problems in speaking or planning the discourse, but are inherent to conversations. Disfluencies are disruptions of the speech flow, and they can be seen as discontinuities.
It is reasonable to assume that only around such discontinuities will the prosody differ from that expected for the same elements within a fluent sentence. Speech is fluent between discontinuities; thus, we can define regions within which speech is fluent and name them fluency regions. Following this hypothesis, and considering that the elements of a disfluency are fluent within the correct context, it follows that there exists a sentence into which an element of a disfluency could be inserted and which would still be perceived as fluent. We refer to this sentence as the underlying fluent sentence. This assumption is supported by a study of the impact of disfluencies in a sentence on speech rate and pitch. Findings show that the inclusion of disfluencies only generates local modifications around fluency boundaries with respect to the underlying fluent sentences. The prosodic parameters that best describe fluency boundaries have been identified for each type of conversational element: hesitations, repetitions, filled pauses and wrappers. Regression models are proposed to predict the prosody of these elements. The approach proposed in this thesis, which relies on the underlying fluent sentence assumption, can take advantage of prosodic models trained on fluent speech. Thus, there is no need to re-train the prosodic models.