Statistical modeling for unit selection in speech synthesis

Cyril Allauzen, Mehryar Mohri, Michael Riley

Research output: Contribution to journal › Conference article › peer-review


Traditional concatenative speech synthesis systems use a number of heuristics to define the target and concatenation costs, essential for the design of the unit selection component. In contrast to these approaches, we introduce a general statistical modeling framework for unit selection inspired by automatic speech recognition. Given appropriate data, techniques based on that framework can result in a more accurate unit selection, thereby improving the general quality of a speech synthesizer. They can also lead to a more modular and a substantially more efficient system. We present a new unit selection system based on statistical modeling. To overcome the original absence of data, we use an existing high-quality unit selection system to generate a corpus of unit sequences. We show that the concatenation cost can be accurately estimated from this corpus using a statistical n-gram language model over units. We used weighted automata and transducers for the representation of the components of the system and designed a new and more efficient composition algorithm making use of string potentials for their combination. The resulting statistical unit selection is shown to be about 2.6 times faster than the last release of the AT&T Natural Voices Product while preserving the same quality, and offers much flexibility for the use and integration of new and more complex components.
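As a rough illustration of estimating a concatenation cost from a corpus of unit sequences, the sketch below trains a bigram model over unit identifiers and uses the negative log probability of one unit following another as the cost. It is a minimal sketch under stated assumptions, not the authors' system: the abstract describes weighted automata and transducers, whereas this uses plain Python dictionaries; the add-alpha smoothing, the function name `train_bigram_cost`, and the toy unit labels are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def train_bigram_cost(sequences, alpha=1.0):
    """Return cost(prev, cur) = -log P(cur | prev) from unit sequences.

    Assumption: add-alpha smoothing over the observed unit inventory.
    The abstract does not say which smoothing the authors used.
    """
    context_counts = Counter()  # how often each unit appears as a left context
    bigram_counts = Counter()   # how often each (prev, cur) pair is adjacent
    inventory = set()
    for seq in sequences:
        inventory.update(seq)
        for prev, cur in zip(seq, seq[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    size = len(inventory)

    def cost(prev, cur):
        # Smoothed conditional probability, converted to a cost.
        p = (bigram_counts[(prev, cur)] + alpha) / (context_counts[prev] + alpha * size)
        return -math.log(p)

    return cost


# Toy corpus standing in for unit sequences produced by an existing synthesizer.
corpus = [
    ["u1", "u2", "u3"],
    ["u1", "u2", "u4"],
    ["u5", "u2", "u3"],
]
concat_cost = train_bigram_cost(corpus)
```

With this toy corpus, a join seen often in the data (`u2` followed by `u3`) receives a lower cost than one never seen (`u2` followed by `u5`), which is the behavior a data-driven concatenation cost is meant to capture.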

Original language: English (US)
Pages (from-to): 55-62
Number of pages: 8
Journal: Proceedings of the Annual Meeting of the Association for Computational Linguistics
State: Published - 2004
Event: 42nd Annual Meeting of the Association for Computational Linguistics, ACL 2004 - Barcelona, Spain
Duration: Jul 21 2004 to Jul 26 2004

ASJC Scopus subject areas

  • Computer Science Applications
  • Linguistics and Language
  • Language and Linguistics

