Abstract
Concatenative Text-to-Speech (TTS) systems such as those described by Hunt and Black [6] can select at synthesis time from a very large number of recorded units. The selected units are chosen to minimize a combination of target and join costs for a given sentence. The join costs in particular are expensive to compute, even when the computation has been optimized. Ideally, we would avoid this computation by precomputing and caching all possible join costs, but their number is prohibitive. Although the search space of possible joins is large, we have found that only a small fraction of them are selected in practice. By synthesizing a large quantity of text and logging the units actually selected, we were able to gather usage statistics and construct a practical and efficient cache of concatenation costs. Use of this cache dramatically decreased the runtime of the AT&T Next-Generation TTS system [1] with negligible effect on speech quality. Experiments show that by caching 0.7% of the possible joins, 99% of the join cost computations can be avoided.
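The caching idea in the abstract can be illustrated with a minimal Python sketch. It is not the AT&T implementation: the names `build_join_cache` and `CachedJoinCost`, the use of integer unit IDs, the `max_entries` cutoff, and the choice to compute a cost directly on a cache miss are all assumptions made for illustration. The sketch counts the adjacent unit pairs found in logged synthesis output, precomputes costs for the most frequent ones, and consults that table at lookup time.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical representation: a join is a pair (left unit ID, right unit ID).
Join = Tuple[int, int]
CostFn = Callable[[int, int], float]


def build_join_cache(
    selected_sequences: Iterable[List[int]],
    join_cost: CostFn,
    max_entries: int,
) -> Dict[Join, float]:
    """Precompute costs for the joins seen most often in logged output."""
    counts: Counter = Counter()
    for units in selected_sequences:
        # Each adjacent pair of selected units is one join that was used.
        counts.update(zip(units, units[1:]))
    return {join: join_cost(*join) for join, _ in counts.most_common(max_entries)}


class CachedJoinCost:
    """Return a cached join cost, computing it directly on a miss (assumed policy)."""

    def __init__(self, cache: Dict[Join, float], join_cost: CostFn) -> None:
        self.cache = cache
        self.join_cost = join_cost
        self.hits = 0
        self.misses = 0

    def __call__(self, left: int, right: int) -> float:
        cost = self.cache.get((left, right))
        if cost is not None:
            self.hits += 1
            return cost
        self.misses += 1
        return self.join_cost(left, right)
```

In this sketch `join_cost` stands in for the expensive acoustic comparison, and the `hits` and `misses` counters give a simple way to measure what fraction of join cost computations the cache avoids on held-out text.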
Original language | English (US) |
---|---|
Pages | 607-610 |
Number of pages | 4 |
State | Published - 1999 |
Event | 6th European Conference on Speech Communication and Technology, EUROSPEECH 1999 - Budapest, Hungary; Duration: Sep 5 1999 → Sep 9 1999 |
Conference
Conference | 6th European Conference on Speech Communication and Technology, EUROSPEECH 1999 |
---|---|
Country/Territory | Hungary |
City | Budapest |
Period | 9/5/99 → 9/9/99 |
ASJC Scopus subject areas
- Computer Science Applications
- Software
- Linguistics and Language
- Communication