TY - GEN
T1 - Voice signatures
AU - Shafran, Izhak
AU - Riley, Michael
AU - Mohri, Mehryar
N1 - Publisher Copyright:
© 2003 IEEE.
PY - 2003
Y1 - 2003
N2 - Most current spoken-dialog systems extract only sequences of words from a speaker's voice, largely ignoring other useful information that can be inferred from speech, such as gender, age, dialect, or emotion. These characteristics of a speaker's voice, voice signatures, whether static or dynamic, can be useful for speech mining applications or for the design of a natural spoken-dialog system. This paper explores the problem of automatically and accurately extracting voice signatures from a speaker's voice. We investigate two approaches for extracting speaker traits: the first focuses on general acoustic and prosodic features, the second on the choice of words used by the speaker. In the first approach, we show that standard speech/non-speech HMMs, conditioned on speaker traits and evaluated on cepstral and pitch features, achieve accuracies well above chance for all examined traits. The second approach, using support vector machines with rational kernels applied to speech recognition lattices, attains an accuracy of about 81% on the task of binary classification of emotion. Our results are based on a corpus of speech data collected from a deployed customer-care application (HMIHY 0300). While still preliminary, these results show that voice signatures are of practical interest in real-world applications.
UR - http://www.scopus.com/inward/record.url?scp=84946723550&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84946723550&partnerID=8YFLogxK
U2 - 10.1109/ASRU.2003.1318399
DO - 10.1109/ASRU.2003.1318399
M3 - Conference contribution
AN - SCOPUS:84946723550
T3 - 2003 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2003
SP - 31
EP - 36
BT - 2003 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2003
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2003
Y2 - 30 November 2003 through 4 December 2003
ER -