TY - GEN
T1 - A comparison of classifiers for detecting emotion from speech
AU - Shafran, Izhak
AU - Mohri, Mehryar
PY - 2005
Y1 - 2005
N2 - Accurate detection of emotion from speech has clear benefits for the design of more natural human-machine speech interfaces or for the extraction of useful information from large quantities of speech data. The task consists of assigning, out of a fixed set, an emotion category, e.g., anger, fear, or satisfaction, to a speech utterance. In recent work, several classifiers have been proposed for automatic detection of a speaker's emotion using spoken words as the input. These classifiers were designed independently and tested on separate corpora, making it difficult to compare their performance. This paper presents three classifiers, two popular classifiers from the literature modeling the word content via n-gram sequences, one based on an interpolated language model, another on a mutual information-based feature-selection approach, and compares them with a discriminant kernel-based technique that we recently adopted. We have implemented these three classification algorithms and evaluated their performance by applying them to a corpus collected from a spoken-dialog system that was widely deployed across the US. The results show that our kernel-based classifier achieves an accuracy of 80.6%, and out-performs both the interpolated language model classifier, which achieved a classification accuracy of 70.1%, and the classifier using mutual information-based feature selection (78.8%).
AB - Accurate detection of emotion from speech has clear benefits for the design of more natural human-machine speech interfaces or for the extraction of useful information from large quantities of speech data. The task consists of assigning, out of a fixed set, an emotion category, e.g., anger, fear, or satisfaction, to a speech utterance. In recent work, several classifiers have been proposed for automatic detection of a speaker's emotion using spoken words as the input. These classifiers were designed independently and tested on separate corpora, making it difficult to compare their performance. This paper presents three classifiers, two popular classifiers from the literature modeling the word content via n-gram sequences, one based on an interpolated language model, another on a mutual information-based feature-selection approach, and compares them with a discriminant kernel-based technique that we recently adopted. We have implemented these three classification algorithms and evaluated their performance by applying them to a corpus collected from a spoken-dialog system that was widely deployed across the US. The results show that our kernel-based classifier achieves an accuracy of 80.6%, and out-performs both the interpolated language model classifier, which achieved a classification accuracy of 70.1%, and the classifier using mutual information-based feature selection (78.8%).
UR - http://www.scopus.com/inward/record.url?scp=33646764533&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=33646764533&partnerID=8YFLogxK
U2 - 10.1109/ICASSP.2005.1415120
DO - 10.1109/ICASSP.2005.1415120
M3 - Conference contribution
AN - SCOPUS:33646764533
SN - 0780388747
SN - 9780780388741
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - I341-I344
BT - 2005 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP '05 - Proceedings - Image and Multidimensional Signal Processing Multimedia Signal Processing
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2005 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP '05
Y2 - 18 March 2005 through 23 March 2005
ER -