TY - CONF
T1 - Benchmarking Evaluation Metrics for Code-Switching Automatic Speech Recognition
AU - Hamed, Injy
AU - Hussein, Amir
AU - Chellah, Oumnia
AU - Chowdhury, Shammur
AU - Mubarak, Hamdy
AU - Sitaram, Sunayana
AU - Habash, Nizar
AU - Ali, Ahmed
N1 - Funding Information:
The work presented here was carried out during the 2022 Jelinek Memorial Summer Workshop on Speech and Language Technologies at Johns Hopkins University, which was supported with funding from Amazon, Microsoft and Google. We also thank the anonymous reviewers for their helpful feedback.
Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
AB - Code-switching poses a number of challenges and opportunities for multilingual automatic speech recognition. In this paper, we focus on the question of robust and fair evaluation metrics. To that end, we develop a reference benchmark data set of code-switching speech recognition hypotheses with human judgments. We define clear guidelines for minimal editing of automatic hypotheses. We validate the guidelines using 4-way inter-annotator agreement. We evaluate a large number of metrics in terms of correlation with human judgments. The metrics we consider vary in terms of representation (orthographic, phonological, semantic), directness (intrinsic vs. extrinsic), granularity (e.g., word, character), and similarity computation method. The highest correlation with human judgment is achieved using transliteration followed by text normalization. We release the first corpus for human acceptance of code-switching speech recognition results in dialectal Arabic/English conversational speech.
KW - ASR
KW - Code-switching
KW - Evaluation metric
UR - http://www.scopus.com/inward/record.url?scp=85147793797&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85147793797&partnerID=8YFLogxK
DO - 10.1109/SLT54892.2023.10023181
M3 - Conference contribution
AN - SCOPUS:85147793797
T3 - 2022 IEEE Spoken Language Technology Workshop, SLT 2022 - Proceedings
SP - 999
EP - 1005
BT - 2022 IEEE Spoken Language Technology Workshop, SLT 2022 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2022 IEEE Spoken Language Technology Workshop, SLT 2022
Y2 - 9 January 2023 through 12 January 2023
ER -