TY - GEN
T1 - Careless Whisper
T2 - 2024 ACM Conference on Fairness, Accountability, and Transparency, FAccT 2024
AU - Koenecke, Allison
AU - Choi, Anna Seo Gyeong
AU - Mei, Katelyn X.
AU - Schellmann, Hilke
AU - Sloane, Mona
N1 - Publisher Copyright:
© 2024 ACM.
PY - 2024/6/3
Y1 - 2024/6/3
AB - Speech-to-text services aim to transcribe input audio as accurately as possible. They increasingly play a role in everyday life, for example in personal voice assistants or in customer-company interactions. We evaluate OpenAI's Whisper, a state-of-the-art automated speech recognition service outperforming industry competitors, as of 2023. While many of Whisper's transcriptions were highly accurate, we find that roughly 1% of audio transcriptions contained entire hallucinated phrases or sentences which did not exist in any form in the underlying audio. We thematically analyze the Whisper-hallucinated content, finding that 38% of hallucinations include explicit harms such as perpetuating violence, making up inaccurate associations, or implying false authority. We then study why hallucinations occur by observing the disparities in hallucination rates between speakers with aphasia (who have a lowered ability to express themselves using speech and voice) and a control group. We find that hallucinations disproportionately occur for individuals who speak with longer shares of non-vocal durations - a common symptom of aphasia. We call on industry practitioners to ameliorate these language-model-based hallucinations in Whisper, and to raise awareness of potential biases amplified by hallucinations in downstream applications of speech-to-text models.
KW - Algorithmic Fairness
KW - Automated Speech Recognition
KW - Generative AI
KW - Thematic Coding
UR - http://www.scopus.com/inward/record.url?scp=85196648475&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85196648475&partnerID=8YFLogxK
U2 - 10.1145/3630106.3658996
DO - 10.1145/3630106.3658996
M3 - Conference contribution
AN - SCOPUS:85196648475
T3 - 2024 ACM Conference on Fairness, Accountability, and Transparency, FAccT 2024
SP - 1672
EP - 1681
BT - 2024 ACM Conference on Fairness, Accountability, and Transparency, FAccT 2024
PB - Association for Computing Machinery, Inc
Y2 - 3 June 2024 through 6 June 2024
ER -