Unspoken Sound: Identifying Trends in Non-Speech Audio Captioning on YouTube

Lloyd May, Keita Ohshiro, Khang Dang, Sripathi Sridhar, Jhanvi Pai, Magdalena Fuentes, Sooyeon Lee, Mark Cartwright

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


High-quality closed captioning of both speech and non-speech elements (e.g., music, sound effects, manner of speaking, and speaker identification) is essential for the accessibility of video content, especially for d/Deaf and hard-of-hearing individuals. While many regions have regulations mandating captioning for television and movies, a regulatory gap remains for the vast amount of web-based video content, including the staggering 500+ hours uploaded to YouTube every minute. Advances in automatic speech recognition have bolstered the presence of captions on YouTube. However, the technology has notable limitations, including the omission of many non-speech elements, which are often crucial for understanding content narratives. This paper examines the contemporary and historical state of non-speech information (NSI) captioning on YouTube through the creation and exploratory analysis of a dataset of over 715k videos. We identify factors that influence NSI caption practices and suggest avenues for future research to enhance the accessibility of online video content.

Original language: English (US)
Title of host publication: CHI 2024 - Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems
Publisher: Association for Computing Machinery
ISBN (Electronic): 9798400703300
State: Published - May 11 2024
Event: 2024 CHI Conference on Human Factors in Computing Systems, CHI 2024 - Hybrid, Honolulu, United States
Duration: May 11 2024 - May 16 2024

Publication series

Name: Conference on Human Factors in Computing Systems - Proceedings


Conference: 2024 CHI Conference on Human Factors in Computing Systems, CHI 2024
Country/Territory: United States
City: Hybrid, Honolulu


  • closed captioning
  • datasets
  • extra-speech information
  • non-speech information
  • subtitles

ASJC Scopus subject areas

  • Human-Computer Interaction
  • Computer Graphics and Computer-Aided Design
  • Software

