How to Listen? Rethinking Visual Sound Localization

Ho Hsiang Wu, Magdalena Fuentes, Prem Seetharaman, Juan Pablo Bello

Research output: Contribution to journal › Conference article › peer-review


Localizing visual sounds consists of locating the position of objects that emit sound within an image. It is a growing research area with potential applications in monitoring natural and urban environments, such as wildlife migration and urban traffic. Previous works were usually evaluated on datasets containing mostly a single dominant visible object, and their proposed models usually require introducing localization modules during training or dedicated sampling strategies. It remains unclear, however, how these design choices affect the adaptability of these methods to more challenging scenarios. In this work, we analyze various model choices for visual sound localization and discuss how their different components, namely the encoders' architecture, the loss function, and the localization strategy, affect the model's performance. Furthermore, we study the interaction between these decisions, the model performance, and the data by examining evaluation datasets that span different difficulties and characteristics, and we discuss the implications of such decisions for real-world applications. Our code and model weights are open-sourced and made available for further applications.

Original language: English (US)
Pages (from-to): 876-880
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
State: Published - 2022
Event: 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022 - Incheon, Korea, Republic of
Duration: Sep 18 2022 - Sep 22 2022

Keywords


  • acoustic event detection
  • acoustic scene understanding
  • audio-visual scene understanding
  • explainability
  • sound source localization

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modeling and Simulation

