WAV2CLIP: LEARNING ROBUST AUDIO REPRESENTATIONS FROM CLIP

Ho Hsiang Wu, Prem Seetharaman, Kundan Kumar, Juan Pablo Bello

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

We propose Wav2CLIP, a robust audio representation learning method by distilling from Contrastive Language-Image Pre-training (CLIP). We systematically evaluate Wav2CLIP on a variety of audio tasks including classification, retrieval, and generation, and show that Wav2CLIP can outperform several publicly available pre-trained audio representation algorithms. Wav2CLIP projects audio into a shared embedding space with images and text, which enables multimodal applications such as zero-shot classification, and cross-modal retrieval. Furthermore, Wav2CLIP needs just ∼10% of the data to achieve competitive performance on downstream tasks compared with fully supervised models, and is more efficient to pre-train than competing methods as it does not require learning a visual model in concert with an auditory model. Finally, we demonstrate image generation from Wav2CLIP as qualitative assessment of the shared embedding space. Our code and model weights are open sourced and made available for further applications.

Original languageEnglish (US)
Title of host publication2022 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Proceedings
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages4563-4567
Number of pages5
ISBN (Electronic)9781665405409
DOIs
StatePublished - 2022
Event47th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Virtual, Online, Singapore
Duration: May 23 2022May 27 2022

Publication series

NameICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Volume2022-May
ISSN (Print)1520-6149

Conference

Conference47th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022
Country/TerritorySingapore
CityVirtual, Online
Period5/23/225/27/22

Keywords

  • Audio Captioning
  • Audio Classification
  • Audio Representation Learning
  • Audio Tagging
  • Cross-Modal Retrieval
  • Image Generation
  • Multimodal Learning
  • Self-Supervised Learning

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Electrical and Electronic Engineering

Fingerprint

Dive into the research topics of 'WAV2CLIP: LEARNING ROBUST AUDIO REPRESENTATIONS FROM CLIP'. Together they form a unique fingerprint.

Cite this