TY - GEN
T1 - Bridging High-Quality Audio and Video Via Language for Sound Effects Retrieval from Visual Queries
AU - Wilkins, Julia
AU - Salamon, Justin
AU - Fuentes, Magdalena
AU - Bello, Juan Pablo
AU - Nieto, Oriol
N1 - Publisher Copyright: © 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - Finding the right sound effects (SFX) to match moments in a video is a difficult and time-consuming task, and relies heavily on the quality and completeness of text metadata. Retrieving high-quality (HQ) SFX using a video frame directly as the query is an attractive alternative, removing the reliance on text metadata and providing a low barrier to entry for non-experts. Due to the lack of HQ audio-visual training data, previous work on audio-visual retrieval relies on YouTube ("in-the-wild") videos of varied quality for training, where the audio is often noisy and the video of amateur quality. As such, it is unclear whether these systems would generalize to the task of matching HQ audio to production-quality video. To address this, we propose a multimodal framework for recommending HQ SFX given a video frame by (1) leveraging large language models and foundational vision-language models to bridge HQ audio and video to create audio-visual pairs, resulting in a highly scalable automatic audio-visual data curation pipeline; and (2) using pre-trained audio and visual encoders to train a contrastive learning-based retrieval system. We show that our system, trained using our automatic data curation pipeline, significantly outperforms baselines trained on in-the-wild data on the task of HQ SFX retrieval for video. Furthermore, while the baselines fail to generalize to this task, our system generalizes well from clean to in-the-wild data, outperforming the baselines on a dataset of YouTube videos despite only being trained on the HQ audio-visual pairs. A user study confirms that people prefer SFX retrieved by our system over the baseline 67% of the time for both HQ and in-the-wild data. Finally, we present ablations to determine the impact of model and data pipeline design choices on downstream retrieval performance. Please visit our companion website to listen to and view our SFX retrieval results.
KW - Multimodal machine learning
KW - audio-visual representation learning
KW - cross-modal retrieval
KW - data augmentation
UR - http://www.scopus.com/inward/record.url?scp=85172984846&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85172984846&partnerID=8YFLogxK
U2 - 10.1109/WASPAA58266.2023.10248113
DO - 10.1109/WASPAA58266.2023.10248113
M3 - Conference contribution
AN - SCOPUS:85172984846
T3 - IEEE Workshop on Applications of Signal Processing to Audio and Acoustics
BT - Proceedings of the 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA 2023
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA 2023
Y2 - 22 October 2023 through 25 October 2023
ER -