TY - GEN
T1 - A Study on Robustness to Perturbations for Representations of Environmental Sound
AU - Srivastava, Sangeeta
AU - Wu, Ho Hsiang
AU - Rulff, Joao
AU - Fuentes, Magdalena
AU - Cartwright, Mark
AU - Silva, Claudio
AU - Arora, Anish
AU - Bello, Juan Pablo
N1 - Funding Information:
This work is partially supported by NSF award 1544753.
Publisher Copyright:
© 2022 European Signal Processing Conference, EUSIPCO. All rights reserved.
PY - 2022
Y1 - 2022
N2 - Audio applications involving environmental sound analysis increasingly use general-purpose audio representations, also known as embeddings, for transfer learning. Recently, the Holistic Evaluation of Audio Representations (HEAR) benchmark evaluated twenty-nine embedding models on nineteen diverse tasks. However, the evaluation's effectiveness depends on the variation already captured within a given dataset. Therefore, for a given data domain, it is unclear how the representations would be affected by the variations caused by a wide range of microphones and acoustic conditions, commonly known as channel effects. In this work, we aim to extend HEAR to evaluate invariance to channel effects. To accomplish this, we imitate channel effects by injecting perturbations into the audio signal and measure the shift in the new (perturbed) embeddings with three distance measures, making the evaluation domain-dependent but not task-dependent. Combined with the downstream performance, this helps us make a more informed prediction of how robust the embeddings are to channel effects. We evaluate two embeddings, YAMNet and OpenL3, on monophonic (UrbanSound8K) and polyphonic (SONYC-UST) urban datasets. We show that a single distance measure does not suffice for such task-independent evaluation. Although Fréchet Audio Distance (FAD) correlates most accurately with the trend of the performance drop in the downstream task, we show that FAD must be studied in conjunction with the other distances to obtain a clear understanding of the overall effect of the perturbation. In terms of embedding performance, we find OpenL3 to be more robust than YAMNet, which aligns with the HEAR evaluation.
AB - Audio applications involving environmental sound analysis increasingly use general-purpose audio representations, also known as embeddings, for transfer learning. Recently, the Holistic Evaluation of Audio Representations (HEAR) benchmark evaluated twenty-nine embedding models on nineteen diverse tasks. However, the evaluation's effectiveness depends on the variation already captured within a given dataset. Therefore, for a given data domain, it is unclear how the representations would be affected by the variations caused by a wide range of microphones and acoustic conditions, commonly known as channel effects. In this work, we aim to extend HEAR to evaluate invariance to channel effects. To accomplish this, we imitate channel effects by injecting perturbations into the audio signal and measure the shift in the new (perturbed) embeddings with three distance measures, making the evaluation domain-dependent but not task-dependent. Combined with the downstream performance, this helps us make a more informed prediction of how robust the embeddings are to channel effects. We evaluate two embeddings, YAMNet and OpenL3, on monophonic (UrbanSound8K) and polyphonic (SONYC-UST) urban datasets. We show that a single distance measure does not suffice for such task-independent evaluation. Although Fréchet Audio Distance (FAD) correlates most accurately with the trend of the performance drop in the downstream task, we show that FAD must be studied in conjunction with the other distances to obtain a clear understanding of the overall effect of the perturbation. In terms of embedding performance, we find OpenL3 to be more robust than YAMNet, which aligns with the HEAR evaluation.
KW - Self-supervised learning
KW - acoustic perturbations
KW - robust audio embeddings
KW - transfer learning
KW - urban sound
UR - http://www.scopus.com/inward/record.url?scp=85141010807&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85141010807&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85141010807
T3 - European Signal Processing Conference
SP - 125
EP - 129
BT - 30th European Signal Processing Conference, EUSIPCO 2022 - Proceedings
PB - European Signal Processing Conference, EUSIPCO
T2 - 30th European Signal Processing Conference, EUSIPCO 2022
Y2 - 29 August 2022 through 2 September 2022
ER -