TY - GEN
T1 - EXPLORING MODALITY-AGNOSTIC REPRESENTATIONS FOR MUSIC CLASSIFICATION
AU - Wu, Ho-Hsiang
AU - Fuentes, Magdalena
AU - Bello, Juan P.
N1 - Funding Information:
This work is partially supported by the National Science Foundation award #1544753. Magdalena Fuentes is a faculty fellow in the NYU Provost's Postdoctoral Fellowship Program at the NYU Center for Urban Science and Progress and the Music and Audio Research Laboratory.
Publisher Copyright:
© 2021 the Authors.
PY - 2021
Y1 - 2021
AB - Music information is often conveyed or recorded across multiple data modalities including, but not limited to, audio, images, text, and scores. However, music information retrieval research has almost exclusively focused on single-modality recognition, requiring the development of separate models for each modality. Some multi-modal works require multiple coexisting modalities given to the model as inputs, constraining the use of these models to the few cases where data from all modalities are available. To the best of our knowledge, no existing model has the ability to take inputs from varying modalities, e.g., images or sounds, and classify them into unified music categories. We explore the use of cross-modal retrieval as a pretext task to learn modality-agnostic representations, which can then be used as inputs to classifiers that are independent of modality. We select instrument classification as an example task for our study, as both visual and audio components provide relevant semantic information. We train music instrument classifiers that can take either images or sounds as input and perform comparably to sound-only or image-only classifiers. Furthermore, we explore the case where there is limited labeled data for a given modality, and the impact on performance of using labeled data from other modalities. We achieve almost 70% of the best-performing system's performance in a zero-shot setting. We provide a detailed analysis of the experimental results to understand the potential and limitations of the approach, and discuss future steps towards modality-agnostic classifiers.
UR - http://www.scopus.com/inward/record.url?scp=85122058798&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85122058798&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85122058798
T3 - Proceedings of the Sound and Music Computing Conferences
SP - 191
EP - 198
BT - SMC 2021 - Proceedings of the 18th Sound and Music Computing Conference
A2 - Mauro, Davide Andrea
A2 - Spagnol, Simone
A2 - Valle, Andrea
PB - Sound and Music Computing Network
T2 - 18th Sound and Music Computing Conference, SMC 2021
Y2 - 29 June 2021 through 1 July 2021
ER -