TY - JOUR
T1 - Multi-task self-supervised pre-training for music classification
AU - Wu, Ho-Hsiang
AU - Kao, Chieh-Chi
AU - Tang, Qingming
AU - Sun, Ming
AU - McFee, Brian
AU - Bello, Juan Pablo
AU - Wang, Chao
N1 - Funding Information:
★ Work done at Amazon. This work is partially supported by the National Science Foundation award #1544753.
Publisher Copyright:
© 2021 IEEE.
PY - 2021
Y1 - 2021
N2 - Deep learning is very data-hungry, and supervised learning in particular requires massive amounts of labeled data to work well. Machine listening research often suffers from the problem of limited labeled data, as human annotations are costly to acquire, and annotating audio is time consuming and less intuitive. Moreover, models learned from a labeled dataset often embed biases specific to that particular dataset. Therefore, unsupervised learning techniques have become popular approaches for solving machine listening problems. In particular, a self-supervised learning technique that reconstructs multiple hand-crafted audio features has shown promising results when applied to the speech domain, for tasks such as emotion recognition and automatic speech recognition (ASR). In this paper, we apply self-supervised and multi-task learning methods to pre-training music encoders, and explore various design choices including encoder architectures, weighting mechanisms for combining losses from multiple tasks, and the selection of pretext-task workers. We investigate how these design choices interact with various downstream music classification tasks. We find that using a variety of music-specific workers together with weighting mechanisms to balance the losses during pre-training helps the learned representations improve on and generalize to the downstream tasks.
AB - Deep learning is very data-hungry, and supervised learning in particular requires massive amounts of labeled data to work well. Machine listening research often suffers from the problem of limited labeled data, as human annotations are costly to acquire, and annotating audio is time consuming and less intuitive. Moreover, models learned from a labeled dataset often embed biases specific to that particular dataset. Therefore, unsupervised learning techniques have become popular approaches for solving machine listening problems. In particular, a self-supervised learning technique that reconstructs multiple hand-crafted audio features has shown promising results when applied to the speech domain, for tasks such as emotion recognition and automatic speech recognition (ASR). In this paper, we apply self-supervised and multi-task learning methods to pre-training music encoders, and explore various design choices including encoder architectures, weighting mechanisms for combining losses from multiple tasks, and the selection of pretext-task workers. We investigate how these design choices interact with various downstream music classification tasks. We find that using a variety of music-specific workers together with weighting mechanisms to balance the losses during pre-training helps the learned representations improve on and generalize to the downstream tasks.
KW - Multi-task learning
KW - Music classification
KW - Self-supervised learning
UR - http://www.scopus.com/inward/record.url?scp=85111236815&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85111236815&partnerID=8YFLogxK
U2 - 10.1109/ICASSP39728.2021.9414405
DO - 10.1109/ICASSP39728.2021.9414405
M3 - Conference article
AN - SCOPUS:85111236815
SN - 1520-6149
VL - 2021-June
SP - 556
EP - 560
JO - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
JF - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
T2 - 2021 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2021
Y2 - 6 June 2021 through 11 June 2021
ER -