TY - GEN
T1 - Tricycle
T2 - 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA 2019
AU - Cartwright, Mark
AU - Cramer, Jason
AU - Salamon, Justin
AU - Bello, Juan Pablo
N1 - Funding Information:
This work was partially supported by NSF awards 1544753 and 1633259.
Publisher Copyright:
© 2019 IEEE.
PY - 2019/10
Y1 - 2019/10
N2 - Self-supervised representation learning with deep neural networks is a powerful tool for machine learning tasks with limited labeled data but extensive unlabeled data. To learn representations, self-supervised models are typically trained on a pretext task to predict structure in the data (e.g., audio-visual correspondence, short-term temporal sequence, word sequence) that is indicative of higher-level concepts relevant to a target downstream task. Sensor networks are promising yet unexplored sources of data for self-supervised learning: they collect large amounts of unlabeled yet timestamped data over extended periods of time and typically exhibit long-term temporal structure (e.g., over hours, months, years) not observable at the short time scales previously explored in self-supervised learning (e.g., seconds). This structure can be present even in single-modal data and therefore could be exploited for self-supervision in many types of sensor networks. In this work, we present a model for learning audio representations by predicting the long-term, cyclic temporal structure in audio data collected from an urban acoustic sensor network. We then demonstrate the utility of the learned audio representation in an urban sound event detection task with limited labeled data.
KW - audio embedding
KW - representation learning
KW - self-supervised learning
KW - sensor network
UR - http://www.scopus.com/inward/record.url?scp=85078028335&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85078028335&partnerID=8YFLogxK
U2 - 10.1109/WASPAA.2019.8937265
DO - 10.1109/WASPAA.2019.8937265
M3 - Conference contribution
AN - SCOPUS:85078028335
T3 - IEEE Workshop on Applications of Signal Processing to Audio and Acoustics
SP - 278
EP - 282
BT - 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA 2019
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 20 October 2019 through 23 October 2019
ER -