TY - JOUR
T1 - Robust sound event detection in bioacoustic sensor networks
AU - Lostanlen, Vincent
AU - Salamon, Justin
AU - Farnsworth, Andrew
AU - Kelling, Steve
AU - Bello, Juan Pablo
N1 - Funding Information:
This research was supported by the National Science Foundation (grants 1633259 to JPB and 1633206 to SK and AF), the Leon Levy Foundation, and Google faculty awards to SK and JPB (https://ai.google/research/outreach/facultyresearch-awards/recipients/). We wish to thank Marc Delcroix, Holger Klinck, Peter Li, Richard F. Lyon, and Brian McFee for fruitful discussions. We thank Thomas Grill and Jan Schlüter for sharing the source code of their "bulbul" system for bird detection in audio signals. Lastly, we wish to thank the reviewers for their valuable comments and efforts to improve the present manuscript.
Publisher Copyright:
© 2019 Lostanlen et al.
PY - 2019/10/1
Y1 - 2019/10/1
N2 - Bioacoustic sensors, sometimes known as autonomous recording units (ARUs), can record sounds of wildlife over long periods of time in scalable and minimally invasive ways. Deriving per-species abundance estimates from these sensors requires detection, classification, and quantification of animal vocalizations as individual acoustic events. Yet, variability in ambient noise, both over time and across sensors, hinders the reliability of current automated systems for sound event detection (SED), such as convolutional neural networks (CNN) in the time-frequency domain. In this article, we develop, benchmark, and combine several machine listening techniques to improve the generalizability of SED models across heterogeneous acoustic environments. As a case study, we consider the problem of detecting avian flight calls from a ten-hour recording of nocturnal bird migration, recorded by a network of six ARUs in the presence of heterogeneous background noise. Starting from a CNN yielding state-of-the-art accuracy on this task, we introduce two noise adaptation techniques, respectively integrating short-term (60 ms) and long-term (30 min) context. First, we apply per-channel energy normalization (PCEN) in the time-frequency domain, which applies short-term automatic gain control to every subband in the mel-frequency spectrogram. Second, we replace the last dense layer in the network with a context-adaptive neural network (CA-NN) layer, i.e., an affine layer whose weights are dynamically adapted at prediction time by an auxiliary network taking long-term summary statistics of spectrotemporal features as input. We show that PCEN reduces temporal overfitting across dawn vs. dusk audio clips, whereas context adaptation on PCEN-based summary statistics reduces spatial overfitting across sensor locations. Moreover, combining them yields state-of-the-art results that are unmatched by artificial data augmentation alone. We release a pre-trained version of our best-performing system under the name of BirdVoxDetect, a ready-to-use detector of avian flight calls in field recordings.
AB - Bioacoustic sensors, sometimes known as autonomous recording units (ARUs), can record sounds of wildlife over long periods of time in scalable and minimally invasive ways. Deriving per-species abundance estimates from these sensors requires detection, classification, and quantification of animal vocalizations as individual acoustic events. Yet, variability in ambient noise, both over time and across sensors, hinders the reliability of current automated systems for sound event detection (SED), such as convolutional neural networks (CNN) in the time-frequency domain. In this article, we develop, benchmark, and combine several machine listening techniques to improve the generalizability of SED models across heterogeneous acoustic environments. As a case study, we consider the problem of detecting avian flight calls from a ten-hour recording of nocturnal bird migration, recorded by a network of six ARUs in the presence of heterogeneous background noise. Starting from a CNN yielding state-of-the-art accuracy on this task, we introduce two noise adaptation techniques, respectively integrating short-term (60 ms) and long-term (30 min) context. First, we apply per-channel energy normalization (PCEN) in the time-frequency domain, which applies short-term automatic gain control to every subband in the mel-frequency spectrogram. Second, we replace the last dense layer in the network with a context-adaptive neural network (CA-NN) layer, i.e., an affine layer whose weights are dynamically adapted at prediction time by an auxiliary network taking long-term summary statistics of spectrotemporal features as input. We show that PCEN reduces temporal overfitting across dawn vs. dusk audio clips, whereas context adaptation on PCEN-based summary statistics reduces spatial overfitting across sensor locations. Moreover, combining them yields state-of-the-art results that are unmatched by artificial data augmentation alone. We release a pre-trained version of our best-performing system under the name of BirdVoxDetect, a ready-to-use detector of avian flight calls in field recordings.
UR - http://www.scopus.com/inward/record.url?scp=85074064328&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85074064328&partnerID=8YFLogxK
U2 - 10.1371/journal.pone.0214168
DO - 10.1371/journal.pone.0214168
M3 - Article
C2 - 31647815
AN - SCOPUS:85074064328
VL - 14
JO - PLoS One
JF - PLoS One
SN - 1932-6203
IS - 10
M1 - e0214168
ER -