DENet: a deep architecture for audio surveillance applications

  • Original Article
  • Published in Neural Computing and Applications

Abstract

In recent years, the design of audio surveillance systems, able to analyse an audio stream and identify events of interest, has attracted great interest from both the scientific community and the market; this is particularly true in security applications, in which audio analytics can be profitably used as an alternative to video analytics systems, or in combination with them. Within this context, in this paper we propose a novel recurrent convolutional neural network architecture, named DENet. It is based on a new layer that we call the denoising-enhancement (DE) layer, which performs denoising and enhancement of the original signal by applying an attention map to the components of the band-filtered signal. Differently from state-of-the-art methodologies, DENet takes the lossless raw waveform as input and is able to automatically learn the evolution of the frequencies of interest over time, by combining the proposed layer with a bidirectional gated recurrent unit. Using the feedback coming from classifications of consecutive frames (i.e. frames belonging to the same event), the proposed method is able to drastically reduce misclassifications. We carried out experiments on the MIVIA Audio Events and MIVIA Road Events public datasets, confirming the effectiveness of our approach with respect to other state-of-the-art methodologies.
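The core idea of the DE layer, as described above, can be sketched numerically: split the raw waveform into frequency bands, gate each band with a sigmoid attention map, and re-mix the gated bands. The sketch below is an illustrative NumPy approximation only, not the authors' implementation (which is available at the GitHub link below): the FFT-mask filterbank, the `scores` vector and the summation step are hypothetical stand-ins for the learned components of DENet.

```python
import numpy as np

rng = np.random.default_rng(0)

def band_filter(signal, n_bands):
    """Split a raw waveform into n_bands frequency bands via FFT masking.
    (Hypothetical stand-in for the learned band filterbank in DENet.)"""
    spec = np.fft.rfft(signal)
    edges = np.linspace(0, spec.size, n_bands + 1, dtype=int)
    bands = np.empty((n_bands, signal.size))
    for i in range(n_bands):
        masked = np.zeros_like(spec)
        masked[edges[i]:edges[i + 1]] = spec[edges[i]:edges[i + 1]]
        bands[i] = np.fft.irfft(masked, n=signal.size)
    return bands

def de_layer(bands, scores):
    """Denoising-enhancement step: a sigmoid attention map gates each band.
    In DENet the scores would come from a learned sub-network (here they
    are given explicitly for illustration)."""
    gate = 1.0 / (1.0 + np.exp(-scores))        # attention in (0, 1) per band
    return (bands * gate[:, None]).sum(axis=0)  # re-mix the gated bands

signal = rng.standard_normal(1024)
bands = band_filter(signal, n_bands=8)
# Large positive scores open every gate: the layer passes the signal through.
enhanced = de_layer(bands, scores=np.full(8, 10.0))
```

Because the FFT masks partition the spectrum, summing all bands with fully open gates reconstructs the original waveform, while large negative scores on noisy bands suppress them; this is the denoising/enhancement trade-off the attention map controls.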


Data availability statement

The authors do not provide supplementary data or material.

Code availability

The code is available at: https://github.com/MiviaLab/DENet.


Funding

The authors did not receive funding for this research.

Author information

Corresponding author

Correspondence to Alessia Saggese.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.


Cite this article

Greco, A., Roberto, A., Saggese, A. et al. DENet: a deep architecture for audio surveillance applications. Neural Comput & Applic 33, 11273–11284 (2021). https://doi.org/10.1007/s00521-020-05572-5
