DENet: a deep architecture for audio surveillance applications

  • Original Article
  • Published in Neural Computing and Applications

Abstract

In recent years, the design of audio surveillance systems, able to analyse an audio stream and identify events of interest, has attracted great interest from both the scientific community and the market; this is particularly true in security applications, in which audio analytics can be profitably used as an alternative to video analytics systems, or in combination with them. Within this context, in this paper we propose a novel recurrent convolutional neural network architecture, named DENet. It is based on a new layer that we call the denoising-enhancement (DE) layer, which performs denoising and enhancement of the original signal by applying an attention map to the components of the band-filtered signal. Differently from state-of-the-art methodologies, DENet takes the lossless raw waveform as input and is able to automatically learn the evolution of the frequencies of interest over time, by combining the proposed layer with a bidirectional gated recurrent unit. Using the feedback coming from classifications of consecutive frames (i.e. frames belonging to the same event), the proposed method is able to drastically reduce misclassifications. We carried out experiments on the MIVIA Audio Events and MIVIA Road Events public datasets, confirming the effectiveness of our approach with respect to other state-of-the-art methodologies.
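The core idea of the DE layer, as described above, can be sketched numerically: split the raw waveform into frequency bands, gate each band with a sigmoid attention map, and re-mix the gated bands. The sketch below is an illustrative NumPy approximation only, not the authors' implementation (which is available at the GitHub link below): the FFT-mask filterbank, the `scores` vector and the summation step are hypothetical stand-ins for the learned components of DENet.

```python
import numpy as np

rng = np.random.default_rng(0)

def band_filter(signal, n_bands):
    """Split a raw waveform into n_bands frequency bands via FFT masking.
    (Hypothetical stand-in for the learned band filterbank in DENet.)"""
    spec = np.fft.rfft(signal)
    edges = np.linspace(0, spec.size, n_bands + 1, dtype=int)
    bands = np.empty((n_bands, signal.size))
    for i in range(n_bands):
        masked = np.zeros_like(spec)
        masked[edges[i]:edges[i + 1]] = spec[edges[i]:edges[i + 1]]
        bands[i] = np.fft.irfft(masked, n=signal.size)
    return bands

def de_layer(bands, scores):
    """Denoising-enhancement step: a sigmoid attention map gates each band.
    In DENet the scores would come from a learned sub-network (here they
    are given explicitly for illustration)."""
    gate = 1.0 / (1.0 + np.exp(-scores))        # attention in (0, 1) per band
    return (bands * gate[:, None]).sum(axis=0)  # re-mix the gated bands

signal = rng.standard_normal(1024)
bands = band_filter(signal, n_bands=8)
# Large positive scores open every gate: the layer passes the signal through.
enhanced = de_layer(bands, scores=np.full(8, 10.0))
```

Because the FFT masks partition the spectrum, summing all bands with fully open gates reconstructs the original waveform, while large negative scores on noisy bands suppress them; this is the denoising/enhancement trade-off the attention map controls.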


Data availability statement

The authors do not provide supplementary data or material.

Code availability

The code is available at: https://github.com/MiviaLab/DENet.


Funding

The authors did not receive funding for this research.

Author information

Corresponding author

Correspondence to Alessia Saggese.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.


Cite this article

Greco, A., Roberto, A., Saggese, A. et al. DENet: a deep architecture for audio surveillance applications. Neural Comput & Applic 33, 11273–11284 (2021). https://doi.org/10.1007/s00521-020-05572-5
