Abstract
Audio Event Detection (AED) is the task of identifying the types of events present in audio signals. AED is essential for applications that make decisions based on audio, which can be critical, for example, in health, surveillance and security settings. Despite the proven benefits of deep learning for learning the best representation for a problem, AED studies still generally employ hand-crafted representations even when deep learning is used to solve the AED task itself. Intrigued by this, we investigate whether hand-crafted representations (i.e., spectrogram, mel spectrogram, log mel spectrogram and mel frequency cepstral coefficients) are better than a representation learned with a Convolutional Autoencoder (CAE). To the best of our knowledge, our study is the first to ask this question and to thoroughly compare feature representations for AED. To this end, we first find the best hop size and window size for each hand-crafted representation, and then compare the optimized hand-crafted representations with CAE-learned representations. Our extensive analyses on a subset of the AudioSet dataset confirm the common practice: hand-crafted representations do perform better than learned features, by a large margin (∼30 AP). Moreover, we show that the commonly used window and hop sizes do not yield optimal performance for the hand-crafted representations.
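The hand-crafted representations named above are all derived from a windowed short-time Fourier transform, so the window and hop sizes being tuned directly control the time–frequency resolution of the features. Below is a minimal NumPy sketch of how a spectrogram and a log mel spectrogram could be computed; the window size (1024), hop size (512) and number of mel bands (40) are illustrative assumptions, not the settings used in the paper.

```python
import numpy as np

def spectrogram(signal, win_size=1024, hop_size=512):
    """Magnitude spectrogram via a Hann-windowed STFT."""
    window = np.hanning(win_size)
    n_frames = 1 + (len(signal) - win_size) // hop_size
    frames = np.stack([signal[i * hop_size : i * hop_size + win_size] * window
                       for i in range(n_frames)])
    # One row per frame, one column per frequency bin (win_size // 2 + 1 bins)
    return np.abs(np.fft.rfft(frames, axis=1))

def hz_to_mel(f):
    # Mel scale (perceptual pitch scale), in its common logarithmic form
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(sr, win_size, n_mels=40):
    """Triangular filters spaced evenly on the mel scale."""
    mel_pts = np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)
    bins = np.floor((win_size + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_mels, win_size // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    return fb

sr = 16000
signal = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)  # 1 s of a 440 Hz tone
spec = spectrogram(signal, win_size=1024, hop_size=512)   # (n_frames, 513)
mel_spec = spec ** 2 @ mel_filterbank(sr, 1024).T         # (n_frames, 40)
log_mel = np.log(mel_spec + 1e-10)                        # log mel spectrogram
```

Applying a discrete cosine transform to each `log_mel` frame would yield the MFCCs; shrinking the hop size increases time resolution at the cost of more frames, which is exactly the trade-off the hop/window search in the paper explores.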
Acknowledgements
We would like to thank Türk Telekom Research Center for providing hardware components for the experiments. Dr. Kalkan is supported by the BAGEP Award of the Science Academy, Turkey.
Ethics declarations
Conflict of Interests
Selver Ezgi Küçükbay, Adnan Yazıcı and Sinan Kalkan declare that they have no conflict of interest.
Cite this article
Küçükbay, S.E., Yazıcı, A. & Kalkan, S. Hand-crafted versus learned representations for audio event detection. Multimed Tools Appl 81, 30911–30930 (2022). https://doi.org/10.1007/s11042-022-12873-5