Abstract
Audio Event Detection (AED) is the task of identifying the types of events present in audio signals. AED is essential for applications that make decisions based on audio, which can be critical, for example, in health, surveillance and security settings. Despite the proven benefits of deep learning for learning the best representation for a problem, AED studies still generally employ hand-crafted representations even when deep learning is used to solve the AED task itself. Intrigued by this, we investigate whether hand-crafted representations (i.e., spectrogram, mel spectrogram, log mel spectrogram and mel frequency cepstral coefficients) are better than a representation learned with a Convolutional Autoencoder (CAE). To the best of our knowledge, our study is the first to ask this question and to thoroughly compare feature representations for AED. To this end, we first find the best hop size and window size for each hand-crafted representation, and then compare the optimized hand-crafted representations with CAE-learned representations. Our extensive analyses on a subset of the AudioSet dataset confirm the common practice: hand-crafted representations do perform better than learned features, by a large margin (∼30 AP). Moreover, we show that the commonly used window and hop sizes do not yield optimal performance for the hand-crafted representations.
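The hand-crafted representations named above are all derived from a windowed short-time Fourier transform, so the window and hop sizes being tuned directly control the time–frequency resolution of the features. Below is a minimal NumPy sketch of how a spectrogram and a log mel spectrogram could be computed; the window size (1024), hop size (512) and number of mel bands (40) are illustrative assumptions, not the settings used in the paper.

```python
import numpy as np

def spectrogram(signal, win_size=1024, hop_size=512):
    """Magnitude spectrogram via a Hann-windowed STFT."""
    window = np.hanning(win_size)
    n_frames = 1 + (len(signal) - win_size) // hop_size
    frames = np.stack([signal[i * hop_size : i * hop_size + win_size] * window
                       for i in range(n_frames)])
    # One row per frame, one column per frequency bin (win_size // 2 + 1 bins)
    return np.abs(np.fft.rfft(frames, axis=1))

def hz_to_mel(f):
    # Mel scale (perceptual pitch scale), in its common logarithmic form
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(sr, win_size, n_mels=40):
    """Triangular filters spaced evenly on the mel scale."""
    mel_pts = np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)
    bins = np.floor((win_size + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_mels, win_size // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    return fb

sr = 16000
signal = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)  # 1 s of a 440 Hz tone
spec = spectrogram(signal, win_size=1024, hop_size=512)   # (n_frames, 513)
mel_spec = spec ** 2 @ mel_filterbank(sr, 1024).T         # (n_frames, 40)
log_mel = np.log(mel_spec + 1e-10)                        # log mel spectrogram
```

Applying a discrete cosine transform to each `log_mel` frame would yield the MFCCs; shrinking the hop size increases time resolution at the cost of more frames, which is exactly the trade-off the hop/window search in the paper explores.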
Acknowledgements
We would like to thank Türk Telekom Research Center for providing hardware components for the experiments. Dr. Kalkan is supported by the BAGEP Award of the Science Academy, Turkey.
Ethics declarations
Conflict of Interests
Selver Ezgi Küçükbay, Adnan Yazıcı and Sinan Kalkan declare that they have no conflict of interest.
Cite this article
Küçükbay, S.E., Yazıcı, A. & Kalkan, S. Hand-crafted versus learned representations for audio event detection. Multimed Tools Appl 81, 30911–30930 (2022). https://doi.org/10.1007/s11042-022-12873-5