Abstract
Environmental Sound Classification (ESC) is a challenging task in the audio field because of the wide variety of ambient sounds involved. In this paper, we propose an ESC method based on the CAR-Transformer neural network model, comprising three stages: sound-sample pre-processing, deep-learning-based feature extraction, and classification. We convert the one-dimensional audio signal into two-dimensional Mel-Frequency Cepstral Coefficients (MFCCs) and use them as the audio feature map. The CAR-Transformer model extracts features from this map, and after dimensionality reduction a fully connected layer serves as the classifier to produce the final result. The method achieves a classification accuracy of 96.91% on the UrbanSound8K dataset while using only 0.16 M parameters, and we compare these results with other state-of-the-art research.
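The pre-processing stage the abstract describes — turning a one-dimensional audio signal into a two-dimensional MFCC feature map — can be sketched in plain NumPy. This is a minimal illustrative implementation, not the paper's code; the frame length, hop size, filter count, and coefficient count below are assumptions chosen for the example, not the settings used in the paper.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=22050, n_fft=1024, hop=512, n_mels=40, n_mfcc=13):
    """Convert a 1-D audio signal into a 2-D MFCC feature map
    of shape (n_mfcc, n_frames)."""
    # Slice the signal into overlapping Hann-windowed frames
    n_frames = 1 + (len(signal) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Triangular mel filterbank spanning 0 Hz .. Nyquist
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    # Log mel energies, then a DCT-II to decorrelate the filter outputs,
    # keeping only the first n_mfcc coefficients
    log_mel = np.log(power @ fbank.T + 1e-10)            # (n_frames, n_mels)
    n = np.arange(n_mels)
    dct = np.cos(np.pi / n_mels * (n[:, None] + 0.5) * np.arange(n_mfcc)[None, :])
    return (log_mel @ dct).T                             # (n_mfcc, n_frames)
```

For a one-second signal at 22,050 Hz with these settings, the result is a 13 × 42 feature map that can be fed to the feature-extraction network as a two-dimensional "image" of the audio.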
Acknowledgements
This work was supported by the Hunan Key Laboratory of Intelligent Logistics Technology (2019TP1015).
About this article
Cite this article
Li, H., Chen, A., Yi, J. et al. Environmental Sound Classification Based on CAR-Transformer Neural Network Model. Circuits Syst Signal Process 42, 5289–5312 (2023). https://doi.org/10.1007/s00034-023-02339-w