Abstract
Recently, massive architectures based on Convolutional Neural Networks (CNNs) and self-attention mechanisms have become the norm for audio classification. While these techniques are state-of-the-art, their effectiveness comes at the cost of huge parameter counts and computational budgets, heavy data augmentation, transfer learning from large datasets, and other tricks. Exploiting the lightweight nature of audio, we propose an efficient network structure called the Paired Inverse Pyramid Structure (PIP) and a network built on it, the Paired Inverse Pyramid Structure MLP Network (PIPMN), to overcome these problems. With only 1 million parameters, the PIPMN reaches 95.5% Environmental Sound Classification (ESC) accuracy on the UrbanSound8K dataset and 93.2% Music Genre Classification (MGC) accuracy on the GTZAN dataset. Both results are achieved without data augmentation or transfer learning. Under this setting, the PIPMN matches or even exceeds other state-of-the-art models with far fewer parameters. The code is available at https://github.com/JNAIC/PIPMN.
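The abstract itself does not spell out the PIPMN architecture, but the flavor of a lightweight residual MLP operating on a log-mel spectrogram, with hidden widths that grow with depth (an "inverse pyramid"), can be illustrated with a minimal NumPy sketch. This is a generic illustration under assumed dimensions (128 time frames, 64 mel bins, hidden widths 128 and 256), not the authors' actual PIPMN implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def gelu(x):
    """Tanh approximation of the GELU activation (Hendrycks & Gimpel)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def mlp_block(x, dim, hidden):
    """One residual MLP block: expand dim -> hidden, GELU, project back to dim."""
    w1 = rng.normal(0.0, 0.02, (dim, hidden)); b1 = np.zeros(hidden)
    w2 = rng.normal(0.0, 0.02, (hidden, dim)); b2 = np.zeros(dim)
    y = gelu(x @ w1 + b1) @ w2 + b2
    n_params = w1.size + b1.size + w2.size + b2.size
    return x + y, n_params  # residual connection preserves the input shape

# A toy log-mel spectrogram patch: 128 time frames x 64 mel bins (assumed sizes).
x = rng.normal(0.0, 1.0, (128, 64))

# Hidden widths grow with depth -- the "inverse pyramid" idea in this sketch.
total = 0
for hidden in (128, 256):
    x, p = mlp_block(x, 64, hidden)
    total += p

print(x.shape)  # (128, 64) -- unchanged by the residual blocks
print(total)    # 49664 weights for the two blocks: far below the 1M budget
```

The point of the sketch is the accounting: pure MLP blocks over a compact spectrogram keep the parameter count orders of magnitude below CNN/transformer baselines, which is the regime the abstract claims for the full model.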
Y. Chen, Y. Zhu and Z. Yan—contributed equally to the paper.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Chen, Y. et al. (2023). Effective Audio Classification Network Based on Paired Inverse Pyramid Structure and Dense MLP Block. In: Huang, DS., Premaratne, P., Jin, B., Qu, B., Jo, KH., Hussain, A. (eds) Advanced Intelligent Computing Technology and Applications. ICIC 2023. Lecture Notes in Computer Science, vol 14087. Springer, Singapore. https://doi.org/10.1007/978-981-99-4742-3_6
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-4741-6
Online ISBN: 978-981-99-4742-3
eBook Packages: Computer Science, Computer Science (R0)