Effective Audio Classification Network Based on Paired Inverse Pyramid Structure and Dense MLP Block

  • Conference paper
Advanced Intelligent Computing Technology and Applications (ICIC 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14087)

Abstract

Recently, massive architectures based on Convolutional Neural Networks (CNNs) and self-attention mechanisms have become necessary for audio classification. Although these techniques are state-of-the-art, their effectiveness can be guaranteed only with huge computational costs and parameter counts, large amounts of data augmentation, transfer learning from large datasets, and other tricks. By exploiting the lightweight nature of audio, we propose an efficient network structure called the Paired Inverse Pyramid Structure (PIP) and a network called the Paired Inverse Pyramid Structure MLP Network (PIPMN) to overcome these problems. The PIPMN reaches 95.5% accuracy for Environmental Sound Classification (ESC) on the UrbanSound8K dataset and 93.2% accuracy for Music Genre Classification (MGC) on the GTZAN dataset, with only 1 million parameters. Both results are achieved without data augmentation or transfer learning. Under this setting, the PIPMN matches or even exceeds other state-of-the-art models with far fewer parameters. The code is available at https://github.com/JNAIC/PIPMN.

Y. Chen, Y. Zhu and Z. Yan contributed equally to the paper.

Author information

Corresponding author

Correspondence to Yunhao Chen.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Chen, Y. et al. (2023). Effective Audio Classification Network Based on Paired Inverse Pyramid Structure and Dense MLP Block. In: Huang, DS., Premaratne, P., Jin, B., Qu, B., Jo, KH., Hussain, A. (eds) Advanced Intelligent Computing Technology and Applications. ICIC 2023. Lecture Notes in Computer Science, vol 14087. Springer, Singapore. https://doi.org/10.1007/978-981-99-4742-3_6

  • DOI: https://doi.org/10.1007/978-981-99-4742-3_6

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-4741-6

  • Online ISBN: 978-981-99-4742-3

  • eBook Packages: Computer Science, Computer Science (R0)
