Abstract
Music genre classification (MGC) is an indispensable branch of music information retrieval. With the prevalence of end-to-end learning, research on MGC has made notable breakthroughs. However, the limited receptive field of a convolutional neural network (CNN) cannot capture correlations between the temporal frames and the frequency bands of a song. Meanwhile, the time–frequency information carried by different channels is not equally important. To address these problems, we apply dual parallel attention (DPA) in CNN-5 to focus on global dependencies. First, we propose parallel channel attention (PCA) to build global time–frequency dependencies within a song and study the influence of different weighting methods on PCA. Next, we design dual parallel attention, which focuses on global time–frequency dependencies in the song and adaptively calibrates the contribution of each channel to the feature map. Then, we analyze how the number and position of DPA modules in CNN-5 affect performance and compare DPA with multiple attention mechanisms. The results on the GTZAN dataset demonstrate that the proposed method achieves a classification accuracy of 91.4%, with DPA delivering the highest performance among the compared mechanisms.
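The paper's exact DPA architecture is not reproduced in this excerpt. As a rough illustration only, under assumptions of our own (energy-based branch weights, sigmoid channel gating; the function name and shapes are hypothetical), a parallel time–frequency attention with channel calibration on a spectrogram-like feature map might be sketched as:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dual_parallel_attention(x):
    """Toy sketch of dual parallel attention on a feature map
    x of shape (channels, freq, time); not the paper's exact design."""
    # Temporal branch: weight each time frame by its global energy.
    time_w = softmax(x.mean(axis=(0, 1)))               # shape (time,)
    # Spectral branch: weight each frequency bin likewise.
    freq_w = softmax(x.mean(axis=(0, 2)))               # shape (freq,)
    # Parallel time-frequency attention: sum of the two attended maps.
    tf_attended = x * freq_w[None, :, None] + x * time_w[None, None, :]
    # Channel branch: global average pool, then sigmoid-gate each channel,
    # loosely analogous to squeeze-and-excitation calibration.
    chan_gate = 1.0 / (1.0 + np.exp(-x.mean(axis=(1, 2))))  # (channels,)
    return tf_attended * chan_gate[:, None, None]
```

The two time–frequency branches run in parallel rather than in sequence, which is the structural idea the abstract emphasizes; the channel gate then rescales each channel's contribution to the output feature map.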
Notes
In [36], GTZAN is divided into training and test sets in a 9:1 ratio and ten-fold cross-validation is adopted; the average of the ten test results is taken as the final result, which is consistent with the strategy of our paper. We treat the test-set results as final once model training is complete. Unfortunately, we were unable to reproduce the classification accuracy reported in their paper, possibly because insufficient training details are provided.
It is worth noting that Non-local and DANet use the code provided by their authors, while FLA and PTS-A are implemented from the descriptions in their papers. All attention mechanisms are applied in CNN-5 in the same way for comparison.
References
Ashraf M et al (2020) A Globally Regularized Joint Neural Architecture for Music Classification. IEEE Access 8:220980–220989
Cai X, Zhang H (2022) Music genre classification based on auditory image, spectral and acoustic features. Multimedia Syst 28(3):779–791
Downie JS (2003) Music information retrieval. Ann Rev Inf Sci Technol 37(1):295–340
Fu Z et al (2011) A Survey of Audio-Based Music Classification and Annotation. IEEE Trans Multimedia 13(2):303–319
Gao Y (2020) Research on Music Audio Classification Based on Deep Learning. South China University of Technology Guangzhou, China
Gardner MW, Dorling S (1998) Artificial neural networks (the multilayer perceptron)—a review of applications in the atmospheric sciences. Atmos Environ 32(14–15):2627–2636
Huang G-B, Zhu Q-Y, Siew C-K (2006) Extreme learning machine: theory and applications. Neurocomputing 70(1–3):489–501
Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. Adv Neural Inf Process Syst 25:1097–1105
Promane BC (2009) Freddie mercury and queen: Technologies of genre and the poetics of innovation. University of Western Ontario, School of Graduate and Postdoctoral Studies
Sarikaya R, Hinton GE, Deoras A (2014) Application of deep belief networks for natural language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing 22(4):778–784
Scalvenzi RR, Guido RC, Marranghello N (2019) Wavelet-packets associated with support vector machine are effective for monophone sorting in music signals. Int. J. Semant. Comput. 13(03):415–425
Tzanetakis G, Cook P (2002) Musical genre classification of audio signals. IEEE Transactions on speech and audio processing 10(5):293–302
Weiss K, Khoshgoftaar TM, Wang D (2016) A survey of transfer learning. Journal of Big data 3(1):1–40
Yu Y et al (2020) Deep attention based music genre classification. Neurocomputing 372:84–91
Zhang X et al (2019) Spectrogram-frame linear network and continuous frame sequence for bird sound classification. Eco Inform 54:101009
Zhang Z et al (2021) Attention based convolutional recurrent neural network for environmental sound classification. Neurocomputing 453:896–903
Schedl M, Gómez Gutiérrez E, Urbano J (2014) Music information retrieval: Recent developments and applications. Found Trends Inf Retr 8(2–3):127–261
Ndou N, Ajoodha R, Jadhav A (2021) Music genre classification: A review of deep-learning and traditional machine-learning approaches. in 2021 IEEE International IOT, Electronics and Mechatronics Conference (IEMTRONICS). IEEE
Gupta R, Yadav J, Kapoor C (2021) Music information retrieval and intelligent genre classification. in Proceedings of International Conference on Intelligent Computing, Information and Control Systems. Springer
Pálmason H et al (2017) Music genre classification revisited: An in-depth examination guided by music experts. in International Symposium on Computer Music Multidisciplinary Research. Springer
Baniya BK, Ghimire D, Lee J (2014) A novel approach of automatic music genre classification based on timbral texture and rhythmic content features. in 16th International Conference on Advanced Communication Technology. IEEE
Arabi AF, Lu G (2009) Enhanced polyphonic music genre classification using high level features. in 2009 IEEE International Conference on Signal and Image Processing Applications. IEEE
Saunders C et al (1998) Support vector machine reference manual
Sarkar R, Saha SK (2015) Music genre classification using EMD and pitch based feature. in 2015 Eighth International Conference on Advances in Pattern Recognition (ICAPR). IEEE
Vaswani A et al (2017) Attention is all you need. in Advances in neural information processing systems
He K et al (2016) Deep residual learning for image recognition. in Proceedings of the IEEE conference on computer vision and pattern recognition
Piczak KJ (2015) Environmental sound classification with convolutional neural networks. in 2015 IEEE 25th international workshop on machine learning for signal processing (MLSP) IEEE
Himawan I, Towsey M, Roe P (2018) 3D convolution recurrent neural networks for bird sound detection. in Proceedings of the 3rd Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE)
Kahl S et al (2017) Large-Scale Bird Sound Classification using Convolutional Neural Networks, in CLEF (working notes)
Yang B (2008) A study of inverse short-time Fourier transform. in 2008 IEEE Int. Conf. Acoust. Speech Signal Process. IEEE
Zhang W et al (2016) Improved Music Genre Classification with Convolutional Neural Networks, in Interspeech 2016. 3304–3308
Choi K et al (2017) Convolutional recurrent neural networks for music classification. in 2017 IEEE Int. Conf. Acoust. Speech Signal Process (ICASSP) IEEE
Cho K et al (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078
Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit
Yang H, Zhang W-Q (2019) Music Genre Classification Using Duplicated Convolutional Layers in Neural Networks. in Interspeech 2019, 3382–3386
Chang P-C, Chen Y-S, Lee C-H (2021) MS-SincResNet: Joint Learning of 1D and 2D Kernels Using Multi-scale SincNet and ResNet for Music Genre Classification. in Proceedings of the 2021 Int Conf Multimed Retr, 29–36
Choi K et al (2017) Transfer learning for music classification and regression tasks. arXiv preprint arXiv:1703.09179
Srinivasu PN et al (2022) Ambient Assistive Living for Monitoring the Physical Activity of Diabetic Adults through Body Area Networks. Mob. Inf. Syst 2022
Wang X et al (2018) Non-local neural networks. in Proceedings of the IEEE Conf. Comput. Vis. Pattern Recognit
Wang H et al (2019) Environmental sound classification with parallel temporal-spectral attention. arXiv preprint arXiv:1912.06808
Huang Z et al (2022) ADFF: Attention Based Deep Feature Fusion Approach for Music Emotion Recognition. arXiv preprint arXiv:2204.05649
Dosovitskiy A et al (2020) An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929
Gong Y, Chung Y-A, Glass J (2021) AST: Audio spectrogram transformer. arXiv preprint arXiv:2104.01778
Yang L, Zhao H (2021) Sound Classification Based on Multihead Attention and Support Vector Machine. Math Probl Eng 2021
Lin M, Chen Q, Yan S (2013) Network in network. arXiv preprint arXiv:1312.4400
Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. in International Conference on Machine Learning. PMLR
Nair V, Hinton GE (2010) Rectified linear units improve restricted boltzmann machines. in Icml
Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980
Zhang P et al (2015) A Deep Neural Network for Modeling Music, in Proceedings of the 5th ACM on International Conference on Multimedia Retrieval 379–386
Karunakaran N, Arya A (2018) A scalable hybrid classifier for music genre classification using machine learning concepts and spark. in 2018 Int Confe Intell Auton Syst (ICoIAS) IEEE
Fu J et al (2019) Dual attention network for scene segmentation. in Proceedings of the IEEE/CVF Conf. Comput. Vis. Pattern Recognit
Funding
This work was supported by Postgraduate Scientific Research Innovation Project of Hunan Province (CX20210879), Postgraduate Scientific Research Innovation Project of Central South University of Forestry and Technology (CX202102059) and Hunan Key Laboratory of Intelligent Logistics Technology (2019TP1015).
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wen, Z., Chen, A., Zhou, G. et al. Parallel attention of representation global time–frequency correlation for music genre classification. Multimed Tools Appl 83, 10211–10231 (2024). https://doi.org/10.1007/s11042-023-16024-2