Music genre classification based on auditory image, spectral and acoustic features

  • Regular Paper
  • Multimedia Systems

Abstract

Music genre is one of the conventional ways to describe music content and one of the important labels in music information retrieval. An effective and precise music genre classification method is therefore urgently needed for the automatic organization of large music archives. Humans, even those with little musical literacy, recognize music genres well, an ability that may be attributed to the auditory system. Inspired by this, this paper proposes a novel classification framework that combines an auditory image feature with traditional acoustic features and a spectral feature to improve classification accuracy. In detail, the auditory image feature is extracted with the auditory image model, which simulates the auditory system of the human ear and which, to the best of our knowledge, has so far been used successfully only in fields other than music genre classification. Moreover, the spectral feature is extracted from a logarithmic-frequency spectrogram rather than a linear one, so that information in the low-frequency part is captured adequately. These two features and the traditional acoustic features are evaluated and compared individually, and finally fused, on the GTZAN, GTZAN-NEW, ISMIR2004 and Homburg datasets. Experimental results show that the proposed method achieves higher classification accuracy and better stability than many state-of-the-art classification methods.
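To illustrate why a logarithmic frequency axis gives the low-frequency region more weight, the following is a minimal sketch and not the authors' implementation. It pools the linear DFT bins of one frame into log-spaced bands. The function names, the Hann window, the 50 Hz lower edge, the 24 bands and the mean pooling are all assumptions chosen for illustration; the abstract does not specify any of them.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 of one Hann-windowed frame."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed)))
            for k in range(n // 2 + 1)]


def log_band_edges(f_min, f_max, n_bands):
    """Band edges spaced uniformly in log-frequency between f_min and f_max."""
    ratio = (f_max / f_min) ** (1.0 / n_bands)
    return [f_min * ratio ** b for b in range(n_bands + 1)]


def log_frequency_spectrum(frame, sample_rate, f_min=50.0, n_bands=24):
    """Pool linear DFT bins into log-spaced bands (mean magnitude per band).

    f_min and n_bands are illustrative defaults, not values from the paper.
    """
    n = len(frame)
    mags = magnitude_spectrum(frame)
    edges = log_band_edges(f_min, sample_rate / 2.0, n_bands)
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        members = [m for k, m in enumerate(mags)
                   if lo <= k * sample_rate / n < hi]
        if not members:
            # Band is narrower than one DFT bin: fall back to the nearest bin.
            centre = math.sqrt(lo * hi)
            nearest = min(len(mags) - 1, round(centre * n / sample_rate))
            members = [mags[nearest]]
        bands.append(sum(members) / len(members))
    return bands, edges


# Demo: a 200 Hz tone sampled at 8 kHz, analysed in a single 512-sample frame.
sample_rate, frame_len = 8000, 512
tone = [math.sin(2 * math.pi * 200.0 * i / sample_rate)
        for i in range(frame_len)]
bands, edges = log_frequency_spectrum(tone, sample_rate)
peak = max(range(len(bands)), key=bands.__getitem__)
low_bands = sum(1 for e in edges[1:] if e <= 1000.0)
print(f"strongest band spans {edges[peak]:.1f} to {edges[peak + 1]:.1f} Hz")
print(f"{low_bands} of {len(bands)} bands lie entirely below 1 kHz")
```

With these assumed settings, most of the bands fall below 1 kHz, whereas a linear axis over the same range would devote only a quarter of its bins to that region. A full spectrogram would repeat this pooling over overlapping frames; in practice an FFT routine would replace the naive DFT used here to keep the sketch dependency-free.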

Acknowledgements

We thank all the referees and the editorial board members for their insightful comments and suggestions, which improved our paper significantly. This study was funded by the National Natural Science Foundation of China under Grant No. 11501351.

Author information

Corresponding author

Correspondence to Hongjuan Zhang.

Additional information

Communicated by P. Pala.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Cai, X., Zhang, H. Music genre classification based on auditory image, spectral and acoustic features. Multimedia Systems 28, 779–791 (2022). https://doi.org/10.1007/s00530-021-00886-3

  • DOI: https://doi.org/10.1007/s00530-021-00886-3
