Machine learning for music genre: multifaceted review and experimentation with AudioSet

Abstract

Music genre classification is a sub-discipline of music information retrieval (MIR) whose popularity among researchers continues to grow, largely because of its still-open challenges. Although research has been prolific in terms of the number of published works, the topic still suffers from a problem in its foundations: there is no clear, formal definition of what a genre is. Music categorizations are vague and unclear, suffering from human subjectivity and a lack of agreement. In its first part, this paper offers a survey covering the many facets of the matter. Its main goal is to give the reader an overview of the history and the current state of the art, exploring the techniques and datasets used to date, as well as identifying open challenges such as the ambiguity of genre definitions and the introduction of human-centric approaches. The paper pays special attention to new trends in machine learning applied to the music annotation problem. Finally, we include a music genre classification experiment that compares different machine learning models on AudioSet.
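A model comparison of the kind described in the abstract can be illustrated with a minimal sketch. The snippet below is not the paper's actual experimental setup: it trains three scikit-learn classifiers on synthetic 128-dimensional feature vectors that merely stand in for AudioSet-style audio embeddings, with each "genre" simulated as a Gaussian cluster, and compares them by cross-validated accuracy. All data, model choices, and names here are illustrative assumptions.

```python
# Minimal sketch: comparing classifiers on AudioSet-style 128-d embeddings.
# The synthetic Gaussian clusters below are a stand-in for real embeddings.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_per_class, n_classes, dim = 100, 4, 128

# Each simulated "genre" is a Gaussian cluster in the 128-d feature space.
X = np.vstack([rng.normal(loc=c, scale=2.0, size=(n_per_class, dim))
               for c in range(n_classes)])
y = np.repeat(np.arange(n_classes), n_per_class)

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "svm": SVC(),
    "rf": RandomForestClassifier(n_estimators=100, random_state=0),
}

# 5-fold cross-validated mean accuracy for each model.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in models.items()}
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```

On real data one would replace the synthetic clusters with extracted audio embeddings and genre labels; the comparison loop itself is unchanged.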



Notes

  1. https://www.upf.edu/web/mtg/ismir2004-genre
  2. http://www.pandora.com/about/mgp
  3. https://github.com/ismir/mir-datasets

     Table 1 Datasets for MGC, ordered by year of publication, specifying the size of the dataset, the number of classes or labels and the features each dataset provides
  4. http://the.echonest.com/
  5. https://research.google.com/audioset/
  6. July 2019. Statistics gathered from https://acousticbrainz.org
  7. https://freesound.org
  8. https://www.freemusicarchive.org
  9. http://mirg.city.ac.uk/codeapps/the-magnatagatune-dataset
  10. http://tagatune.org
  11. https://www.last.fm
  12. https://www.allmusic.com
  13. https://www.discogs.com

References

  1. Abdallah, S.A. (2002). Towards music perception by redundancy reduction and unsupervised learning in probabilistic models. PhD thesis: Queen Mary University of London.

    Google Scholar 

  2. Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Cheng, Q., Chen, G., et al. (2016). Deep speech 2: End-to-end speech recognition in english and mandarin. In: International conference on machine learning, pp 173–182.

  3. An, Y., Sun, S., Wang, S. (2017). Naive Bayes classifiers for music emotion classification based on lyrics.

  4. Aucouturier, J.J., & Pachet, F. (2003). Representing musical genre: a state of the art. Journal of New Music Research, 32(1), 83–93.

    Google Scholar 

  5. Aucouturier, J.J., Pachet, F., Roy, P., Beurivé, A. (2007). Signal + context= better classification. In Proc. of the 8th ISMIR conference, Vienna, Austria, pp. 425–430.

  6. Basili, R., Serafini, A., Stellato, A. (2004). Classification of musical genre: a machine learning approach. In: Proc of the 5th ISMIR Conference, Barcelona, Spain.

  7. Bayle, Y., Maršík, L., Rusek, M., Robine, M., Hanna, P., Slaninová, K., Martinovic, J., Pokornỳ, J. (2017). Kara1k: A karaoke dataset for cover song identification and singing voice analysis. In 2017 IEEE International Symposium on Multimedia (ISM). IEEE, 177–184.

  8. Benetos, E., & Weyde, T. (2015). An efficient Temporally-Constrained probabilistic model for Multiple-Instrument music transcription. In Proc. of the 16th ISMIR conference, Malaga, Spain, pp. 701–707.

  9. Bertin-Mahieux, T., Ellis, D.P., Whitman, B., Lamere, P. (2011). The million song dataset. In: Proc. of the 12th ISMIR conference, Miami, USA, pp. 591–596.

  10. Böck, S, Krebs, F., Widmer, G. (2016). Joint beat and downbeat tracking with recurrent neural networks. In Proc. of the 17th ISMIR conference, New York City, USA, pp 255–261.

  11. Bogdanov, D., Wack, N., Gómez Gutiérrez, E., Gulati, S., Herrera Boyer, P., Mayor, O., Roma Trepat, G., Salamon, J., Zapata González, J.R., Serra, X. (2013). Essentia: An audio analysis library for music information retrieval. In Proc. of the 14th ISMIR conference, Curitiba, Brazil, pp. 493–498.

  12. Bogdanov, D., Porter, A., Urbano, J., Schreiber, H. (2017). The mediaeval 2017 acousticbrainz genre task: Content-based music genre recognition from multiple sources. In Proc. of the mediaeval 2016 Workshop. Dublin, Ireland.

  13. Bogdanov, D., Porter, A., Schreiber, H., Urbano, J., Oramas, S. (2019). The acousticBrainz genre dataset: multi-Source, multi-Level, multi-Label, and large-Scale. In: Proc of the 20th ISMIR Conference, Delft, The Netherlands.

  14. Bonnin, G., & Jannach, D. (2015). Automated generation of music playlists: Survey and experiments. ACM Computing Surveys (CSUR), 47(2), 26.

    Google Scholar 

  15. Borth, D., Ji, R., Chen, T., Breuel, T., Chang, S.F. (2013). Large-scale visual sentiment ontology and detectors using adjective noun pairs. In Proc. of the 21st ACM international conference on multimedia. Barcelona, Spain, pp. 223–232.

  16. Burges, C.J. (1998). A tutorial on support vector machines for pattern recognition. Data mining and knowledge discovery, 2(2), 121–167.

    Google Scholar 

  17. Burred, J.J., & Lerch, A. (2003). A hierarchical approach to automatic musical genre classification. In: Proc. of the 6th international conference on digital audio effects, pp. 8–11.

  18. Cano, P., Gómez Gutiérrez, E., Gouyon, F., Herrera Boyer, P., Koppenberger, M., Ong, B.S., Serra, X., Streich, S., Wack, N. (2006). ISMIR 2004 audio description contest. Tech rep., Universitat Pompeu Fabra, Music technology Group.

  19. Celma, O. (2010). Music recommendation. In: Music recommendation and discovery, Springer, pp. 43–85.

  20. Chang, K.K., Jang, J.S.R., Iliopoulos, C.S. (2010). Music genre classification via compressive sampling. In Proc. of the 11th ISMIR conference, Utrecht, Netherlands (pp. 387–392).

  21. Choi, K., Fazekas, G., Sandler, M. (2016). Automatic tagging using deep convolutional neural networks. In Proc. of the 17th ISMIR conference, New York City, USA, pp. 805–811.

  22. Choi, K., Fazekas, G., Sandler, M.B., Cho, K. (2017). Transfer learning for music classification and regression tasks. In Proc. of the 18th ISMIR conference, Suzhou, China, pp. 141–149.

  23. Chollet, F., & et al. (2015). Keras. https://keras.io.

  24. Chung, J., Gulcehre, C., Cho, K., Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. In: NIPS 2014 deep learning and representation learning workshop.

  25. Conneau, A., Schwenk, H., Barrault, L., LeCun, Y. (2016). Very deep convolutional networks for natural language processing. arXiv:1606.01781.

  26. Corrêa, D C, & Rodrigues, F.A. (2016). A survey on symbolic data-based music genre classification. Expert Systems with Applications, 60, 190–210.

    Google Scholar 

  27. Costa, Y.M., Oliveira, L.S., Silla, Jr C.N. (2017). An evaluation of convolutional neural networks for music classification using spectrograms. Applied soft computing, 52, 28–38.

    Google Scholar 

  28. De Clercq, T., & Temperley, D. (2011). A corpus analysis of rock harmony. Popular Music, 30(1), 47–70.

    Google Scholar 

  29. Dechter, R. (1986). Learning while searching in constraint-satisfaction problems. University of California, Computer Science Department, Cognitive Systems.

  30. Defferrard, M., Benzi, K., Vandergheynst, P., Bresson, X. (2017). FMA: A dataset for music analysis. In Proc. of the 18th ISMIR conference, Suzhou, China, pp 316–323.

  31. Delbouys, R., Hennequin, R., Piccoli, F., Royo-letelier, J., Moussallam, M. (2018). Music mood detection based on audio and lyrics with deep neural net. In Proc. of the 19th ISMIR conference, Paris, France, pp. 370–375.

  32. Deng, L., Yu, D., et al. (2014). Deep learning: methods and applications. Foundations and Trends in Signal Processing, 7(3–4), 197–387.

    MathSciNet  MATH  Google Scholar 

  33. Dieleman, S., & Schrauwen, B. (2014). End-to-end learning for music audio. In: 2014 IEEE ICASSP, pp 6964–6968.

  34. Downie, J.S. (2003). Music information retrieval. Annual review of information science and technology, 37(1), 295–340.

    Google Scholar 

  35. Egermann, H., Pearce, M.T., Wiggins, G.A., McAdams, S. (2013). Probabilistic models of expectation violation predict psychophysiological emotional responses to live concert music. Cognitive, Affective, & Behavioral Neuroscience, 13(3), 533–553.

    Google Scholar 

  36. Fabbri, F. (1999). Browsing music spaces: Categories and the musical mind. In: Proc. of int. association for the study of popular music.

  37. Fan, J., Tatar, K., Thorogood, M., Pasquier, P. (2017). Ranking-based emotion recognition for experimental music. In: Proc. of the 18th ISMIR conference, Suzhou, China, pp. 368–375.

  38. Fayek, H.M., Lech, M., Cavedon, L. (2017). Evaluating deep learning architectures for speech emotion recognition. Neural Networks, 92, 60–68.

    Google Scholar 

  39. Flores, M.J., Gámez, J A, Martínez, A.M. (2012). Supervised classification with bayesian networks: A review on models and applications. Intelligent data analysis for real-life applications: Theory and practice, pp. 72–102.

  40. Fonseca, E., Pons Puig, J., Favory, X., Font Corbera, F., Bogdanov, D., Ferraro, A., Oramas, S., Porter, A., Serra, X. (2017). Freesound datasets: a platform for the creation of open audio datasets. In Proc. of the 18th ISMIR conference, Suzhou, China, pp. 486–493.

  41. Font, F., Roma, G., Serra, X. (2013). Freesound technical demo. In: Proc. of the 21st ACM international conference on Multimedia, ACM, pp. 411–412.

  42. Fu, Z., Lu, G., Ting, K.M., Zhang, D. (2011). A survey of audio-based music classification and annotation. IEEE Trans on multimedia, 13(2), 303–319.

    Google Scholar 

  43. Gao, R., Feris, R., Grauman, K. (2018). Learning to separate object sounds by watching unlabeled video. In: Proc. of the European conference on computer vision (ECCV), pp 35–53.

  44. Gemmeke, J.F., Ellis, D.P.W., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M. (2017). Audio set: An ontology and human-labeled dataset for audio events. In: Proc of the IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 776–780.

  45. Genussov, M., & Cohen, I. (2010). Musical genre classification of audio signals using geometric methods. In: Signal processing conference, 2010 18th European, IEEE, pp. 497–501.

  46. Gibaja, E., & Ventura, S. (2015). A tutorial on multilabel learning. ACM Computing Surveys (CSUR), 47(3), 52.

    Google Scholar 

  47. Gȯmez, J.S., Abeßer, J., Cano, E. (2018). Jazz solo instrument classification with convolutional neural networks, source separation, and transfer learning. In Proc. of the 19th ISMIR conference, Paris, France, pp. 577–584.

  48. Gordon, A., Eban, E., Nachum, O., Chen, B., Wu, H., Yang, T.J., Choi, E. (2018). Morphnet: Fast & simple resource-constrained structure learning of deep networks. In: IEEE conference on computer vision and pattern recognition (CVPR).

  49. Gouvert, O., Oberlin, T., Fėvotte, C. (2018). Matrix co-factorization for cold-start recommendation. In Proc. of the 19th ISMIR conference, Paris, France, pp. 792–798.

  50. Gouyon, F., Dixon, S., Pampalk, E., Widmer, G. (2004). Evaluating rhythmic descriptors for musical genre classification. In: Proc. of the AES 25th international conference, pp. 196–204.

  51. Graves, A. (2012). Supervised sequence labelling. In: Supervised sequence labelling with recurrent neural networks, Springer, pp. 5–13.

  52. Graves, A., Mohamed, Ar, Hinton, G. (2013). Speech recognition with deep recurrent neural networks. In: IEEE ICASSP, pp. 6645–6649.

  53. Guaus, E. (2009). Audio content processing for automatic music genre classification: descriptors, databases, and classifiers. PhD thesis, Universitat Pompeu Fabra, Barcelona, Spain.

  54. Gururani, S., Summers, C., Lerch, A. (2018). Instrument activity detection in polyphonic music using deep neural networks. In Proc. of the 19th ISMIR conference, Paris, France, pp. 569–576.

  55. Hamel, P., & Eck, D. (2010). Learning features from music audio with deep belief networks. In Proc. of the 11th ISMIR conference, Utrecht, The Netherlands, pp. 339–344.

  56. Han, B.J., Rho, S., Jun, S., Hwang, E. (2010). Music emotion classification and context-based music recommendation. Multimedia Tools and Applications, 47(3), 433–460.

    Google Scholar 

  57. He, K., Zhang, X., Ren, S., Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In: Proc. of the IEEE international conference on computer vision, pp. 1026–1034.

  58. He, K., Zhang, X., Ren, S., Sun, J. (2016). Deep residual learning for image recognition. In: Proc. of the IEEE conference on computer vision and pattern recognition (CVPR), pp. 770–778.

  59. Henaff, M., Jarrett, K., Kavukcuoglu, K., Lecun, Y. (2011). Unsupervised learning of sparse features for scalable audio classification. In Proc. of the 12th ISMIR conference, Miami, USA, pp. 681–686.

  60. Hennequin, R., Royo-letelier, J., Moussallam, M. (2018). Audio based disambiguation of music genre tags. In Proc. of the 19th ISMIR conference, Paris, France, pp. 645–652.

  61. Herrera-Boyer, P., Peeters, G., Dubnov, S. (2003). Automatic classification of musical instrument sounds. J of New Music Research, 32(1), 3–21.

    Google Scholar 

  62. Hershey, S., Chaudhuri, S., Ellis, D.P., Gemmeke, J.F., Jansen, A., Moore, R.C., Plakal, M., Platt, D., Saurous, R.A., Seybold, B., et al. (2017). Cnn architectures for large-scale audio classification. In: Proc. of the IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 131–135.

  63. Hinton, G.E., Osindero, S., Teh, Y.W. (2006). A fast learning algorithm for deep belief nets. Neural computation, 18(7), 1527–1554.

    MathSciNet  MATH  Google Scholar 

  64. Hinton, G., Deng, L., Yu, D., Dahl, G.E., Ar, Mohamed, Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T.N., et al. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6), 82–97.

    Google Scholar 

  65. Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735–1780.

    Google Scholar 

  66. Hockman, J., Davies, M.E., Fujinaga, I. (2012). One In the jungle: Downbeat detection in hardcore, jungle, and drum and bass. In Proc. of the 13th ISMIR conference, Porto, Portugal, pp. 169–174.

  67. Hoffman, M.D., Blei, D.M., Cook, P.R. (2009). Easy as CBA: A simple probabilistic model for tagging music. In Proc. of the 10th ISMIR conference, Kobe, Japan, pp. 369–374.

  68. Hoffmann, P., & Kostek, B. (2016). Bass enhancement settings in portable devices based on music genre recognition. Journal of the Audio Engineering Society, 63(12), 980–989.

    Google Scholar 

  69. Hssina, B., Merbouha, A., Ezzikouri, H., Erritali, M. (2014). A comparative study of decision tree id3 and c4.5. International Journal of Advanced Computer Science and applications(IJACSA). Special Issue on Advances in Vehicular Ad Hoc Networking and App.lications, 4(2), 2014. https://doi.org/10.14569/SpecialIssue.2014.040203.

    Google Scholar 

  70. Huang, Y.S., Chou, S.Y., Yang, Y.H. (2017). Music thumbnailing via neural attention modeling of music emotion. In: Proc. Asia pacific signal and information processing association annual summit and conference, pp. 347–350.

  71. Hubel, D.H., & Wiesel, T.N. (1962). Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. The Journal of physiology, 160(1), 106–154.

    Google Scholar 

  72. Humphrey, E.J., Bello, J.P., LeCun, Y. (2013). Feature learning and deep architectures: New directions for music informatics. J of Intelligent Information Systems, 41(3), 461–481.

    Google Scholar 

  73. Iloga, S., Romain, O., Tchuenté, M. (2018). A sequential pattern mining app.roach to design taxonomies for hierarchical music genre recognition. Pattern Analysis and App.lications, 21(2), 363–380.

    Google Scholar 

  74. Jansen, A., Plakal, M., Pandya, R., Ellis, D., Hershey, S., Liu, J., Moore, C., Saurous, R.A. (2017). Towards learning semantic audio representations from unlabeled data. Signal, 2(3), 7–11.

    Google Scholar 

  75. Kingma, D.P., & Ba, J. (2015). Adam: a method for stochastic optimization. In Proc. of the 3rd international conference on learning representations, ICLR 2015 San Diego, CA USA.

  76. Kitahara, T. (2017). Music generation using bayesian networks. In Altun, Y., Das, K., Mielikäinen, T., Malerba, D., Stefanowski, J., Read, J., žitnik, M., Ceci, M., Džeroski, S. (Eds.) Machine learning and knowledge discovery in databases (pp. 368–372). Cham: Springer International Publishing.

  77. Knees, P., & Schedl, M. (2013). A survey of music similarity and recommendation from music context data. ACM Trans on Multimedia Computing, Communications, and Applications (TOMM), 10(1), 1–21.

    Google Scholar 

  78. Koenigstein, N., Dror, G., Koren, Y. (2011). Yahoo! music recommendations: modeling music ratings with temporal dynamics and item taxonomy. In: Proc. of the 5th ACM conference on recommender systems, ACM, pp. 165–172.

  79. Kong, Q., Xu, Y., Wang, W., Plumbley, M.D. (2018). Audio set classification with attention model: A probabilistic perspective. In Proc. of the IEEE international conference on acoustics, speech and signal processing, ICASSP, IEEE, pp. 316–320.

  80. Korvel, G., Treigys, P., Tamulevicus, G., Bernataviciene, J., Kostek, B. (2018). Analysis of 2d feature spaces for deep learning-based speech recognition. Journal of the Audio Engineering Society, 66(12), 1072–1081.

    Google Scholar 

  81. Kostek, B., Kupryjanow, A., Zwan, P., Jiang, W., Ras, Z.W., Wojnarski, M., Swietlicka, J. (2011). Report Of the ISMIS 2011 contest: Music information retrieval. In Proc. of the 19th ISMIS conference, Warsaw, Poland (pp. 715–724).

  82. Kostek, B., Hoffmann, P., Kaczmarek, A., Spaleniak, P. (2014). Creating a reliable music discovery and recommendation system. In: Intelligent tools for building a scientific information platform, From Research to Implementation, Springer, pp. 107–130.

  83. Kotropoulos, C., Arce, G.R., Panagakis, Y. (2010). Ensemble discriminant sparse projections applied to music genre classification. In: International conference on pattern recognition, IEEE, pp. 822–825.

  84. Krizhevsky, A., Sutskever, I., Hinton, G.E. (2012). Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, pp. 1097–1105.

  85. Längkvist, M, Karlsson, L., Loutfi, A. (2014). A review of unsupervised feature learning and deep learning for time-series modeling. Pattern Recognition Letters, 42, 11–24.

    Google Scholar 

  86. Larose, D.T., & Larose, C.D. (2014). Discovering knowledge in data: An introduction to data mining. New York: Wiley.

    Google Scholar 

  87. Laurier, C., Meyers, O., Serra, J., Blech, M., Herrera, P. (2009). Music mood annotator design and integration. In: 7th International Workshop on Content-Based Multimedia Indexing. CBMI’09., IEEE, pp. 156–161.

  88. Law, E., & Von Ahn, L. (2009). Input-agreement: a new mechanism for collecting data using human computation games. In: Proc of the SIGCHI conference on human factors in computing systems, ACM, pp. 1197–1206.

  89. Law, E., West, K., Mandel, M.I., Bay, M., Downie, J.S. (2009). Evaluation Of algorithms using games: The case of music tagging. In Proc. of the 10th ISMIR conference, Kobe, Japan, pp. 387–392.

  90. Lee, H., Pham, P., Largman, Y., Ng, A.Y. (2009). Unsupervised feature learning for audio classification using convolutional deep belief networks. In: Advances in neural information processing systems, pp. 1096–1104.

  91. Levy, M., & Sandler, M. (2007). A semantic space for music derived from social tags. Austrian Computer Society, 1, 12–17.

    Google Scholar 

  92. Li, T., Ogihara, M., Li, Q. (2003). A comparative study on content-based music genre classification. In: Proc. of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval, ACM, pp. 282–289.

  93. Libeks, J., & Turnbull, D. (2011). You can judge an artist by an album cover: Using images for music annotation. IEEE MultiMedia, 18(4), 30–37.

    Google Scholar 

  94. Liem, C.C.S., Orio, N., Peeters, G., Schedl, M. (2013). Musiclef 2013: Soundtrack Selection for commercials. In Proc. of the mediaeval 2013 multimedia benchmark workshop, Barcelona, Spain, October 18-19 2013.

  95. Logan, B., & et al. (2000). Mel frequency cepstral coefficients for music modeling. In: Proc of the 1st ISMIR conference, Plymouth, USA.

  96. Mandel, M.I., & Ellis, D. (2005). Song-level features and supp.ort vector machines for music classification. In Proc. of the 6th ISMIR conference, London, UK, pp. 594–599.

  97. Mandel, M.I., & Ellis, D.P. (2008). A web-based game for collecting music metadata. J of New Music Research, 37(2), 151–165.

    Google Scholar 

  98. Marchand, U., & Peeters, G. (2014). The modulation scale spectrum and its application to rhythm-content description. In: Proc. of the 17th international conference on digital audio effects, pp. 167–172.

  99. Mayer, R., Neumayer, R., Rauber, A. (2008). Rhyme and style features for musical genre classification by song lyrics. In Proc. of the 9th ISMIR conference, Philadelphia, USA (pp. 337–342).

  100. McFee, B., & Lanckriet, G.R. (2009). Heterogeneous embedding for subjective artist similarity. In Proc. of the 10th ISMIR conference, Kobe, Japan, pp. 513–518.

  101. McFee, B., & Lanckriet, G.R. (2011). The natural language of playlists. In Proc. of the 12th ISMIR conference, Miami, USA, pp. 537–542.

  102. McFee, B., Bertin-Mahieux, T., Ellis, D.P., Lanckriet, G.R. (2012). The million song dataset challenge. In: Proc. of the 21st international conference on world wide web, ACM, pp. 909–916.

  103. McKay, C., & Fujinaga, I. (2006). Musical genre classification: Is it worth pursuing and how can it be improved?. In Proc. of the 7th ISMIR conference, Victoria, Canada, pp. 101–106.

  104. Medhat, F., Chesmore, D., Robinson, J. (2017). Masked conditional neural networks for audio classification. In: International conference on artificial neural networks, Springer, pp. 349–358.

  105. Menendez, J.A. (2016). Towards a computational account of art cognition: unifying perception, visual art, and music through bayesian inference. Electronic Imaging, 2016 (16), 1–10.

    Google Scholar 

  106. Meyer, L.B. (1957). Meaning in music and information theory. The Journal of Aesthetics and Art Criticism, 15(4), 412–424.

    Google Scholar 

  107. Moore, A.F. (2001). Categorical conventions in music discourse: Style and genre. Music and Letters, 82(3), 432–442.

    Google Scholar 

  108. Müller, M. (2015). Fundamentals of music processing: Audio, analysis, algorithms, applications. Berlin: Springer.

    Google Scholar 

  109. Nair, V., & Hinton, G.E. (2010). Rectified linear units improve restricted boltzmann machines. In: Proc. of the 27th international conference on machine learning (ICML-10), pp. 807–814.

  110. Nanni, L., Costa, Y.M., Lumini, A., Kim, M.Y., Baek, S.R. (2016). Combining visual and acoustic features for music genre classification. Expert Systems with App.lications, 45, 108–117.

    Google Scholar 

  111. Nanni, L., Costa, Y.M., Aguiar, R.L., Silla, Jr C.N., Brahnam, S. (2018). Ensemble of deep learning, visual and acoustic features for music genre classification. J of New Music Research, pp. 1–15.

  112. Ness, S.R., Theocharis, A., Tzanetakis, G., Martins, L.G. (2009). Improving automatic music tag annotation using stacked generalization of probabilistic svm outputs. In: Proc. of the 17th ACM international conference on Multimedia, pp. 705–708.

  113. Oliphant, T.E. (2006). A guide to NumPy, vol 1. Trelgol Publishing USA.

  114. Olshausen, B.A., & Field, D.J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583), 607–609.

    Google Scholar 

  115. Oramas, S., Nieto, O., Barbieri, F., Serra, X. (2017). Multi-label music genre classification from audio, text and images using deep features. In Proc. of the 18th ISMIR conference, Suzhou, China, pp. 23–30.

  116. Pachet, F., & Cazaly, D. (2000). A taxonomy of musical genres. In Content-based multimedia information access-volume 2, pp. 1238–1245.

  117. Pálmason, H, Jónsson, B.Þ., Amsaleg, L., Schedl, M., Knees, P. (2017a). On competitiveness of nearest-neighbor-based music classification: A methodological critique. In: International conference on similarity search and applications, Springer, pp. 275–283.

  118. Pálmason, H., Jónsson, B.Þ., Schedl, M., Knees, P. (2017b). Music genre classification revisited: An in-depth examination guided by music experts. In: International Symposium on Computer Music Multidisciplinary Research, pp. 49–62.

  119. Panagakis, Y., & Kotropoulos, C. (2010). Music genre classification via topology preserving non-negative tensor factorization and sparse representations. In: Proc of ICASSP, IEEE, pp. 249–252.

  120. Panagakis, Y., Kotropoulos, C., Arce, G.R. (2009). Music genre classification using locality preserving non-negative tensor factorization and sparse representations. In: Proc. of the 10th ISMIR conference, Kobe, Japan, pp. 249–254.

  121. Park, H.S., Yoo, J.O., Cho, S.B. (2006). A context-aware music recommendation system using fuzzy bayesian networks with utility theory. In: International conference on Fuzzy systems and knowledge discovery, Springer, pp. 970–979.

  122. Paulus, J., & Klapuri, A. (2009). Music structure analysis using a probabilistic fitness measure and a greedy search algorithm. IEEE Trans. Audio Speech Language Process., 17(6), 1159–1170.

    Google Scholar 

  123. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn: Machine learning in python. J. Mach. Learn. Res., 12(Oct), 2825–2830.

    MathSciNet  MATH  Google Scholar 

  124. Pickens, J. (2000). A comparison of language modeling and probabilistic text information retrieval app.roaches to monophonic music retrieval. In: Proc of the 1st ISMIR Conference, Plymouth, USA.

  125. Pons, J., Lidy, T., Serra, X. (2016). Experimenting with musically motivated convolutional neural networks. In: 2016 14th international workshop on content-based multimedia indexing (CBMI)., pp. 1–6.

  126. Pons, J., Nieto, O., Prockup, M., Schmidt, E.M., Ehmann, A.F., Serra, X. (2018). End-to-end learning for music audio tagging at scale. In: Proc. of the 19th ISMIR conference, Paris, France, pp. 637–644.

  127. Porter, A., Bogdanov, D., Kaye, R., Tsukanov, R., Serra, X. (2015). Acousticbrainz: A community platform for gathering music information obtained from audio. In: Proc. of the 16th ISMIR conference, Malaga, Spain, pp. 786–792.

  128. Prockup, M., Ehmann, A.F., Gouyon, F., Schmidt, E.M., Celma, O., Kim, Y.E. (2015). Modeling genre with the music genome project: Comparing human-labeled attributes and audio features. In: Proc. of the 16th ISMIR conference, Malaga, Spain, pp 31–37.

  129. Rabiner, L.R., & Juang, B.H. (1993). Fundamentals of speech recognition, vol 14. PTR Prentice Hall Englewood Cliffs.

  130. Rodríguez-Algarra, F., Sturm, B.L., Maruri-Aguilar, H. (2016). Analysing scattering-based music content analysis systems: Where’s the music?. In: Proc. of the 17th ISMIR conference, New York City, USA, pp. 344–350.

  131. Rosner, A., & Kostek, B. (2018). Automatic music genre classification based on musical instrument track separation. Journal of Intelligent Information Systems, 50(2), 363–384.

    Google Scholar 

  132. Schedl, M., Flexer, A., Urbano, J. (2013). The neglected user in music information retrieval research. Journal of Intelligent Information Systems, 41(3), 523–539.

    Google Scholar 

  133. Schmidt, E.M., & Kim, Y.E. (2011a). Learning emotion-based acoustic features with deep belief networks. In: 2011 IEEE workshop on applications of signal processing to audio and acoustics (WASPAA), IEEE, pp. 65–68.

  134. Schmidt, E.M., & Kim, Y.E. (2011b). Modeling musical emotion dynamics with conditional random fields. In: Proc. of the 12th ISMIR conference, Miami, USA, pp 777–782.

  135. Schmidt, E.M., & Kim, Y. (2013). Learning rhythm and melody features with deep belief networks. In: Proc. of the 14th ISMIR conference, Curitiba, Brazil, pp. 21–26.

  136. Schuller, B., Hage, C., Schuller, D., Rigoll, G. (2010). ‘mister dj, cheer me up!’: Musical and textual features for automatic mood classification. J. New Music Res., 39(1), 13–34.

    Google Scholar 

  137. Senac, C., Pellegrini, T., Mouret, F., Pinquier, J. (2017). Music feature maps with convolutional neural networks for music genre classification. In: Proc. of the 15th international workshop on content-based multimedia indexing, ACM, pp. 19–23.

  138. Sigtia, S., & Dixon, S. (2014). Improved music feature learning with deep neural networks. In: IEEE ICASSP, pp. 6959–6963.

  139. Silla, Jr, C.N., Koerich, A.L., Kaestner, C.A. (2008). The latin music database. In: Proc. of the 9th ISMIR conference, Philadelphia, USA, pp. 451–456.

  140. Silla, C.N., Koerich, A.L., Kaestner, C.A.A. (2010). Improving automatic music genre classification with hybrid content-based feature vectors. In: Proc. of the 2010 ACM symposium on applied computing. ACM, pp. 1702–1707.

  141. Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In: Proc. of 3rd international conference on learning representations, ICLR, 2015, San Diego, CA, USA.

  142. Smith, E.C., & Lewicki, M.S. (2006). Efficient auditory coding. Nature, 439 (7079), 978–982.

    Google Scholar 

  143. Sturm, B.L. (2012a). An analysis of the gtzan music genre dataset. In: Proc. of the 2nd international ACM workshop on Music information retrieval with user-centered and multimodal strategies, ACM, pp. 7–12.

  144. Sturm, B.L. (2012b). A survey of evaluation in music genre recognition. In: International workshop on adaptive multimedia retrieval, Springer, pp. 29–66.

  145. Sturm, B.L. (2014). The state of the art ten years after a state of the art: Future research in music information retrieval. J. New Music Res., 43(2), 147–172.

    Google Scholar 

  146. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In: Proc. of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826.

  147. Tang, C.P., Chui, K.L., Yu, Y.K., Zeng, Z., Wong, K.H., et al. (2018). Music genre classification using a hierarchical Long Short Term Memory (LSTM) model. In: Proc. of the 3rd international workshop on pattern recognition.

  148. Temperley, D. (2009). A unified probabilistic model for polyphonic music analysis. J. New Music Res., 38(1), 3–18.

    Google Scholar 

  149. Turnbull, D.R., Barrington, L., Lanckriet, G., Yazdani, M. (2009). Combining audio content and social context for semantic music discovery. In: Proc. of the 32nd international ACM SIGIR conference on research and development in information retrieval, ACM, pp. 387–394.

  150. Tzanetakis, G., & Cook, P. (2002). Musical genre classification of audio signals. IEEE Trans. Speech and Audio Process, 10(5), 293–302.

    Google Scholar 

  151. Ulaganathan, A.S., & Ramanna, S. (2018). Granular methods in automatic music genre classification: a case study. J of Intelligent Information Systems pp. 1–21.

  152. Van den Oord, A., Dieleman, S., Schrauwen, B. (2013). Deep content-based music recommendation. In: Advances in neural information processing systems, pp. 2643–2651.

  153. Vigliensoni, G., & Fujinaga, I. (2017). The music listening histories dataset. In: Proc. of the 18th ISMIR conference, Suzhou, China, pp. 96–102.

  154. Vryzas, N., Kotsakis, R., Liatsou, A., Dimoulas, C.A., Kalliris, G. (2018). Speech emotion recognition for performance interaction. J. Audio Eng. Soc., 66(6), 457–467.

  155. Wang, H., & Yeung, D.Y. (2016). Towards bayesian deep learning: a framework and some existing methods. IEEE Trans. Knowledge Data Eng., 28(12), 3395–3408.

  156. Wang, K., An, N., Li, B.N., Zhang, Y., Li, L. (2015). Speech emotion recognition using fourier parameters. IEEE Trans. Affective Comput., 6(1), 69–75.

  157. Wu, T.L., & Jeng, S.K. (2008). Probabilistic estimation of a novel music emotion model. In: International conference on multimedia modeling, Springer, pp. 487–497.

  158. Wu, Y., & Lee, T. (2018).

  159. Xiong, W., Wu, L., Alleva, F., Droppo, J., Huang, X., Stolcke, A. (2018). The Microsoft 2017 conversational speech recognition system. In: Proc. of the IEEE international conference on acoustics, speech and signal processing, ICASSP, IEEE, pp. 5934–5938.

  160. Xu, Y., Kong, Q., Wang, W., Plumbley, M.D. (2017). Surrey-CVSSP system for DCASE2017 challenge task4. arXiv:1709.00551.

  161. Yang, Y.H., & Chen, H.H. (2012). Machine recognition of music emotion: a review. ACM Trans. Intell. Syst. Technol. (TIST), 3(3), 40.

  162. Yang, Y.H., & Liu, J.Y. (2013). Quantitative study of music listening behavior in a social and affective context. IEEE Trans. Multimed., 15(6), 1304–1315.

  163. Yang, D., Chen, T., Zhang, W., Lu, Q., Yu, Y. (2012). Local implicit feedback mining for music recommendation. In: Proc. of the 6th ACM conference on Recommender systems, ACM, pp. 91–98.

  164. Yoshii, K., Goto, M., Komatani, K., Ogata, T., Okuno, H.G. (2008). An efficient hybrid music recommender system using an incrementally trainable probabilistic generative model. IEEE Trans. Audio, Speech, Lang. Process., 16(2), 435–447.

  165. Zangerle, E., Gassler, W., Specht, G. (2012). Exploiting Twitter’s collective knowledge for music recommendations. In: Proc. of the WWW’12 workshop on ‘Making Sense of Microposts’, Lyon, France, April 16, 2012, pp. 14–17.

  166. Zeghidour, N., Usunier, N., Synnaeve, G., Collobert, R., Dupoux, E. (2018). End-to-end speech recognition from the raw waveform. In: Interspeech 2018, 19th annual conference of the international speech communication association, Hyderabad, India, 2018, pp. 781–785.

  167. Zhou, Y., Wang, Z., Fang, C., Bui, T., Berg, T.L. (2018). Visual to sound: Generating natural sound for videos in the wild. In: Proc. of the IEEE conference on computer vision and pattern recognition, pp. 3550–3558.

Acknowledgements

This work has been partially funded by FEDER funds and the Spanish Government (MICINN) through projects SBPLY/17/180501/000493 and TIN2016-77902-C3-1-P.

Author information

Corresponding author

Correspondence to Jaime Ramírez.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Ramírez, J., Flores, M.J. Machine learning for music genre: multifaceted review and experimentation with audioset. J Intell Inf Syst 55, 469–499 (2020). https://doi.org/10.1007/s10844-019-00582-9

Keywords

  • Machine learning
  • Datasets
  • Music information retrieval
  • Classification algorithms
  • Music
  • Feed-forward neural networks