
The Deep Learning Revolution in MIR: The Pros and Cons, the Needs and the Challenges

  • Conference paper
Perception, Representations, Image, Sound, Music (CMMR 2019)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12631)


Abstract

This paper deals with the deep learning revolution in Music Information Research (MIR), i.e. the switch from knowledge-driven hand-crafted systems to data-driven deep-learning systems. To discuss the pros and cons of this revolution, we first review the basic elements of deep learning and explain how those can be used for audio feature learning or for solving difficult MIR tasks. We then discuss the case of hand-crafted features and demonstrate that, while those were indeed shallow and explainable at the start, they tended to become deep, data-driven and unexplainable over time, already before the reign of deep learning. The development of these data-driven approaches was enabled by increasing access to large annotated datasets. We therefore argue that these annotated datasets are today the central and most sustainable element of any MIR research, and we propose new ways to obtain them at scale. Finally, we highlight a set of challenges to be faced by the deep learning revolution in MIR, especially concerning the consideration of music specificities, the explainability of the models (X-AI) and their environmental cost (Green-AI).


Notes

  1. http://mires.cc/files/MIRES_Roadmap_ver_1.0.0.pdf.

  2. “More recently, deep learning techniques have been used for automatic feature learning in MIR tasks, where they have been reported to be superior to the use of hand-crafted feature sets for classification tasks, although these results have not yet been replicated in MIREX evaluations. It should be noted however that automatically generated features might not be musically meaningful, which limits their usefulness.”

  3. such as the adjacent pixels that form a “cat’s ear”.

  4. such as \(\vec{W}_{ij}^{[l]}\) representing a “cat’s ear” detector.

  5. 1 s of an audio signal with a sampling rate of 44 100 Hz is a vector of dimension 44 100.

  6. non-synthetic.

  7. or more elaborate versions of it.

  8. or more elaborate algorithms.

  9. Consider the case of “Blurred Lines” by Pharrell Williams and Robin Thicke and “Got to Give It Up” by Marvin Gaye.

  10. https://research.fb.com/publications/a-universal-music-translation-network/.

  11. The “timbre spaces” are the result of a Multi-Dimensional Scaling (MDS) analysis of similarity/dissimilarity user ratings between pairs of sounds, as obtained through perceptual experiments [78] (see the first sketch after these notes).

  12. The idea of using DL for representation learning in audio was initially proposed in the case of speech, as described in [52].

  13. using an Expectation-Maximization algorithm.

  14. The citation figures are derived from Google Scholar as of December 15th, 2020.

  15. The first period encloses all the models from the “connectionist speech recognition” approaches [14], “tandem features” [48] and “bottleneck features” [45], up to the seminal paper of [52] (which defines the DNN-HMM model as the new baseline for speech recognition systems).

  16. where a sound x(t) is considered as the result of the convolution of a periodic source signal s(t) with a filter h(t): \(x(t) = (s * h)(t)\) (see the second sketch after these notes).

  17. where a sound with a pitch \(f_0\) is represented in the spectral domain as a set of harmonically related components at frequencies \(h f_0, h \in \mathbb{N}^+\), with amplitudes \(a_h\) (see the third sketch after these notes).
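
First sketch (note 11): a minimal Python illustration of how such an MDS timbre space could be computed from pairwise dissimilarity ratings. The rating matrix below is invented for the example, and scikit-learn's MDS is used as one possible implementation; this is not the exact procedure of [78].

    import numpy as np
    from sklearn.manifold import MDS

    # Hypothetical averaged dissimilarity ratings between four sounds
    # (symmetric, zero diagonal); in perceptual studies these come from listener tests.
    dissim = np.array([[0.0, 0.3, 0.8, 0.9],
                       [0.3, 0.0, 0.7, 0.8],
                       [0.8, 0.7, 0.0, 0.2],
                       [0.9, 0.8, 0.2, 0.0]])

    # Embed the sounds in a 2-D "timbre space" whose Euclidean distances
    # approximate the perceived dissimilarities.
    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
    coords = mds.fit_transform(dissim)  # one 2-D point per sound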
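
Second sketch (note 16): the source-filter model x(t) = (s * h)(t) written out numerically. The 16 kHz sampling rate, the 100 Hz impulse-train source and the exponentially decaying filter are illustrative assumptions, not values from the paper.

    import numpy as np

    sr = 16000                           # assumed sampling rate (Hz)
    s = np.zeros(sr)                     # 1 s periodic source s(t):
    s[::sr // 100] = 1.0                 # impulse train with a 100 Hz fundamental

    h = np.exp(-np.arange(256) / 32.0)   # toy filter impulse response h(t)

    x = np.convolve(s, h)[:sr]           # source-filter model: x(t) = (s * h)(t)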
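
Third sketch (note 17): the harmonic model, synthesizing a sound of pitch f0 as a sum of sinusoids at frequencies h*f0 with amplitudes a_h. The 1/h amplitude decay and all parameter values are assumptions chosen for the example.

    import numpy as np

    sr, f0 = 16000, 220.0                 # assumed sampling rate and pitch (Hz)
    t = np.arange(sr) / sr                # 1 s of time samples
    H = 10                                # number of harmonics
    a = 1.0 / np.arange(1, H + 1)         # illustrative amplitudes a_h

    # Harmonic model: sum over h of a_h * sin(2*pi*h*f0*t)
    x = sum(a[h - 1] * np.sin(2 * np.pi * h * f0 * t) for h in range(1, H + 1))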

References

  1. Andén, J., Lostanlen, V., Mallat, S.: Joint time-frequency scattering for audio classification. In: Proceedings of IEEE MLSP (International Workshop on Machine Learning for Signal Processing) (2015)


  2. Arandjelovic, R., Zisserman, A.: Look, listen and learn. In: Proceedings of IEEE ICCV (International Conference on Computer Vision) (2017)


  3. Arandjelović, R., Zisserman, A.: Objects that sound. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11205, pp. 451–466. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01246-5_27


  4. Atlas, L., Shamma, S.A.: Joint acoustic and modulation frequency. EURASIP J. Adv. Signal Process. 2003(7), 1–8 (2003)


  5. Aytar, Y., Vondrick, C., Torralba, A.: Soundnet: learning sound representations from unlabeled video. In: Proceedings of NIPS (Conference on Neural Information Processing Systems) (2016)


  6. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: Proceedings of ICLR (International Conference on Learning Representations) (2015)


  7. Bai, S., Kolter, J.Z., Koltun, V.: An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271 (2018)

  8. Ballet, G., Borghesi, R., Hoffman, P., Lévy, F.: Studio online 3.0: an internet ’killer application’ for remote access to IRCAM sounds and processing tools. In: Proceedings of JIM (Journées d’Informatique Musicale), Issy-Les-Moulineaux, France (1999)

  9. Basaran, D., Essid, S., Peeters, G.: Main melody extraction with source-filter NMF and C-RNN. In: Proceedings of ISMIR (International Society for Music Information Retrieval), Paris, France, 23–27 September 2018


  10. Bertin-Mahieux, T., Ellis, D.P., Whitman, B., Lamere, P.: The million song dataset. In: Proceedings of ISMIR (International Society for Music Information Retrieval), Miami, Florida, USA (2011)


  11. Bittner, R., McFee, B., Salamon, J., Li, P., Bello, J.P.: Deep salience representations for f0 estimation in polyphonic music. In: Proceedings of ISMIR (International Society for Music Information Retrieval), Suzhou, China, 23–27 October 2017


  12. Bittner, R.M., Salamon, J., Tierney, M., Mauch, M., Cannam, C., Bello, J.P.: Medleydb: a multitrack dataset for annotation-intensive MIR research. ISMIR 14, 155–160 (2014)


  13. Bogdanov, D., et al.: Essentia: an audio analysis library for music information retrieval. In: Proceedings of ISMIR (International Society for Music Information Retrieval), Curitiba, PR, Brazil (2013)


  14. Bourlard, H.A., Morgan, N.: Connectionist Speech Recognition A Hybrid Approach, vol. 247. Springer, US (1994)


  15. Bridle, J.S., Brown, M.D.: An experimental automatic word recognition system. JSRU report 1003(5), 33 (1974)


  16. Brown, G.J., Cooke, M.: Computational auditory scene analysis. Comput. Speech Lang. 8(4), 297–336 (1994)


  17. Charbuillet, C., Tardieu, D., Peeters, G.: GMM supervector for content based music similarity. In: Proceeding of DAFx (International Conference on Digital Audio Effects), Paris, France, pp. 425–428, September 2011


  18. Chi, T., Ru, P., Shamma, S.A.: Multiresolution spectrotemporal analysis of complex sounds. J. Acoust. Soc. Am. 118(2), 887–906 (2005)


  19. Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: Proceedings of EMNLP (Conference on Empirical Methods in Natural Language Processing) (2014)


  20. Choi, K., Fazekas, G., Sandler, M.: Automatic tagging using deep convolutional neural networks. In: Proceedings of ISMIR (International Society for Music Information Retrieval), New York, USA (2016)


  21. Cohen-Hadria, A., Roebel, A., Peeters, G.: Improving singing voice separation using deep u-net and wave-u-net with data augmentation. In: Proceeding of EUSIPCO (European Signal Processing Conference), Coruña, Spain, 2–6 September 2019


  22. Defferrard, M., Benzi, K., Vandergheynst, P., Bresson, X.: FMA: a dataset for music analysis. In: Proceedings of ISMIR (International Society for Music Information Retrieval), Suzhou, China, 23–27 October 2017

  23. Dehak, N., Kenny, P., Dehak, R., Dumouchel, P., Ouellet, P.: Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process. 19(4), 788–798 (2011)


  24. Dhariwal, P., Jun, H., Payne, C., Kim, J.W., Radford, A., Sutskever, I.: Jukebox: a generative model for music. arXiv preprint arXiv:2005.00341 (2020)

  25. Dieleman, S.: Recommending music on spotify with deep learning. Technical report (2014). http://benanne.github.io/2014/08/05/spotify-cnns.html

  26. Dieleman, S., Schrauwen, B.: End-to-end learning for music audio. In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6964–6968. IEEE (2014)


  27. Doras, G., Peeters, G.: Cover detection using dominant melody embeddings. In: Proceeding of ISMIR (International Society for Music Information Retrieval), Delft, The Netherlands, 4–8 November 2019


  28. Doras, G., Peeters, G.: A prototypical triplet loss for cover detection. In: Proceedings of IEEE ICASSP (International Conference on Acoustics, Speech, and Signal Processing), Barcelona, Spain, 4–8 May 2020


  29. Doras, G., Yesiler, F., Serra, J., Gomez, E., Peeters, G.: Combining musical features for cover detection. In: Proceeding of ISMIR (International Society for Music Information Retrieval), Montreal, Canada, 11–15 October 2020


  30. Durrieu, J.L., Richard, G., David, B., Févotte, C.: Source/filter model for unsupervised main melody extraction from polyphonic audio signals. IEEE Trans. Audio Speech Lang. Process. 18(3), 564–575 (2010)


  31. Eghbal-zadeh, H., Lehner, B., Schedl, M., Gerhard, W.: I-vectors for timbre-based music similarity and music artist classification. In: Proceeding of ISMIR (International Society for Music Information Retrieval), Malaga, Spain (2015)


  32. Elizalde, B., Lei, H., Friedland, G.: An i-vector representation of acoustic environments for audio-based video event detection on user generated content. In: 2013 IEEE International Symposium on Multimedia, pp. 114–117. IEEE (2013)


  33. Ellis, D.P.W., Zeng, X., McDermott, J.: Classifying soundtracks with audio texture features. In: Proceedings of IEEE ICASSP (International Conference on Acoustics, Speech, and Signal Processing), pp. 5880–5883. IEEE (2011)


  34. Emiya, V., Badeau, R., David, B.: Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle. IEEE Trans. Audio Speech Lang. Process. 18(6), 1643–1654 (2010)


  35. Engel, J., Hantrakul, L., Gu, C., Roberts, A.: DDSP: differentiable digital signal processing. In: Proceeding of ICLR (International Conference on Learning Representations) (2020)


  36. Esling, P., Bazin, T., Bitton, A., Carsault, T., Devis, N.: Ultra-light deep MIR by trimming lottery tickets. In: Proceeding of ISMIR (International Society for Music Information Retrieval), Montreal, Canada, 11–15 October 2020

  37. Frankle, J., Carbin, M.: The lottery ticket hypothesis: finding sparse, trainable neural networks. In: Proceeding of ICLR (International Conference on Learning Representations) (2019)


  38. Fujishima, T.: Realtime chord recognition of musical sound: a system using common lisp music. In: Proceedings of ICMC (International Computer Music Conference), pp. 464–467. Beijing, China (1999)


  39. Fukushima, K., Miyake, S.: Neocognitron: a self-organizing neural network model for a mechanism of visual pattern recognition. In: Amari, S., Arbib, M.A. (eds.) Competition and Cooperation in Neural Nets, pp. 267–285. Springer, Heidelberg (1982)


  40. Gfeller, B., Frank, C., Roblek, D., Sharifi, M., Tagliasacchi, M., Velimirović, M.: Spice: self-supervised pitch estimation. IEEE/ACM Trans. Audio Speech Lang. Process. 28, 1118–1128 (2020)


  41. Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, pp. 2672–2680 (2014)


  42. Goto, M.: Aist annotation for the RWC music database. In: Proceedings of ISMIR (International Society for Music Information Retrieval), Victoria, BC, Canada, pp. 359–360 (2006)


  43. Goto, M., Hashiguchi, H., Nishimura, T., Oka, R.: RWC music database: popular, classical, and jazz music databases. In: Proceeding of ISMIR (International Society for Music Information Retrieval), Paris, France, pp. 287–288 (2002)


  44. Greenberg, S., Kingsbury, B.E.: The modulation spectrogram: In pursuit of an invariant representation of speech. In: Proceedings of IEEE ICASSP (International Conference on Acoustics, Speech, and Signal Processing), vol. 3, pp. 1647–1650. IEEE (1997)


  45. Grézl, F., Karafiát, M., Kontár, S., Cernocky, J.: Probabilistic and bottle-neck features for lvcsr of meetings. In: Proceedings of IEEE ICASSP (International Conference on Acoustics, Speech, and Signal Processing), vol. 4, pp. IV-757. IEEE (2007)


  46. Henderson, P., Hu, J., Romoff, J., Brunskill, E., Jurafsky, D., Pineau, J.: Towards the systematic reporting of the energy and carbon footprints of machine learning. arXiv preprint arXiv:2002.05651 (2020)

  47. Hendricks, L.A., Akata, Z., Rohrbach, M., Donahue, J., Schiele, B., Darrell, T.: Generating visual explanations. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 3–19. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_1


  48. Hermansky, H., Ellis, D.P., Sharma, S.: Tandem connectionist feature extraction for conventional hmm systems. In: Proceedings of IEEE ICASSP (International Conference on Acoustics, Speech, and Signal Processing), vol. 3, pp. 1635–1638. IEEE (2000)


  49. Herrera, P., Yeterian, A., Gouyon, F.: Automatic classification of drum sounds: a comparison of feature selection methods and classification techniques. In: Proceedings of ICMAI (International Conference on Music and Artificial Intelligence), Edinburgh, Scotland (2002)


  50. Herrera, P.: MIRages: an account of music audio extractors, semantic description and context-awareness, in the three ages of MIR. Ph.D. thesis, Music Technology Group (MTG), Universitat Pompeu Fabra, Barcelona (2018)


  51. Hiller Jr, L.A., Isaacson, L.M.: Musical composition with a high speed digital computer. In: Audio Engineering Society Convention 9. Audio Engineering Society (1957)


  52. Hinton, G., et al.: Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process. Mag. 29(6), 82–97 (2012)


  53. Hinton, G.E., Osindero, S., Teh, Y.W.: A fast learning algorithm for deep belief nets. Neural Comput. 18(7), 1527–1554 (2006)


  54. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)


  55. Huang, G.B., Zhu, Q.Y., Siew, C.K.: Extreme learning machine: theory and applications. Neurocomputing 70(1–3), 489–501 (2006)


  56. Humphrey, E.J.: Tutorial: Deep learning in music informatics, demystifying the dark art, Part III - practicum. In: Proceeding of ISMIR (International Society for Music Information Retrieval), Curitiba, PR, Brazil (2013)


  57. Humphrey, E.J., Bello, J.P., LeCun, Y.: Moving beyond feature design: deep architectures and automatic feature learning in music informatics. In: Proceedings of ISMIR (International Society for Music Information Retrieval), Porto, Portugal (2012)


  58. Jansson, A.: Musical Source Separation with Deep Learning and Large-Scale Datasets. Ph.D. thesis, City University of London (2020)

  59. Jansson, A., Humphrey, E.J., Montecchio, N., Bittner, R., Kumar, A., Weyde, T.: Singing voice separation with deep u-net convolutional networks. In: Proceedings of ISMIR (International Society for Music Information Retrieval), Suzhou, China, 23–27 October 2017


  60. Jensen, K., Arnspang, K.: Binary decision tree classification of musical sounds. In: Proceedings of ICMC (International Computer Music Conference), Beijing, China (1999)

  61. Kenny, P., Boulianne, G., Ouellet, P., Dumouchel, P.: Speaker and session variability in GMM-based speaker verification. IEEE Trans. Audio Speech Lang. Process. 15(4), 1448–1460 (2007)


  62. Kereliuk, C., Sturm, B.L., Larsen, J.: Deep learning and music adversaries. IEEE Trans. Multimedia 17(11), 2059–2071 (2015)


  63. Kim, T., Lee, J., Nam, J.: Sample-level CNN architectures for music auto-tagging using raw waveforms. In: Proceedings of IEEE ICASSP (International Conference on Acoustics, Speech, and Signal Processing) (2018)


  64. Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. In: Proceedings of ICLR (International Conference on Learning Representations) (2013)


  65. Korzeniowski, F., Widmer, G.: Feature learning for chord recognition: The deep chroma extractor. In: Proceedings of ISMIR (International Society for Music Information Retrieval), New York, USA, 7–11 August 2016


  66. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Proceedings of NIPS (Conference on Neural Information Processing Systems), pp. 1097–1105 (2012)


  67. Lartillot, O., Toiviainen, P.: A matlab toolbox for musical feature extraction from audio. In: Proceeding of DAFx (International Conference on Digital Audio Effects), pp. 237–244. Bordeaux (2007)


  68. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)


  69. Lee, J., Park, J., Kim, K.L., Nam, J.: Sample-level deep convolutional neural networks for music auto-tagging using raw waveforms. arXiv preprint arXiv:1703.01789 (2017)

  70. Leveau, P., Vincent, E., Richard, G., Daudet, L.: Instrument-specific harmonic atoms for mid-level music representation. IEEE Trans. Audio Speech Lang. Process. 16(1), 116–128 (2007)


  71. Liu, G., Hansen, J.H.: An investigation into back-end advancements for speaker recognition in multi-session and noisy enrollment scenarios. IEEE Trans. Audio Speech Lang. Process. 22(12), 1978–1992 (2014)


  72. Logan, B.: Mel frequency cepstral coefficients for music modeling. In: Proceedings of ISMIR (International Society for Music Information Retrieval), Plymouth, Massachusetts, USA (2000)


  73. Lundberg, S.M., et al.: Explainable AI for trees: from local explanations to global understanding. arXiv preprint arXiv:1905.04610 (2019)

  74. Mallat, S.: Understanding deep convolutional networks. Philosophical Transactions of the Royal Society A (2016)

  75. Marchetto, E., Peeters, G.: Automatic recognition of sound categories from their vocal imitation using audio primitives automatically found by SI-PLCA and HMM. In: Aramaki, M., Davies, M.E.P., Kronland-Martinet, R., Ystad, S. (eds.) CMMR 2017. LNCS, vol. 11265, pp. 3–22. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01692-0_1


  76. Mathieu, B., Essid, S., Fillon, T., Prado, J., Richard, G.: Yaafe, an easy to use and efficient audio feature extraction software. In: Proceedings of ISMIR (International Society for Music Information Retrieval), pp. 441–446. Utrecht, The Netherlands (2010)


  77. Mauch, M., et al.: Omras2 metadata project 2009. In: Late-Breaking/Demo Session of ISMIR (International Society for Music Information Retrieval), Kobe, Japan (2009)


  78. McAdams, S., Winsberg, S., Donnadieu, S., De Soete, G., Krimphoff, J.: Perceptual scaling of synthesized musical timbres: common dimensions, specificities, and latent subject classes. Psychol. Res. 58, 177–192 (1995)

  79. McDermott, J., Simoncelli, E.: Sound texture perception via statistics of the auditory periphery: evidence from sound synthesis. Neuron 71(5), 926–940 (2011)


  80. McFee, B., et al.: librosa: audio and music signal analysis in python. In: Proceedings of the 14th Python in Science Conference, vol. 8, pp. 18–25 (2015)


  81. Meseguer Brocal, G., Cohen-Hadria, A., Peeters, G.: Dali: a large dataset of synchronized audio, lyrics and pitch, automatically created using teacher-student. In: Proceedings of ISMIR (International Society for Music Information Retrieval), Paris, France, 23–27 September 2018


  82. Meseguer Brocal, G., Peeters, G.: Creation of a large dataset of synchronised audio, lyrics and notes, automatically created using a teacher-student paradigm. Trans. Int. Soc. Music Inf. Retrieval 3(1), 55–67 (2020). https://doi.org/10.5334/tismir.30

  83. Mor, N., Wolf, L., Polyak, A., Taigman, Y.: A universal music translation network. In: Proceedings of ICLR (International Conference on Learning Representations) (2019)


  84. Nieto, O., McCallum, M., Davies, M., Robertson, A., Stark, A., Egozy, E.: The Harmonix set: beats, downbeats, and functional segment annotations of western popular music. In: Proceedings of ISMIR (International Society for Music Information Retrieval), Delft, The Netherlands, 4–8 November 2019

  85. Noé, P.G., Parcollet, T., Morchid, M.: Cgcnn: Complex gabor convolutional neural network on raw speech. In: Proceedings of IEEE ICASSP (International Conference on Acoustics, Speech, and Signal Processing), Barcelona, Spain, 4–8 May 2020


  86. van den Oord, A., et al.: WaveNet: a generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016)

  87. Opolko, F., Wapnick, J.: McGill University Master Samples CD-ROM for SampleCell, volume 1 (1991)

  88. Pachet, F., Zils, A.: Automatic extraction of music descriptors from acoustic signals. In: Proceedings of ISMIR (International Society for Music Information Retrieval), Barcelona (Spain) (2004)


  89. Parekh, J., Mozharovskyi, P., d’Alche Buc, F.: A framework to learn with interpretation. arXiv preprint arXiv:2010.09345 (2020)

  90. Peeters, G.: A large set of audio features for sound description (similarity and classification) in the cuidado project. Cuidado project report, Ircam (2004)


  91. Peeters, G., Rodet, X.: Hierarchical Gaussian tree with inertia ratio maximization for the classification of large musical instrument database. In: Proceedings of DAFx (International Conference on Digital Audio Effects), pp. 318–323, London, UK (2003)

  92. Pons, J., Lidy, T., Serra, X.: Experimenting with musically motivated convolutional neural networks. In: Proceedings of IEEE CBMI (International Workshop on Content-Based Multimedia Indexing) (2016)


  93. Pons, J., Serra, X.: Randomly weighted cnns for (music) audio classification. In: Proceedings of IEEE ICASSP (International Conference on Acoustics, Speech, and Signal Processing) (2019)


  94. Ramona, M., Richard, G., David, B.: Vocal detection in music with support vector machines. In: Proceedings of IEEE ICASSP (International Conference on Acoustics, Speech, and Signal Processing), Las Vegas, Nevada, USA, pp. 1885–1888 (2008)


  95. Ravanelli, M., Bengio, Y.: Speaker recognition from raw waveform with sincnet. In: 2018 IEEE Spoken Language Technology Workshop (SLT). pp. 1021–1028. IEEE (2018)


  96. Reynolds, D., Quatieri, T., Dunn, R.: Speaker verification using adapted gaussian mixture models. Digit. Signal Proc. 10(1–3), 19–41 (2000)


  97. Ribeiro, M.T., Singh, S., Guestrin, C.: “Why should I trust you?” Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144 (2016)

  98. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28


  99. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Nature 323(6088), 533–536 (1986)


  100. Sainath, T.N.: Towards end-to-end speech recognition using deep neural networks. In: Proceedings of ICML (International Conference on Machine Learning) (2015)


  101. Sainath, T.N., Vinyals, O., Senior, A., Sak, H.: Convolutional, long short-term memory, fully connected deep neural networks. In: Proceedings of IEEE ICASSP (International Conference on Acoustics, Speech, and Signal Processing), pp. 4580–4584. IEEE (2015)


  102. Sainath, T.N., Weiss, R.J., Senior, A., Wilson, K.W., Vinyals, O.: Learning the speech front-end with raw waveform CLDNNS. In: Sixteenth Annual Conference of the International Speech Communication Association (2015)


  103. Saxe, A.M., Koh, P.W., Chen, Z., Bhand, M., Suresh, B., Ng, A.Y.: On random weights and unsupervised feature learning. In: Proceeding of ICML (International Conference on Machine Learning), vol. 2, p. 6 (2011)


  104. Schreiner, C.E., Urbas, J.V.: Representation of amplitude modulation in the auditory cortex of the cat. I. The anterior auditory field (AAF). Hearing Res. 21(3), 227–241 (1986)

  105. Schroff, F., Kalenichenko, D., Philbin, J.: Facenet: A unified embedding for face recognition and clustering. In: Proceedings of IEEE CVPR (Conference on Computer Vision and Pattern Recognition), pp. 815–823 (2015)


  106. Schwartz, R., Dodge, J., Smith, N.A., Etzioni, O.: Green AI. Commun. ACM 63, 54–63 (2020)

  107. Serrà, J., Gomez, E., Herrera, P., Serra, X.: Chroma binary similarity and local alignment applied to cover song identification. IEEE Trans. Audio Speech Lang. Process. (2008)


  108. Serra, X., et al.: Roadmap for Music Information Research. Creative Commons BY-NC-ND 3.0 license (2013). ISBN: 978-2-9540351-1-6


  109. Serra, X., Smith, J.: Spectral modeling synthesis: a sound analysis/synthesis system based on a deterministic plus stochastic decomposition. Comput. Music J. 14(4), 12–24 (1990)


  110. Seyerlehner, K.: Content-based music recommender systems: beyond simple frame-level audio similarity. Ph.D. thesis, Johannes Kepler Universität, Linz, Austria, December 2010


  111. Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: Visualising image classification models and saliency maps. In: Proceedings of ICLR (International Conference on Learning Representations) (2014)


  112. Smaragdis, P., Brown, J.C.: Non-negative matrix factorization for polyphonic music transcription. In: Proceedings of IEEE WASPAA (Workshop on Applications of Signal Processing to Audio and Acoustics), New Paltz, NY, USA, pp. 177–180. IEEE (2003)


  113. Smaragdis, P., Venkataramani, S.: A neural network alternative to non-negative audio models. In: Proceedings of IEEE ICASSP (International Conference on Acoustics, Speech, and Signal Processing). pp. 86–90. IEEE (2017)


  114. Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for deep learning in NLP. Proceedings of ACL (Conference of the Association for Computational Linguistics) (2019)


  115. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Proceedings of NIPS (Conference on Neural Information Processing Systems), pp. 3104–3112 (2014)


  116. Szegedy, C., et al.: Intriguing properties of neural networks. In: Proceedings of ICLR (International Conference on Learning Representations) (2013)


  117. Tzanetakis, G., Cook, P.: Marsyas: a framework for audio analysis. Organised Sound 4(3) (1999)

  118. Tzanetakis, G., Cook, P.: Musical genre classification of audio signals. IEEE Trans. Speech Audio Process. 10(5), 293–302 (2002)


  119. Vaswani, A., et al.: Attention is all you need. In: Proceedings of NIPS (Conference on Neural Information Processing Systems), pp. 5998–6008 (2017)


  120. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator. In: Proceedings of IEEE CVPR (Conference on Computer Vision and Pattern Recognition), pp. 3156–3164 (2015)


  121. Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., Lang, K.J.: Phoneme recognition using time-delay neural networks. In: Readings in Speech Recognition, pp. 393–404. Elsevier (1990)


  122. Wakefield, G.H.: Mathematical representation of joint time-chroma distributions. In: Proceedings of SPIE conference on Advanced Signal Processing Algorithms, Architecture and Implementations, Denver, Colorado, USA, pp. 637–645 (1999)


  123. Won, M., Chun, S., Nieto, O., Serra, X.: Data-driven harmonic filters for audio representation learning. In: Proceedings of IEEE ICASSP (International Conference on Acoustics, Speech, and Signal Processing), Barcelona, Spain, 4–8 May 2020


  124. Wu, C.W., Lerch, A.: Automatic drum transcription using the student-teacher learning paradigm with unlabeled music data. In: Proceedings of ISMIR (International Society for Music Information Retrieval), Suzhou, China, 23–27 October 2017


  125. Zalkow, F., Müller, M.: Using weakly aligned score-audio pairs to train deep chroma models for cross-modal music retrieval. In: Proceedings of ISMIR (International Society for Music Information Retrieval), Montreal, Canada, 11–15 October 2020, pp. 184–191


  126. Zhao, H., Gan, C., Rouditchenko, A., Vondrick, C., McDermott, J., Torralba, A.: The Sound of Pixels. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11205, pp. 587–604. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01246-5_35



Acknowledgments

The content of this paper is partly based on a set of keynotes given at the “Berlin Interdisciplinary Workshop on Timbre” (2017/01), at the “Digital Music Research Network Workshop” in London (2018/12), at the “International Symposium on Computer Music Multidisciplinary Research” in Marseille (2019/10) and at the “Collège de France” in Paris (2020/02). The author is grateful to the respective organizers for their invitations. The author is also grateful to Giorgia Cantisani and Florence d’Alché-Buc for sharing materials and Laure Pretet for proof-reading.

Author information


Corresponding author

Correspondence to Geoffroy Peeters.



Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Peeters, G. (2021). The Deep Learning Revolution in MIR: The Pros and Cons, the Needs and the Challenges. In: Kronland-Martinet, R., Ystad, S., Aramaki, M. (eds) Perception, Representations, Image, Sound, Music. CMMR 2019. Lecture Notes in Computer Science, vol 12631. Springer, Cham. https://doi.org/10.1007/978-3-030-70210-6_1


  • DOI: https://doi.org/10.1007/978-3-030-70210-6_1


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-70209-0

  • Online ISBN: 978-3-030-70210-6

  • eBook Packages: Computer Science, Computer Science (R0)
