
Feature learning and deep architectures: new directions for music informatics


As we look to advance the state of the art in content-based music informatics, there is a general sense that progress is decelerating throughout the field. On closer inspection, performance trajectories across several applications reveal that this is indeed the case, raising some difficult questions for the discipline: why are we slowing down, and what can we do about it? Here, we strive to address both of these concerns. First, we critically review the standard approach to music signal analysis and identify three specific deficiencies of current methods: hand-crafted feature design is sub-optimal and unsustainable, the power of shallow architectures is fundamentally limited, and short-time analysis cannot encode musically meaningful structure. Acknowledging breakthroughs in other perceptual AI domains, we propose that deep learning holds the potential to overcome each of these obstacles. Through conceptual arguments for feature learning and deeper processing architectures, we demonstrate how deep processing models are more powerful extensions of current methods, and why now is the time for this paradigm shift. Finally, we conclude with a discussion of current challenges and of the potential impact of this approach, to further motivate exploration of this promising research area.




  1. Music Information Retrieval Evaluation eXchange (MIREX):

  2. Million Song Dataset.

  3. MIR Toolbox, Chroma Toolbox, MARSYAS, Echonest API.




Author information



Corresponding author

Correspondence to Eric J. Humphrey.


About this article

Cite this article

Humphrey, E.J., Bello, J.P. & LeCun, Y. Feature learning and deep architectures: new directions for music informatics. J Intell Inf Syst 41, 461–481 (2013).



  • Music informatics
  • Deep learning
  • Signal processing