As we look to advance the state of the art in content-based music informatics, there is a general sense that progress is decelerating throughout the field. On closer inspection, performance trajectories across several applications reveal that this is indeed the case, raising some difficult questions for the discipline: why are we slowing down, and what can we do about it? Here, we strive to address both of these concerns. First, we critically review the standard approach to music signal analysis and identify three specific deficiencies of current methods: hand-crafted feature design is sub-optimal and unsustainable, the power of shallow architectures is fundamentally limited, and short-time analysis cannot encode musically meaningful structure. Acknowledging breakthroughs in other perceptual AI domains, we propose that deep learning holds the potential to overcome each of these obstacles. Through conceptual arguments for feature learning and deeper processing architectures, we demonstrate how deep processing models are more powerful extensions of current methods, and why now is the time for this paradigm shift. Finally, we conclude with a discussion of current challenges and potential impact, to further motivate exploration of this promising research area.
Cite this article
Humphrey, E.J., Bello, J.P. & LeCun, Y. Feature learning and deep architectures: new directions for music informatics. J Intell Inf Syst 41, 461–481 (2013). https://doi.org/10.1007/s10844-013-0248-5
- Music informatics
- Deep learning
- Signal processing