Feature learning and deep architectures: new directions for music informatics
Abstract
As we look to advance the state of the art in content-based music informatics, there is a general sense that progress is decelerating throughout the field. On closer inspection, performance trajectories across several applications reveal that this is indeed the case, raising some difficult questions for the discipline: why are we slowing down, and what can we do about it? Here, we strive to address both of these concerns. First, we critically review the standard approach to music signal analysis and identify three specific deficiencies of current methods: hand-crafted feature design is sub-optimal and unsustainable, the power of shallow architectures is fundamentally limited, and short-time analysis cannot encode musically meaningful structure. Acknowledging breakthroughs in other perceptual AI domains, we propose that deep learning holds the potential to overcome each of these obstacles. Through conceptual arguments for feature learning and deeper processing architectures, we demonstrate how deep processing models are more powerful extensions of current methods, and why now is the time for this paradigm shift. Finally, we conclude with a discussion of current challenges and potential impact, to further motivate exploration of this promising research area.
Keywords
Music informatics, Deep learning, Signal processing