Deep learning of chroma representation for cover song identification in compression domain

Multidimensional Systems and Signal Processing

Abstract

Cover song identification methods typically compare the chroma features of a query song with those of each song in the data set. These pairwise comparisons are time-consuming. In addition, to save disk space, most songs in the data set are stored in a compressed format. Therefore, to eliminate some decoding procedures, this study extracted music information directly from the modified discrete cosine transform (MDCT) coefficients of advanced audio coding (AAC) and mapped these coefficients to 12-dimensional chroma features. The chroma features were segmented to preserve the melodies. A sparse autoencoder, a deep learning architecture of artificial neural networks, was trained on the chroma feature segments to transform them into an intermediate representation of reduced dimension. Experimental results on the covers80 data set showed that the mean reciprocal rank increased to 0.5 and the matching time was reduced by over 94% compared with traditional approaches.
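
The mapping from modified discrete cosine transform coefficients to a 12-dimensional chroma vector can be illustrated with a minimal sketch. The abstract does not specify the mapping, so this is not the authors' exact procedure. It assumes uniformly spaced transform bins, a 44.1 kHz sample rate, equal-tempered tuning with A4 = 440 Hz, and an analysis band of 65 Hz to 2000 Hz. The function names and all parameter values are hypothetical choices made for illustration.

```python
import math


def mdct_bin_frequency(k: int, num_bins: int, sample_rate: float) -> float:
    """Centre frequency in Hz of bin k, assuming bins uniformly cover 0..fs/2."""
    return (k + 0.5) * sample_rate / (2.0 * num_bins)


def pitch_class(freq_hz: float, ref_a4: float = 440.0) -> int:
    """Pitch class 0..11 (0 = C, 9 = A) of the nearest equal-tempered note."""
    midi_note = round(69 + 12 * math.log2(freq_hz / ref_a4))
    return midi_note % 12


def mdct_to_chroma(coeffs, sample_rate=44100.0, fmin=65.0, fmax=2000.0):
    """Fold one frame of transform coefficients into a 12-bin chroma vector.

    The energy (squared coefficient) of every bin inside [fmin, fmax] is
    added to the pitch class of that bin's centre frequency, and the result
    is normalised to sum to 1. The band limits are illustrative assumptions.
    """
    num_bins = len(coeffs)
    chroma = [0.0] * 12
    for k, c in enumerate(coeffs):
        f = mdct_bin_frequency(k, num_bins, sample_rate)
        if fmin <= f <= fmax:
            chroma[pitch_class(f)] += c * c
    total = sum(chroma)
    # An all-zero (silent) frame is returned unchanged to avoid dividing by 0.
    return [v / total for v in chroma] if total > 0 else chroma


# Example: a frame of 1024 coefficients with energy only in bin 20,
# whose centre frequency is about 441 Hz under the assumptions above.
frame = [0.0] * 1024
frame[20] = 3.0
chroma = mdct_to_chroma(frame)
```

In this example all of the energy falls into pitch class 9 (A), so `chroma[9]` is 1.0 and every other entry is 0. Because the features come straight from the transform coefficients, the full decoding of the compressed stream back to a time-domain waveform is not needed, which is the saving the abstract refers to.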



Corresponding author

Correspondence to Pao-Chi Chang.


Cite this article

Fang, JT., Chang, YR. & Chang, PC. Deep learning of chroma representation for cover song identification in compression domain. Multidim Syst Sign Process 29, 887–902 (2018). https://doi.org/10.1007/s11045-017-0476-x


  • DOI: https://doi.org/10.1007/s11045-017-0476-x
