Deep learning of chroma representation for cover song identification in compression domain

Fang, Jiunn-Tsair; Chang, Yu-Ruey; Chang, Pao-Chi

doi:10.1007/s11045-017-0476-x


Published: 21 February 2017

Volume 29, pages 887–902 (2018)

Multidimensional Systems and Signal Processing

Jiunn-Tsair Fang¹,
Yu-Ruey Chang² &
Pao-Chi Chang²


Abstract

Methods for identifying a cover song typically compare the chroma features of the query song with those of every other song in the data set, and these pairwise comparisons take considerable time. In addition, to save disk space, most songs in the data set are stored in a compressed format. Therefore, to eliminate some decoding procedures, this study extracted music information directly from the modified discrete cosine transform (MDCT) coefficients of advanced audio coding (AAC) and mapped these coefficients to 12-dimensional chroma features. The chroma features were segmented to preserve the melodies. A sparse autoencoder, a deep learning architecture of artificial neural networks, was trained on the chroma feature segments to transform them into an intermediate representation of reduced dimension. Experimental results on the covers80 data set showed that the mean reciprocal rank increased to 0.5 and the matching time was reduced by over 94% compared with traditional approaches.
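The two quantities named in the abstract, a 12-dimensional chroma vector derived from MDCT coefficients and the mean reciprocal rank, can be illustrated with a minimal sketch. This preview does not give the authors' actual mapping, so everything specific here is an assumption made for illustration: the bin centre frequency formula (k + 0.5) · fs / (2N), the 44.1 kHz sampling rate, the 65–2000 Hz analysis range, the energy weighting, and all function names. It is not the paper's implementation, and the sparse autoencoder stage is not shown.

```python
import math

def bin_to_pitch_class(k, n_bins, sample_rate=44100.0, ref_hz=440.0):
    """Map MDCT bin k to a pitch class 0..11 (0 = C, 9 = A).

    Assumption: bin k of an N-coefficient MDCT is centred at
    (k + 0.5) * fs / (2 * N).
    """
    freq = (k + 0.5) * sample_rate / (2.0 * n_bins)
    # MIDI note number, with 69 corresponding to ref_hz (A4).
    midi = 69.0 + 12.0 * math.log2(freq / ref_hz)
    return int(round(midi)) % 12

def chroma_from_mdct(coeffs, sample_rate=44100.0, fmin=65.0, fmax=2000.0):
    """Fold one frame of MDCT coefficients into a normalized chroma vector.

    Assumption: each bin contributes its energy (coefficient squared)
    to the pitch class nearest its centre frequency.
    """
    n_bins = len(coeffs)
    chroma = [0.0] * 12
    for k, c in enumerate(coeffs):
        freq = (k + 0.5) * sample_rate / (2.0 * n_bins)
        if fmin <= freq <= fmax:
            chroma[bin_to_pitch_class(k, n_bins, sample_rate)] += c * c
    total = sum(chroma)
    return [v / total for v in chroma] if total > 0 else chroma

def mean_reciprocal_rank(ranks):
    """Average of 1/rank, where rank is the 1-based position of the
    first correct cover in each query's result list."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Hypothetical frame: 1024 coefficients with all energy in bin 20 (about 441 Hz).
frame = [0.0] * 1024
frame[20] = 2.0
chroma = chroma_from_mdct(frame)
mrr = mean_reciprocal_rank([1, 2, 4])
```

In this toy frame all energy falls into pitch class A (index 9). An MRR of 0.5, as reported in the abstract, corresponds to the first correct cover appearing at rank 2 on average in the reciprocal sense.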



Author information

Authors and Affiliations

Department of Electronic Engineering, Ming Chuan University, No.5, Deming Rd., Taoyuan City, 33348, Taiwan
Jiunn-Tsair Fang
Department of Communication Engineering, National Central University, No.300, Jhongda Rd., Taoyuan City, 32001, Taiwan
Yu-Ruey Chang & Pao-Chi Chang


Corresponding author

Correspondence to Pao-Chi Chang.


About this article

Cite this article

Fang, JT., Chang, YR. & Chang, PC. Deep learning of chroma representation for cover song identification in compression domain. Multidim Syst Sign Process 29, 887–902 (2018). https://doi.org/10.1007/s11045-017-0476-x


Received: 26 February 2016
Revised: 25 October 2016
Accepted: 10 February 2017
Published: 21 February 2017
Issue Date: July 2018
DOI: https://doi.org/10.1007/s11045-017-0476-x
