
A method of music autotagging based on audio and lyrics

Multimedia Tools and Applications


With the development of the Internet and related technologies, online music platforms and music streaming services are flourishing. Information overload caused by the abundance of digital music has become a common problem for many users. Social tags have been discussed as a useful aid to music recommendation. However, label sparsity and the cold-start problem, both commonly observed with social tags, limit their effectiveness in supporting recommendation systems. A music autotagging system is therefore an alternative solution for supplementing a shortage of tags. Most prior studies on automatic labeling analyzed only audio data. However, some studies have suggested that lyrics give a music classification system additional information and improve its overall accuracy. Audio data remain an equally important resource for finding music features. This paper therefore proposes a music autotagging system that relies on both audio and lyrics to address the above problems. With the development of deep learning algorithms in recent years, many scholars have effectively used neural networks to extract audio and textual features. Some have also taken the structure of lyrics into account when extracting features, which in turn improves classification. For lyric feature extraction, this study employs two types of deep learning models: convolutional neural networks (CNNs) and recurrent neural networks (RNNs). The feature extraction architecture is motivated and shaped mainly by the structure of lyrics. In addition, a multitask learning method is adopted to learn correlations between tags. The experiments indicate that a multitask learning classifier that combines audio and lyric information performs better than the single-task, audio-only classification methods used in previous studies.
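As a rough illustration of the idea summarized above (fusing audio and lyric features and predicting all tags jointly), the following minimal sketch uses hard parameter sharing: one hidden layer shared by all tags feeds a separate sigmoid output per tag. It is not the authors' architecture. The class name `MultiTaskTagger`, the tag names, the layer sizes, the feature values, and the untrained random weights are all hypothetical placeholders, and the CNN and RNN feature extractors are assumed to have already produced fixed-length vectors.

```python
import math
import random

# Hypothetical tag vocabulary; the paper's actual tag set is not shown here.
TAGS = ["rock", "sad", "female vocalist"]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(vector, weights, biases):
    """Affine layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, biases)]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class MultiTaskTagger:
    """Shared hidden layer for all tags plus one sigmoid head per tag."""

    def __init__(self, audio_dim, lyric_dim, hidden_dim, tags, seed=0):
        rng = random.Random(seed)
        self.tags = list(tags)
        in_dim = audio_dim + lyric_dim
        # Shared parameters: every tag's prediction depends on them, which is
        # how related tags can influence one another during training.
        self.w_shared = random_matrix(hidden_dim, in_dim, rng)
        self.b_shared = [0.0] * hidden_dim
        # Task-specific parameters: one output row per tag.
        self.w_heads = random_matrix(len(self.tags), hidden_dim, rng)
        self.b_heads = [0.0] * len(self.tags)

    def predict(self, audio_features, lyric_features):
        # Fuse the two modalities by simple concatenation (an assumption).
        fused = list(audio_features) + list(lyric_features)
        hidden = [max(0.0, h)  # ReLU
                  for h in dense(fused, self.w_shared, self.b_shared)]
        logits = dense(hidden, self.w_heads, self.b_heads)
        # Independent sigmoids: a track may carry several tags at once.
        return {tag: sigmoid(z) for tag, z in zip(self.tags, logits)}


def multilabel_loss(probabilities, true_tags):
    """Sum of per-tag binary cross-entropies, the joint multitask objective."""
    eps = 1e-12
    loss = 0.0
    for tag, p in probabilities.items():
        y = 1.0 if tag in true_tags else 0.0
        loss -= y * math.log(p + eps) + (1.0 - y) * math.log(1.0 - p + eps)
    return loss


# Placeholder feature vectors standing in for extracted audio and lyric features.
tagger = MultiTaskTagger(audio_dim=4, lyric_dim=3, hidden_dim=5, tags=TAGS)
scores = tagger.predict([0.2, -0.1, 0.7, 0.0], [0.5, 0.3, -0.4])
loss = multilabel_loss(scores, {"rock"})
```

Because the weights here are random and untrained, the scores carry no meaning; the sketch only shows how a single summed loss over all tags would update the shared layer, in contrast to training one independent classifier per tag.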



  1. enwiki-20190220-pages-articles:




This research is based on work supported by the Taiwan Ministry of Science and Technology under Grant Nos. MOST 107-2410-H-006 040-MY3 and 108-2511-H-006-009. We would like to thank the Center of Innovative Fintech Business Models for a research grant supporting this work.



Corresponding author

Correspondence to Hei-Chia Wang.


Cite this article

Wang, HC., Syu, SW. & Wongchaisuwat, P. A method of music autotagging based on audio and lyrics. Multimed Tools Appl 80, 15511–15539 (2021).
