Abstract
The essence of music is inherently multi-modal, with audio and lyrics going hand in hand. However, little research has examined the intricacies of this multi-modal nature of music and its relation to genre. Our work uses this multi-modality to present spectro-lyrical embeddings for music representation (SLEM), leveraging open-sourced, lightweight, state-of-the-art deep learning vision and language models to encode songs. This work summarises extensive experimentation with over 20 deep learning-based music embeddings on a self-curated, hand-labeled, multi-lingual dataset of 226 recent songs spread over 5 genres. Our aim is to study how varying the relative weight of lyrics and spectrograms in the embeddings affects multi-class genre classification. We show that a simple linear combination of both modalities outperforms either modality alone. Our methods achieve accuracies ranging from 81.08% to 98.60% across genres by applying the K-nearest neighbors algorithm to the multimodal embeddings. We further study the structure of genres in this representational space, including their misclassification patterns, visual clustering with EM-GMM, and the domain-specific meaning of the multi-modal weight for each genre with respect to 'instrumentalness' and 'energy' metadata. SLEM presents one of the first end-to-end methods that uses spectro-lyrical embeddings without hand-engineered features.
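The abstract's core idea, a weighted linear combination of spectrogram and lyric embeddings classified with K-nearest neighbors, can be sketched as follows. This is an illustrative sketch with random stand-in embeddings, not the authors' actual pipeline; the concatenation-based weighting scheme, the dimensions, and the value of `w` are assumptions for demonstration only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n_songs, dim = 60, 8

# Stand-ins for the outputs of a vision encoder (on spectrograms)
# and a language encoder (on lyrics); both are random here.
spec_emb = rng.normal(size=(n_songs, dim))
lyric_emb = rng.normal(size=(n_songs, dim))
genres = rng.integers(0, 5, size=n_songs)  # 5 genre labels

def slem_embedding(spec, lyrics, w):
    """Hypothetical spectro-lyrical embedding: concatenate the two
    modalities, scaling lyrics by w and spectrograms by (1 - w)."""
    return np.hstack([(1 - w) * spec, w * lyrics])

# Equal weighting of the two modalities, then a KNN classifier.
X = slem_embedding(spec_emb, lyric_emb, w=0.5)
knn = KNeighborsClassifier(n_neighbors=3).fit(X, genres)
train_acc = knn.score(X, genres)
```

Sweeping `w` from 0 (spectrograms only) to 1 (lyrics only) and re-fitting the classifier at each step reproduces the kind of modality-weight study the abstract describes.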
Data availability statement
The dataset generated for this research is made publicly available by the authors in a GitHub repository: https://github.com/aryanmehra1999/SLEM. It is a single, easily downloadable CSV file containing Spotify-generated metadata for 227 songs along with their lyrics.
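A minimal sketch of working with a CSV of this shape in pandas. The column names below are hypothetical, chosen only to illustrate the fields described in the data statement (Spotify metadata plus lyrics); consult the repository for the actual schema.

```python
import io
import pandas as pd

# Stand-in for the repository's CSV file, with assumed columns.
csv_text = """title,genre,lyrics,energy,instrumentalness
Song A,pop,la la la,0.81,0.02
Song B,rock,hey hey hey,0.93,0.10
"""

df = pd.read_csv(io.StringIO(csv_text))
high_energy = df[df["energy"] > 0.9]  # filter on a metadata column
```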
Funding
The authors did not receive support from any organization for the submitted work.
Ethics declarations
Competing interests
The authors have no relevant financial or non-financial interests to disclose.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Mehra, A., Mehra, A. & Narang, P. Classification and study of music genres with multimodal Spectro-Lyrical Embeddings for Music (SLEM). Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-024-19160-5