Abstract
Separation of speech and music plays a vital role in many fields related to audio and speech processing. The spectrograms of speech and music exhibit distinct patterns, which motivates the discrimination of speech and music signals within an audio segment. These patterns are further emphasized by applying Sobel edge kernels to Mel-spectrograms. For this work, we created a dataset from the “All India Radio” news archives containing both separate and overlapped speech and music data in different languages. Different input features are extracted from these audio segments and further emphasized before being fed to various classifiers for distinguishing speech and music frames. We also compared the classification algorithms in terms of accuracy, and found that the convolutional neural network approach on Mel-spectrograms and the MFCC-delta-RNN method give significantly better results than the other approaches. To examine how these approaches perform on audio data in different languages, we applied the proposed method to three languages, namely Bengali, Punjabi, and Tamil, and observed that its performance is consistent across all of them. The paper also addresses the problem of classifying audio segments with overlapped speech and music regions, and achieves a good level of accuracy.
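The abstract mentions emphasizing spectrogram patterns with Sobel edge kernels; the paper's own implementation is not reproduced here. As a rough, hedged illustration only (the toy spectrogram and function name are made up for this sketch, and a real pipeline would operate on a Mel-spectrogram of actual audio), edge emphasis with Sobel kernels on a spectrogram-like matrix might look like:

```python
import numpy as np

def sobel_edges(spec: np.ndarray) -> np.ndarray:
    """Emphasize time-frequency edges in a (mel-)spectrogram by
    convolving with horizontal and vertical Sobel kernels and
    combining the two gradient magnitudes."""
    kx = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]], dtype=float)  # gradient along time axis
    ky = kx.T                                  # gradient along frequency axis
    pad = np.pad(spec, 1, mode="edge")         # replicate borders
    gx = np.zeros(spec.shape)
    gy = np.zeros(spec.shape)
    for i in range(spec.shape[0]):
        for j in range(spec.shape[1]):
            window = pad[i:i + 3, j:j + 3]
            gx[i, j] = np.sum(window * kx)
            gy[i, j] = np.sum(window * ky)
    return np.hypot(gx, gy)                    # gradient magnitude

# Toy "spectrogram": a sharp vertical edge, as might appear at an onset.
mel = np.zeros((8, 8))
mel[:, 4:] = 1.0
edges = sobel_edges(mel)  # large values only around column 3-4
```

The output highlights exactly the abrupt time-frequency transitions that, per the abstract, help differentiate the relatively harmonic, sustained structure of music from the rapidly varying structure of speech.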
Availability of data and materials
The dataset generated and/or analysed during the current study is available from the corresponding author on reasonable request.
Code availability
The code is available on reasonable request.
Funding
No funds, grants, or other support was received.
Author information
Authors and Affiliations
Contributions
OS took part in data collection, data modeling, formulating the algorithm, interpretation, and paper writing. AB took part in conceptualizing the work and data interpretation. GB took part in data collection, data modeling, and execution.
Corresponding author
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Consent to participate
Yes
Consent for publication
Yes
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Sawant, O., Bhowmick, A. & Bhagwat, G. Separation of speech & music using temporal-spectral features and neural classifiers. Evol. Intel. (2023). https://doi.org/10.1007/s12065-023-00828-0
Received:
Revised:
Accepted:
Published: