
Separation of speech & music using temporal-spectral features and neural classifiers

  • Research Paper
  • Published:
Evolutionary Intelligence

Abstract

Separation of speech and music plays a vital role in many areas of audio and speech processing. The spectrograms of speech and music show distinct patterns, which motivates the differentiation of speech and music signals within an audio segment. These patterns are further emphasized using Sobel edge kernels applied to Mel-spectrograms. For this work, we built a dataset from the "All India Radio" news archives containing both separate and overlapped speech and music data in different languages. Different input features are extracted from these audio segments and further emphasized before being fed to classifiers that distinguish speech and music frames. We also compare the classification algorithms in terms of accuracy. We find that the convolutional neural network approach based on Mel-spectrograms and the MFCC-delta-RNN method give significantly better results than the other approaches. To examine how these approaches behave on audio data in different languages, we apply the proposed method to three languages, namely Bengali, Punjabi, and Tamil, and observe that its performance is consistent across all of them. The paper also addresses the problem of classifying audio segments with overlapped speech and music regions and achieves a good level of accuracy.
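As a rough illustration of the kind of feature pipeline the abstract describes, the sketch below computes a Sobel-emphasized Mel-spectrogram (suitable as CNN input) and an MFCC-delta sequence (suitable as RNN input). This is not the authors' code: the use of librosa and SciPy, the function names, and all parameter values (sample rate, frame size, number of Mel bands and MFCCs) are assumptions for illustration only.

```python
# Illustrative sketch (not the paper's implementation): Mel-spectrogram with
# Sobel edge emphasis and MFCC + delta features, per the pipeline outlined in
# the abstract. All numeric parameters are assumed, not taken from the paper.
import numpy as np
import librosa
from scipy import ndimage

def mel_sobel_features(path, sr=16000, n_fft=400, hop=160, n_mels=64):
    """Mel-spectrogram (in dB) plus a Sobel edge-magnitude map."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    # Sobel kernels emphasize the edge structure that differs between
    # speech and music spectrograms (e.g. steady harmonic lines in music).
    edges = np.hypot(ndimage.sobel(mel_db, axis=0),
                     ndimage.sobel(mel_db, axis=1))
    return mel_db, edges  # could be stacked as input channels for a CNN

def mfcc_delta_features(path, sr=16000, n_mfcc=13):
    """Frame-wise MFCC, delta, and delta-delta sequence for an RNN classifier."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    d1 = librosa.feature.delta(mfcc)
    d2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, d1, d2]).T  # shape: (frames, 3 * n_mfcc)
```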



Availability of data and materials

The dataset generated and/or analysed during the current study is available from the corresponding author on reasonable request.

Code availability

The code is available on reasonable request.


Funding

No funds, grants, or other support was received.

Author information

Contributions

OS took part in data collection, data modelling, algorithm formulation, interpretation, and paper writing. AB took part in conceptualizing the work and data interpretation. GB took part in data collection, data modelling, and execution.

Corresponding author

Correspondence to Anirban Bhowmick.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Consent to participate

Yes

Consent for publication

Yes

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sawant, O., Bhowmick, A. & Bhagwat, G. Separation of speech & music using temporal-spectral features and neural classifiers. Evol. Intel. (2023). https://doi.org/10.1007/s12065-023-00828-0

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s12065-023-00828-0
