
Long-Term Multi-band Frequency-Domain Mean-Crossing Rate (FDMCR): A Novel Feature Extraction Algorithm for Speech/Music Discrimination

Published in: Circuits, Systems, and Signal Processing

Abstract

The volume of multimedia data has grown dramatically, making it increasingly important to distinguish desired information from other content. Speech/music discrimination (SMD) is a field of audio analytics that aims to detect and classify speech and music segments in an audio file. This paper proposes a novel feature extraction method called Long-Term Multi-band Frequency-Domain Mean-Crossing Rate (FDMCR). The proposed feature computes the average frequency-domain mean-crossing rate along the frequency axis for each perceptual Mel-scaled frequency band of the signal power spectrum. The class-separation capability of this feature is first measured with well-known divergence criteria: Maximum Fisher Discriminant Ratio (MFDR), Bhattacharyya divergence, and Jeffreys/Symmetric Kullback–Leibler (SKL) divergence. The proposed feature is then applied to speech/music discrimination on two well-known speech-music datasets, GTZAN and S&S (Scheirer and Slaney). The results obtained on the two datasets using conventional classifiers (k-NN, GMM, and SVM) as well as deep learning-based classifiers (CNN, LSTM, and BiLSTM) show that the proposed feature outperforms other features in speech/music discrimination.
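The abstract's description of the feature — the mean-crossing rate of the power spectrum along the frequency axis, per Mel band, averaged long-term over frames — can be sketched as below. This is a minimal illustration, not the paper's exact formulation: the window, the Mel band partition, and the parameter values (`n_fft`, `hop`, `n_bands`) are assumptions chosen for readability.

```python
import numpy as np

def fdmcr(signal, sr, n_fft=512, hop=256, n_bands=20):
    """Sketch of a long-term multi-band frequency-domain mean-crossing
    rate: for each Mel-scaled band of the short-time power spectrum,
    count how often the band's values cross their mean along the
    frequency axis, then average over all frames (long-term).
    Parameter values are illustrative, not the paper's settings."""
    # Short-time power spectrum, shape (frames, frequency bins)
    n_frames = 1 + (len(signal) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2

    # Mel-scaled band edges mapped onto the available FFT bins
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(0.0, hz_to_mel(sr / 2.0), n_bands + 1)
    edges = np.floor(mel_to_hz(mels) / (sr / 2.0)
                     * (power.shape[1] - 1)).astype(int)

    feat = np.zeros(n_bands)
    for b in range(n_bands):
        lo, hi = edges[b], max(edges[b + 1], edges[b] + 2)
        band = power[:, lo:hi]
        # Center each frame's band spectrum on its own mean, then count
        # sign changes along the frequency axis (mean-crossings)
        centered = band - band.mean(axis=1, keepdims=True)
        signs = np.signbit(centered).astype(np.int8)
        crossings = np.sum(np.abs(np.diff(signs, axis=1)), axis=1)
        feat[b] = crossings.mean()  # long-term average over frames
    return feat
```

The result is one value per Mel band; noise-like spectra (flat within a band) tend to cross their band mean more often than spectra dominated by strong harmonic peaks, which is the intuition behind using the statistic for speech/music discrimination.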
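The three divergence criteria named in the abstract have closed forms when each class is modeled as a univariate Gaussian. The sketch below illustrates them for a scalar feature value per class; it is an illustration of the criteria themselves, not the paper's evaluation protocol, and the Gaussian-per-class assumption is mine.

```python
import numpy as np

def separability(x, y):
    """Class-separation scores for a scalar feature, fitting a
    univariate Gaussian to each class: Fisher discriminant ratio,
    Bhattacharyya divergence, and symmetric KL (Jeffreys) divergence."""
    m1, v1 = x.mean(), x.var()
    m2, v2 = y.mean(), y.var()
    # Fisher discriminant ratio: between-class over within-class spread
    fisher = (m1 - m2) ** 2 / (v1 + v2)
    # Bhattacharyya divergence between two univariate Gaussians
    bhatt = 0.25 * (m1 - m2) ** 2 / (v1 + v2) \
        + 0.5 * np.log((v1 + v2) / (2.0 * np.sqrt(v1 * v2)))
    # Symmetric KL (Jeffreys) divergence: KL(p||q) + KL(q||p)
    skl = 0.5 * ((v1 / v2 + v2 / v1 - 2.0)
                 + (m1 - m2) ** 2 * (1.0 / v1 + 1.0 / v2))
    return fisher, bhatt, skl
```

All three scores are zero when the two class distributions coincide and grow as the classes separate, which is why they serve as feature-quality criteria before any classifier is trained.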


Figures 1–7 appear in the full article.

Notes

  1. The source code of FDMCR has been registered as "Long-Term Multi-band Frequency-Domain Mean-Crossing Rate (FDMCR) feature" on IEEE DataPort [21], DOI: 10.21227/H2NW6G.

  2. Available: http://opihi.cs.uvic.ca/sound/music_speech.tar.gz. Accessed: 3/13/2021.

  3. Accessible from: http://www.ee.columbia.edu/~dpwe/sounds/musp/

References

  1. K.T. Abou-Moustafa, F.P. Ferrie, A note on metric properties for some divergence measures: the Gaussian case. in Asian Conference on Machine Learning, pp. 1–15 (2012)

  2. F. Alías, J. Socoró, X. Sevillano, A review of physical and perceptual feature extraction techniques for speech, music and environmental sounds. Appl. Sci. 6(5), 143 (2016)

  3. G. Aneeja, B. Yegnanarayana, Single frequency filtering approach for discriminating speech and nonspeech. IEEE/ACM Trans. Audio Speech Lang. Process. 23(4), 705–717 (2015)

  4. M. Anusuya, S. Katti, Front end analysis of speech recognition: a review. Int. J. Speech Technol. 14(2), 99–145 (2011)

  5. R.G. Balamurali, C. Rajagopal, Speech/music discrimination (2017). US Patent 9,613,640

  6. A.L. Berenzweig, D.P. Ellis, Locating singing voice segments within music signals. in Proceedings of the 2001 IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics (Cat. No. 01TH8575), pp. 119–122 (2001)

  7. M. Bhattacharjee, S.M. Prasanna, P. Guha, Speech/music classification using features from spectral peaks. IEEE/ACM Trans. Audio Speech Lang. Process. 28, 1549–1559 (2020)

  8. G.K. Birajdar, M.D. Patil, Speech and music classification using spectrogram based statistical descriptors and extreme learning machine. Multimed. Tools Appl. 78(11), 15141–15168 (2019)

  9. G.K. Birajdar, M.D. Patil, Speech/music classification using visual and spectral chromagram features. J. Ambient Intell. Hum. Comput. 11, 1–19 (2019)

  10. M.J. Carey, E.S. Parris, H. Lloyd-Thomas, A comparison of features for speech, music discrimination. in 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No. 99CH36258), vol. 1, pp. 149–152 (1999)

  11. A. Chen, M.A. Hasegawa-Johnson, Mixed stereo audio classification using a stereo-input mixed-to-panned level feature. IEEE/ACM Trans. Audio Speech Lang. Process. 22(12), 2025–2033 (2014)

  12. T. Drugman, Y. Stylianou, Y. Kida, M. Akamine, Voice activity detection: merging source and filter-based information. IEEE Signal Process. Lett. 23(2), 252–256 (2015)

  13. S. Duan, J. Zhang, P. Roe, M. Towsey, A survey of tagging techniques for music, speech and environmental sound. Artif. Intell. Rev. 42(4), 637–661 (2014)

  14. K. El-Maleh, M. Klein, G. Petrucci, P. Kabal, Speech/music discrimination for multimedia applications. in 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 00CH37100), vol. 4, pp. 2445–2448 (2000)

  15. G. Fuchs, A robust speech/music discriminator for switched audio coding. in 2015 23rd European Signal Processing Conference (EUSIPCO), pp. 569–573 (2015)

  16. P.K. Ghosh, A. Tsiartas, S. Narayanan, Robust voice activity detection using long-term signal variability. IEEE Trans. Audio Speech Lang. Process. 19(3), 600–613 (2010)

  17. P. Gimeno, I. Viñals, A. Ortega, A. Miguel, E. Lleida, Multiclass audio segmentation based on recurrent neural networks for broadcast domain data. EURASIP J. Audio Speech Music Process. 2020(1), 1–19 (2020)

  18. B.Y. Jang, W.H. Heo, J.H. Kim, O.W. Kwon, Music detection from broadcast contents using convolutional neural networks with a mel-scale kernel. EURASIP J. Audio Speech Music Process. 2019(1), 11 (2019)

  19. M. Joshi, S. Nadgir, Extraction of feature vectors for analysis of musical instruments. in 2014 International Conference on Advances in Electronics Computers and Communications, pp. 1–6 (2014)

  20. S. Kacprzak, B. Chwiećko, B. Ziółko, Speech/music discrimination for analysis of radio stations. in 2017 International Conference on Systems, Signals and Image Processing (IWSSIP), pp. 1–4 (2017)

  21. M.R. Kahrizi, Long-term multi-band frequency-domain mean-crossing rate (FDMCR) feature. IEEE DataPort. https://doi.org/10.21227/H2NW6G

  22. M.R. Kahrizi, S.J. Kabudian, Long-term spectral pseudo-entropy (ltspe): a new robust feature for speech activity detection. J. Inf. Syst. Telecommun. (JIST) 6(4), 204–208 (2018). https://doi.org/10.7508/jist.2018.04.003

  23. M.R. Kahrizi, S.J. Kabudian, Projectiles optimization: a novel metaheuristic algorithm for global optimization. Int. J. Eng. (IJE) IJE Trans. A Basics 33(10), 1924–1938 (2020). https://doi.org/10.5829/ije.2020.33.10a.11

  24. B.K. Khonglah, S.M. Prasanna, Speech/music classification using speech-specific features. Digit. Signal Process. 48, 71–83 (2016)

  25. B.K. Khonglah, R. Sharma, S.M. Prasanna, Speech vs music discrimination using empirical mode decomposition. in 2015 Twenty First National Conference on Communications (NCC), pp. 1–6 (2015)

  26. A.A. Khudavand, S. Chikkamath, S. Nirmala, N. Iyer, Music/non-music discrimination using convolutional neural networks, in Soft Computing and Signal Processing. ed. by V.S. Reddy, V.K. Prasad, J. Wang, K.T.V. Reddy (Springer Singapore, Singapore, 2021), pp. 17–28

  27. S.J. Kim, A. Magnani, S. Boyd, Robust Fisher discriminant analysis. in Advances in Neural Information Processing Systems, pp. 659–666 (2006)

  28. A. Makur, S.K. Mitra, Warped discrete-Fourier transform: theory and applications. IEEE Trans. Circuits Syst. I Fundam. Theory Appl. 48(9), 1086–1093 (2001)

  29. V. Malenovsky, T. Vaillancourt, W. Zhe, K. Choo, V. Atti, Two-stage speech/music classifier with decision smoothing and sharpening in the evs codec. in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5718–5722 (2015)

  30. O.M. Mubarak, E. Ambikairajah, J. Epps, Novel features for effective speech and music discrimination. in 2006 IEEE International Conference on Engineering of Intelligent Systems, pp. 1–5 (2006)

  31. J.E. Muñoz-Exposito, S. Garcia-Galan, N. Ruiz-Reyes, P. Vera-Candeas, F. Rivas-Peña, Speech music discrimination using a single warped lpc-based feature. in Proc. ISMIR, vol. 5, pp. 16–25 (2005)

  32. M. Papakostas, T. Giannakopoulos, Speech-music discrimination using deep visual feature extractors. Expert Syst. Appl. 114, 334–344 (2018)

  33. G. Peeters, A large set of audio features for sound description (similarity and classification) in the cuidado project (2004)

  34. J. Pinquier, J.L. Rouas, R. André-Obrecht, A fusion study in speech/music classification. in 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (ICASSP'03), vol. 2, pp. II–17 (2003)

  35. J. Ramírez, J.C. Segura, C. Benítez, A. De La Torre, A. Rubio, Efficient voice activity detection algorithms using long-term speech information. Speech Commun. 42(3–4), 271–287 (2004)

  36. S.O. Sadjadi, J.H. Hansen, Unsupervised speech activity detection using voicing measures and perceptual spectral flux. IEEE Signal Process. Lett. 20(3), 197–200 (2013)

  37. J. Saunders, Real-time discrimination of broadcast speech/music. in 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, vol. 2, pp. 993–996 (1996)

  38. E. Scheirer, M. Slaney, Construction and evaluation of a robust multifeature speech/music discriminator. in 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 1331–1334 (1997)

  39. G. Sell, P. Clark, Music tonality features for speech/music discrimination. in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2489–2493 (2014)

  40. B. Thompson, Discrimination between singing and speech in real-world audio. in 2014 IEEE Spoken Language Technology Workshop (SLT), pp. 407–412 (2014)

  41. W.H. Tsai, C.H. Ma, Automatic speech and singing discrimination for audio data indexing. in Big Data Applications and Use Cases, pp. 33–47 (Springer, 2016)

  42. A. Tsiartas, T. Chaspari, N. Katsamanis, P.K. Ghosh, M. Li, M. Van Segbroeck, A. Potamianos, S. Narayanan, Multi-band long-term signal variability features for robust voice activity detection. in Interspeech, pp. 718–722 (2013)

  43. N. Tsipas, L. Vrysis, C. Dimoulas, G. Papanikolaou, Efficient audio-driven multimedia indexing through similarity-based speech/music discrimination. Multimed. Tools Appl. 76(24), 25603–25621 (2017)

  44. G. Tzanetakis, P. Cook, Musical genre classification of audio signals. IEEE Trans. Speech Audio Process. 10(5), 293–302 (2002)

  45. E. Wieser, M. Husinsky, M. Seidl, Speech/music discrimination in a large database of radio broadcasts from the wild. in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2134–2138 (2014)

  46. G. Williams, D.P. Ellis, Speech/music discrimination based on posterior probability features. in Sixth European Conference on Speech Communication and Technology (1999)

Author information

Correspondence to Mohammad Rasoul Kahrizi.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Kahrizi, M.R., Kabudian, S.J. Long-Term Multi-band Frequency-Domain Mean-Crossing Rate (FDMCR): A Novel Feature Extraction Algorithm for Speech/Music Discrimination. Circuits Syst Signal Process 42, 6929–6950 (2023). https://doi.org/10.1007/s00034-023-02440-0
