Abstract
The volume of multimedia data has grown dramatically, making it increasingly important to distinguish relevant information from other content. Speech/music discrimination is a field of audio analytics that aims to detect and classify speech and music segments in an audio file. This paper proposes a novel feature extraction method called the Long-Term Multi-band Frequency-Domain Mean-Crossing Rate (FDMCR). The proposed feature computes the average frequency-domain mean-crossing rate along the frequency axis for each of the perceptual Mel-scaled frequency bands of the signal power spectrum. The class-separation capability of this feature is first measured using well-known divergence criteria: the Maximum Fisher Discriminant Ratio (MFDR), the Bhattacharyya divergence, and the Jeffreys/symmetric Kullback–Leibler (SKL) divergence. The proposed feature is then applied to the speech/music discrimination (SMD) task on two well-known speech/music datasets, GTZAN and S&S (Scheirer and Slaney). Results on both datasets, obtained with conventional classifiers (k-NN, GMM, and SVM) as well as deep learning-based classifiers (CNN, LSTM, and BiLSTM), show that the proposed feature outperforms competing features in speech/music discrimination.
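The feature as described in the abstract can be sketched in a few lines of NumPy. This is only an illustrative reading of the description, not the authors' implementation: the frame length, hop size, band count, and the simplified rectangular Mel bands below are all assumptions, and the paper's long-term aggregation details are not reproduced here.

```python
import numpy as np

def hz_to_mel(f):
    """Convert frequency in Hz to the Mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Convert Mel values back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def fdmcr(signal, sr, n_fft=512, hop=256, n_bands=8):
    """Illustrative FDMCR-style feature: for each Mel-spaced band of the
    power spectrum, measure how often the spectrum crosses the band mean
    along the FREQUENCY axis, then average over all frames."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Power spectrogram: one row per frame, one column per frequency bin
    spec = np.array([
        np.abs(np.fft.rfft(window * signal[i * hop : i * hop + n_fft])) ** 2
        for i in range(n_frames)
    ])
    # Mel-spaced band edges, mapped back to FFT bin indices
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_bands + 1))
    bins = np.clip((edges / (sr / 2) * (n_fft // 2)).astype(int), 0, n_fft // 2)
    feats = np.zeros(n_bands)
    for b in range(n_bands):
        # Guarantee at least two bins per band so a crossing rate is defined
        band = spec[:, bins[b] : max(bins[b + 1], bins[b] + 2)]
        centered = band - band.mean(axis=1, keepdims=True)
        # A "mean crossing" is a sign change between adjacent frequency bins
        crossings = np.signbit(centered[:, :-1]) != np.signbit(centered[:, 1:])
        feats[b] = crossings.mean()
    return feats
```

Each element of the returned vector lies in [0, 1]; speech, with its harmonic fine structure, and music tend to produce different crossing rates per band, which is what the divergence criteria in the paper quantify.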
Notes
The source code of the FDMCR has been registered as "Long-Term Multi-band Frequency-Domain Mean-Crossing Rate (FDMCR) feature" on IEEE DataPort [21] with this DOI: "10.21227/H2NW6G".
Available: http://opihi.cs.uvic.ca/sound/music_speech.tar.gz. Accessed: 3/13/2021.
Accessible from: http://www.ee.columbia.edu/~dpwe/sounds/musp/.
References
K.T. Abou-Moustafa, F.P. Ferrie, A note on metric properties for some divergence measures: the Gaussian case. in Asian Conference on Machine Learning, pp. 1–15 (2012)
F. Alías, J. Socoró, X. Sevillano, A review of physical and perceptual feature extraction techniques for speech, music and environmental sounds. Appl. Sci. 6(5), 143 (2016)
G. Aneeja, B. Yegnanarayana, Single frequency filtering approach for discriminating speech and nonspeech. IEEE/ACM Trans. Audio Speech Lang. Process. 23(4), 705–717 (2015)
M. Anusuya, S. Katti, Front end analysis of speech recognition: a review. Int. J. Speech Technol. 14(2), 99–145 (2011)
R.G. Balamurali, C. Rajagopal, Speech/music discrimination (2017). US Patent 9,613,640
A.L. Berenzweig, D.P. Ellis, Locating singing voice segments within music signals. in Proceedings of the 2001 IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics (Cat. No. 01TH8575), pp. 119–122 (2001)
M. Bhattacharjee, S.M. Prasanna, P. Guha, Speech/music classification using features from spectral peaks. IEEE/ACM Trans. Audio Speech Lang. Process. 28, 1549–1559 (2020)
G.K. Birajdar, M.D. Patil, Speech and music classification using spectrogram based statistical descriptors and extreme learning machine. Multimed. Tools Appl. 78(11), 15141–15168 (2019)
G.K. Birajdar, M.D. Patil, Speech/music classification using visual and spectral chromagram features. J. Ambient Intell. Hum. Comput. 11, 1–19 (2019)
M.J. Carey, E.S. Parris, H. Lloyd-Thomas, A comparison of features for speech, music discrimination. in 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No. 99CH36258), vol. 1, pp. 149–152 (1999)
A. Chen, M.A. Hasegawa-Johnson, Mixed stereo audio classification using a stereo-input mixed-to-panned level feature. IEEE/ACM Trans. Audio Speech Lang. Process. 22(12), 2025–2033 (2014)
T. Drugman, Y. Stylianou, Y. Kida, M. Akamine, Voice activity detection: merging source and filter-based information. IEEE Signal Process. Lett. 23(2), 252–256 (2015)
S. Duan, J. Zhang, P. Roe, M. Towsey, A survey of tagging techniques for music, speech and environmental sound. Artif. Intell. Rev. 42(4), 637–661 (2014)
K. El-Maleh, M. Klein, G. Petrucci, P. Kabal, Speech/music discrimination for multimedia applications. in 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 00CH37100), vol. 4, pp. 2445–2448 (2000)
G. Fuchs, A robust speech/music discriminator for switched audio coding. in 2015 23rd European Signal Processing Conference (EUSIPCO), pp. 569–573 (2015)
P.K. Ghosh, A. Tsiartas, S. Narayanan, Robust voice activity detection using long-term signal variability. IEEE Trans. Audio Speech Lang. Process. 19(3), 600–613 (2010)
P. Gimeno, I. Viñals, A. Ortega, A. Miguel, E. Lleida, Multiclass audio segmentation based on recurrent neural networks for broadcast domain data. EURASIP J. Audio Speech Music Process. 2020(1), 1–19 (2020)
B.Y. Jang, W.H. Heo, J.H. Kim, O.W. Kwon, Music detection from broadcast contents using convolutional neural networks with a mel-scale kernel. EURASIP J. Audio Speech Music Process. 2019(1), 11 (2019)
M. Joshi, S. Nadgir, Extraction of feature vectors for analysis of musical instruments. in 2014 International Conference on Advances in Electronics Computers and Communications, pp. 1–6 (2014)
S. Kacprzak, B. Chwiećko, B. Ziółko, Speech/music discrimination for analysis of radio stations. in 2017 International Conference on Systems, Signals and Image Processing (IWSSIP), pp. 1–4 (2017)
M.R. Kahrizi, Long-Term Multi-band Frequency-Domain Mean-Crossing Rate (FDMCR) feature. IEEE DataPort. https://doi.org/10.21227/H2NW6G
M.R. Kahrizi, S.J. Kabudian, Long-term spectral pseudo-entropy (ltspe): a new robust feature for speech activity detection. J. Inf. Syst. Telecommun. (JIST) 6(4), 204–208 (2018). https://doi.org/10.7508/jist.2018.04.003
M.R. Kahrizi, S.J. Kabudian, Projectiles optimization: a novel metaheuristic algorithm for global optimization. Int. J. Eng. (IJE) Trans. A Basics 33(10), 1924–1938 (2020). https://doi.org/10.5829/ije.2020.33.10a.11
B.K. Khonglah, S.M. Prasanna, Speech/music classification using speech-specific features. Digit. Signal Process. 48, 71–83 (2016)
B.K. Khonglah, R. Sharma, S.M. Prasanna, Speech vs music discrimination using empirical mode decomposition. in 2015 Twenty First National Conference on Communications (NCC), pp. 1–6 (2015)
A.A. Khudavand, S. Chikkamath, S. Nirmala, N. Iyer, Music/non-music discrimination using convolutional neural networks, in Soft Computing and Signal Processing. ed. by V.S. Reddy, V.K. Prasad, J. Wang, K.T.V. Reddy (Springer Singapore, Singapore, 2021), pp.17–28
S.J. Kim, A. Magnani, S. Boyd, Robust Fisher discriminant analysis. in Advances in Neural Information Processing Systems, pp. 659–666 (2006)
A. Makur, S.K. Mitra, Warped discrete-Fourier transform: theory and applications. IEEE Trans. Circuits Syst. I Fundam. Theory Appl. 48(9), 1086–1093 (2001)
V. Malenovsky, T. Vaillancourt, W. Zhe, K. Choo, V. Atti, Two-stage speech/music classifier with decision smoothing and sharpening in the EVS codec. in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5718–5722 (2015)
O.M. Mubarak, E. Ambikairajah, J. Epps, Novel features for effective speech and music discrimination. in 2006 IEEE International Conference on Engineering of Intelligent Systems, pp. 1–5 (2006)
J.E. Muñoz-Exposito, S. Garcia-Galan, N. Ruiz-Reyes, P. Vera-Candeas, F. Rivas-Peña, Speech/music discrimination using a single warped LPC-based feature. in Proc. ISMIR, vol. 5, pp. 16–25 (2005)
M. Papakostas, T. Giannakopoulos, Speech-music discrimination using deep visual feature extractors. Expert Syst. Appl. 114, 334–344 (2018)
G. Peeters, A large set of audio features for sound description (similarity and classification) in the cuidado project (2004)
J. Pinquier, J.L. Rouas, R. André-Obrecht, A fusion study in speech/music classification. in 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'03), vol. 2, pp. II-17 (2003)
J. Ramírez, J.C. Segura, C. Benítez, A. De La Torre, A. Rubio, Efficient voice activity detection algorithms using long-term speech information. Speech Commun. 42(3–4), 271–287 (2004)
S.O. Sadjadi, J.H. Hansen, Unsupervised speech activity detection using voicing measures and perceptual spectral flux. IEEE Signal Process. Lett. 20(3), 197–200 (2013)
J. Saunders, Real-time discrimination of broadcast speech/music. in 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, vol. 2, pp. 993–996 (1996)
E. Scheirer, M. Slaney, Construction and evaluation of a robust multifeature speech/music discriminator. in 1997 IEEE international conference on acoustics, speech, and signal processing, vol. 2, pp. 1331–1334 (1997)
G. Sell, P. Clark, Music tonality features for speech/music discrimination. in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2489–2493 (2014)
B. Thompson, Discrimination between singing and speech in real-world audio. in 2014 IEEE Spoken Language Technology Workshop (SLT), pp. 407–412 (2014)
W.H. Tsai, C.H. Ma, Automatic speech and singing discrimination for audio data indexing. in Big Data Applications and Use Cases, pp. 33–47. (Springer, 2016)
A. Tsiartas, T. Chaspari, N. Katsamanis, P.K. Ghosh, M. Li, M. Van Segbroeck, A. Potamianos, S. Narayanan, Multi-band long-term signal variability features for robust voice activity detection. in Interspeech, pp. 718–722 (2013)
N. Tsipas, L. Vrysis, C. Dimoulas, G. Papanikolaou, Efficient audio-driven multimedia indexing through similarity-based speech/music discrimination. Multimed. Tools Appl. 76(24), 25603–25621 (2017)
G. Tzanetakis, P. Cook, Musical genre classification of audio signals. IEEE Trans. Speech Audio Process. 10(5), 293–302 (2002)
E. Wieser, M. Husinsky, M. Seidl, Speech/music discrimination in a large database of radio broadcasts from the wild. in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2134–2138 (2014)
G. Williams, D.P. Ellis, Speech/music discrimination based on posterior probability features. in Sixth European Conference on Speech Communication and Technology (1999)
Cite this article
Kahrizi, M.R., Kabudian, S.J. Long-Term Multi-band Frequency-Domain Mean-Crossing Rate (FDMCR): A Novel Feature Extraction Algorithm for Speech/Music Discrimination. Circuits Syst Signal Process 42, 6929–6950 (2023). https://doi.org/10.1007/s00034-023-02440-0