Abstract
Degraded speech input reduces the accuracy of speaker recognition techniques. We address speaker recognition from noise-corrupted audio signals in the presence of four noise types (factory noise, car noise, street traffic noise, and voice babble noise), as well as from noise-suppressed, enhanced speech. The goal of this research is a speaker recognition algorithm that is robust to a wide range of speech capture quality, background conditions, and interference. We combine three distinct features: Mel Frequency Cepstral Coefficients (MFCC), Normalized Pitch Frequency (NPF), and Normalized Phase Cepstral Coefficients (NPCC). The combination is motivated by the observation that MFCC, NPF, and NPCC capture different characteristics of speech. A Convolutional Neural Network (CNN) learns speaker-dependent attributes from segments of Mel features, normalized pitch features, and phase cepstral features of clean, corrupted, and enhanced speech. Performance is measured on the ITU-T test signals and compared with previous algorithms at Signal-to-Noise Ratios (SNR) of 0 dB, 5 dB, 10 dB, and 15 dB. For enhanced speech, all three features (MFCC, NPF, and NPCC) yielded effective speaker identification and verification performance.
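The abstract does not say how the noise-corrupted signals were produced. A common way to obtain mixtures at a fixed SNR is to scale the noise so that the ratio of speech power to noise power matches the target. The sketch below illustrates that idea under stated assumptions: the function names `power`, `mix_at_snr`, and `measured_snr_db` are hypothetical, and the sine tone and Gaussian noise are synthetic stand-ins for the ITU-T speech and the recorded noise types, not the authors' data or procedure.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add it."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    noise_power = power(noise)
    if noise_power == 0:
        raise ValueError("noise is silent")
    # Target noise power follows from SNR_dB = 10 * log10(P_speech / P_noise).
    target_noise_power = power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, noisy):
    """SNR in dB recovered from the clean signal and the mixture."""
    residual = [y - s for s, y in zip(speech, noisy)]
    return 10 * math.log10(power(speech) / power(residual))


# Synthetic stand-ins (assumptions): a 200 Hz tone sampled at 8 kHz and Gaussian noise.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 200 * t / 8000) for t in range(8000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(8000)]

# One mixture per SNR level named in the abstract.
mixtures = {snr: mix_at_snr(speech, noise, snr) for snr in (0, 5, 10, 15)}
```

Calling `measured_snr_db(speech, mixtures[5])` recovers the requested level up to floating-point error, which is a quick sanity check before any feature extraction is run on the mixtures.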
Data availability
The datasets generated and/or analysed during the current study are available in the ITU-T Test Signals for Telecommunication Systems repository, https://www.itu.int/net/itu-t/sigdb/genaudio/Pseries.htm.
Funding
No funding was received to assist with the preparation of this manuscript.
Ethics declarations
Conflict of interest
The authors have no competing interests to declare that are relevant to the content of this article.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Nisa, R., Baba, A.M. A speaker identification-verification approach for noise-corrupted and improved speech using fusion features and a convolutional neural network. Int. j. inf. tecnol. (2024). https://doi.org/10.1007/s41870-024-01877-z
DOI: https://doi.org/10.1007/s41870-024-01877-z