Skip to main content
Log in

A speaker identification-verification approach for noise-corrupted and improved speech using fusion features and a convolutional neural network

  • Original Research
  • Published:
International Journal of Information Technology Aims and scope Submit manuscript

Abstract

The degraded quality of the speech input signal has a negative impact on speaker recognition techniques. We address the issues of speaker recognition from noise-corrupted audio signals in the presence of four noise variants, including factory noise, car noise, street traffic noise, and voice babble noise, as well as noise-suppressed enhanced speech. The goal of this research is to create a speaker recognition algorithm that is resistant to a diverse spectrum of speech capture quality, background scenarios, and interferences. In this work, three distinct features, including Mel Frequency Cepstral Coefficients (MFCC), Normalized Pitch Frequency (NPF), and Normalized Phase Cepstral Coefficients (NPCC) are combined. The analysis that MFCC, NPF, and NPCC illustrate distinct features of speech underlies our observation. A Convolutional Neural Network (CNN) is used in our speaker recognition strategy to learn speaker-dependent attributes from fragments of Mel features, normalized pitch features, and phase cepstral features of clean speech, corrupted speech, and enhanced speech. The performance is measured using the ITU-T test signals and compared to previous algorithms at different Signal-to-Noise-Ratios of 0 dB, 5 dB, 10 dB, and 15 dB. For enhanced speech, all three features, MFCC, NPF, and NPCC, provided productive speaker identification and verification performance.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5

Similar content being viewed by others

Data availability

The datasets generated during and/or analysed during the current study are available in the [ITU-T Test Signals for Telecommunication Systems] repository [https://www.itu.int/net/itu-t/sigdb/genaudio/Pseries.htm].

Notes

  1. https://www.itu.int/net/itu-t/sigdb/genaudio/Pseries.htm.

  2. https://svr-www.eng.cam.ac.uk/comp.speech/Section1/Data/noisex.html.

References

  1. Jayanna HS, Prasanna SM (2009) Analysis, feature extraction, modeling and testing techniques for speaker recognition. IETE Tech Rev 26(3):181–190. https://doi.org/10.4103/0256-4602.50702

    Article  Google Scholar 

  2. Singh N, Khan RA, Shree R (2012) MFCC and prosodic feature extraction techniques: a comparative study. Int J Comput Appl 54(1):9–13

    Google Scholar 

  3. Hasan M R, Jamil M, Rabbani MG, Rahman MS (2004) Speaker identification using Mel frequency cepstral coefficients. In: ICECE international conference on electrical & computer engineering, December 2004, pp 565–568

  4. Krishnamurthy N, Hansen JH (2009) Babble noise: modeling, analysis, and applications. IEEE Trans Audio Speech Lang Process 17(7):1394–1407. https://doi.org/10.1109/TASL.2009.2015084

    Article  Google Scholar 

  5. Yutai W, Bo L, Xiaoqing J et al (2009) Speaker recognition based on dynamic MFCC parameters. In: IEEE international conference on image analysis and signal processing, April 2009. pp 406–409. https://doi.org/10.1109/IASP.2009.5054638

  6. Reynolds DA, Quatieri TF, Dunn RB (2000) Speaker verification using adapted Gaussian mixture models. Digit Signal Process 10(1–3):19–41. https://doi.org/10.1006/dspr.1999.0361

    Article  Google Scholar 

  7. Campbell WM, Campbell JP, Reynolds DA et al (2006) Support vector machines for speaker and language recognition. Comput Speech Lang 20(2–3):210–229. https://doi.org/10.1016/j.csl.2005.06.003

    Article  Google Scholar 

  8. Campbell WM, Sturim DE, Reynolds DA (2006) Support vector machines using GMM supervectors for speaker verification. IEEE Signal Process Lett 13(5):308–311. https://doi.org/10.1109/LSP.2006.870086

    Article  Google Scholar 

  9. Dehak N, Dehak R, Glass JR et al (2010) Cosine similarity scoring without score normalization techniques. In: Odyssey, June 2010. p 15

  10. Daqrouq K, Tutunji TA (2015) Speaker identification using vowels features through a combined method of formants, wavelets, and neural network classifiers. Appl Soft Comput 27:231–239. https://doi.org/10.1016/j.asoc.2014.11.016

    Article  Google Scholar 

  11. Ajmera PK, Jadhav DV, Holambe RS (2011) Text-independent speaker identification using Radon and discrete cosine transforms based features from speech spectrogram. Pattern Recognit 44(10–11):2749–2759. https://doi.org/10.1016/j.patcog.2011.04.009

    Article  Google Scholar 

  12. Tirumala SS, Shahamiri SR, Garhwal AS, Wang R (2017) Speaker identification features extraction methods: a systematic review. Expert Syst Appl 90:250–271. https://doi.org/10.1016/j.eswa.2017.08.015

    Article  Google Scholar 

  13. Jia Y, Chen X, Yu J et al (2021) Speaker recognition based on characteristic spectrograms and an improved self-organizing feature map neural network. Complex Intell Syst 7:1749–1757. https://doi.org/10.1007/s40747-020-00172-1

    Article  Google Scholar 

  14. Richardson F, Reynolds D, Dehak N (2015) Deep neural network approaches to speaker and language recognition. IEEE Signal Process Lett 22(10):1671–1675. https://doi.org/10.1109/LSP.2015.2420092

    Article  Google Scholar 

  15. Ahmad KS, Thosar AS, Nirmal JH, Pande VS (2015) A unique approach in text independent speaker recognition using MFCC feature sets and probabilistic neural network. In: IEEE eighth international conference on advances in pattern recognition, January 2015. pp 1–6. https://doi.org/10.1109/ICAPR.2015.7050669

  16. Soleymanpour M, Marvi H (2017) Text-independent speaker identification based on selection of the most similar feature vectors. Int J Speech Technol 20:99–108. https://doi.org/10.1007/s10772-016-9385-x

    Article  Google Scholar 

  17. Liu Z, Wu Z, Li T, Li J, Shen C (2018) GMM and CNN hybrid method for short utterance speaker recognition. IEEE Trans Industr Inform 14(7):3244–3252. https://doi.org/10.1109/TII.2018.2799928

    Article  Google Scholar 

  18. Ali H, Tran SN, Benetos E et al (2018) Speaker recognition with hybrid features from a deep belief network. Neural Comput Appl 29:13–19. https://doi.org/10.1007/s00521-016-2501-7

    Article  Google Scholar 

  19. Siam AI, El-khobby HA, Elnaby MMA et al (2019) A novel speech enhancement method using Fourier series decomposition and spectral subtraction for robust speaker identification. Wirel Pers Commun 108:1055–1068. https://doi.org/10.1007/s11277-019-06453-4

    Article  Google Scholar 

  20. Kenny P (2010) Bayesian speaker verification with, heavy tailed priors. In: Proceedings Odyssey, 2010

  21. Taherian H, Wang ZQ, Chang J, Wang D (2020) Robust speaker recognition based on single-channel and multi-channel speech enhancement. IEEE/ACM Trans Audio Speech Lang Process 28:1293–1302. https://doi.org/10.1109/TASLP.2020.2986896

    Article  Google Scholar 

  22. El-Moneim SA, Nassar MA, Dessouky MI et al (2020) Text-independent speaker recognition using LSTM-RNN and speech enhancement. Multimedia Tools Appl 79:24013–24028. https://doi.org/10.1007/s11042-019-08293-7

    Article  Google Scholar 

  23. Hourri S, Nikolov NS, Kharroubi J (2021) Convolutional neural network vectors for speaker recognition. Int J Speech Technol 24:389–400. https://doi.org/10.1007/s10772-021-09795-2

    Article  Google Scholar 

  24. Juneja K (2022) Two-level noise robust and block featured PNN model for speaker recognition in real environment. Wirel Pers Commun 125(4):3741–3771. https://doi.org/10.1007/s11277-022-09734-7

    Article  Google Scholar 

  25. Hamidi M, Zealouk O, Satori H et al (2023) COVID-19 assessment using HMM cough recognition system. Int J Inf Technol 15(1):193–201. https://doi.org/10.1007/s41870-022-01120-7

    Article  Google Scholar 

  26. Al-Shakarchy ND, Obayes HK, Abdullah ZN (2023) Person identification based on voice biometric using deep neural network. Int J Inf Technol 15(2):789–795. https://doi.org/10.1007/s41870-022-01142-1

    Article  Google Scholar 

  27. Radha K, Bansal M (2023) Closed-set automatic speaker identification using multi-scale recurrent networks in non-native children. Int J Inf Technol 15(3):1375–1385. https://doi.org/10.1007/s41870-023-01224-8

    Article  Google Scholar 

  28. Chelali FZ (2023) Bimodal fusion of visual and speech data for audiovisual speaker recognition in noisy environment. Int J Inf Technol. https://doi.org/10.1007/s41870-023-01291-x

    Article  Google Scholar 

  29. Nakagawa S, Wang L, Ohtsuka S (2011) Speaker identification and verification by combining MFCC and phase information. IEEE Trans Audio Speech Lang Process 20(4):1085–1095. https://doi.org/10.1109/TASL.2011.2172422

    Article  Google Scholar 

  30. Wu Z, Chng ES, Li H (2012) Detecting converted speech and natural speech for anti-spoofing attack in speaker recognition. In: Thirteenth annual conference of the international speech communication association, 2012

  31. ITU-T P-series recommendations. https://www.itu.int/net/itu-t/sigdb/genaudio/Pseries.htm. Accessed 26 July 2020

  32. Gibiansky A, Arik S, Diamos G et al (2017) Deep voice 2: multi-speaker neural text-to-speech. Adv Neural Inf Process 30

  33. Nisa R, Showkat H, Baba A (2023) The speech signal enhancement approach with multiple sub-frames analysis for complex magnitude and phase spectrum recompense. Expert Syst Appl. https://doi.org/10.1016/j.eswa.2023.120746

    Article  Google Scholar 

  34. Paliwal K, Wójcicki K (2008) Effect of analysis window duration on speech intelligibility. IEEE Signal Process Lett 15:785–788. https://doi.org/10.1109/LSP.2008.2005755

    Article  Google Scholar 

  35. Varga A, Steeneken HJ (1993) Assessment for automatic speech recognition: II. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems. Speech Commun 12(3):247–251. https://doi.org/10.1016/0167-6393(93)90095-3

    Article  Google Scholar 

Download references

Funding

No funding was received to assist with the preparation of this manuscript.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Rohun Nisa.

Ethics declarations

Conflict of interest

The authors have no competing interests to declare that are relevant to the content of this article.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Nisa, R., Baba, A.M. A speaker identification-verification approach for noise-corrupted and improved speech using fusion features and a convolutional neural network. Int. j. inf. tecnol. (2024). https://doi.org/10.1007/s41870-024-01877-z

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s41870-024-01877-z

Keywords

Navigation