
Deep Learning-Based End-to-End Speaker Identification Using Time–Frequency Representation of Speech Signal


Abstract

A speech-based speaker identification system is an alternative to conventional contact-based biometric identification systems. Recent works demonstrate growing interest among researchers in this field and highlight the practical usability of speech for speaker identification across various applications. In this work, we address limitations of existing state-of-the-art approaches and demonstrate the suitability of convolutional neural networks for speaker identification. We examine the use of the spectrogram as input to these spatial networks and its robustness in the presence of noise. For faster training (computation) and a reduced memory requirement (storage), we introduce the SpectroNet model for speech-based speaker identification. The proposed system is evaluated on the VoxCeleb1 database and Part 1 of the RSR2015 database. Experimental results show a relative improvement of ~16% (accuracy: 96.21%) with the spectrogram and ~10% (accuracy: 98.92%) with the log Mel spectrogram in identifying the speaker, compared to existing models. The cochleagram yields an identification accuracy of 99.26%. These results demonstrate the applicability of the proposed approach in situations where (i) minimal speech data are available for speaker identification and (ii) the speech data are noisy.
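As a rough illustration of the time-frequency front ends named in the abstract, the sketch below computes a magnitude spectrogram and a log Mel spectrogram with librosa; the window, hop, and Mel-band settings are illustrative assumptions, not the paper's configuration. The cochleagram additionally requires a gammatone (auditory) filterbank, which librosa does not provide, so it is omitted here.

```python
# Hedged sketch (assumed parameters): two of the time-frequency
# representations used as CNN inputs, computed with librosa.
import numpy as np
import librosa

def spectrogram(y, n_fft=512, hop_length=160):
    # Magnitude of the short-time Fourier transform.
    return np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))

def log_mel_spectrogram(y, sr, n_fft=512, hop_length=160, n_mels=64):
    # Mel-filterbank energies on a log (dB) scale.
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    return librosa.power_to_db(mel, ref=np.max)

# "utterance.wav" is a hypothetical input file.
y, sr = librosa.load("utterance.wav", sr=16000)
S = spectrogram(y)              # shape: (1 + n_fft // 2, n_frames)
M = log_mel_spectrogram(y, sr)  # shape: (n_mels, n_frames)
```

Each resulting 2-D array can then be treated as a single-channel image and fed to a convolutional classifier.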


Data Availability

No associated data.

Notes

  1. Available at https://github.com/banalasaritha/Time-Frequency-Representations-of-RSR2015-Database.


Acknowledgements

The authors thank the members of the Speech and Image Processing Laboratory, National Institute of Technology Silchar, for supporting this research. We thank the Editor-in-Chief and the anonymous reviewers for their valuable suggestions.

Author information


Corresponding author

Correspondence to Banala Saritha.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Saritha, B., Laskar, M.A., Kirupakaran, A.M. et al. Deep Learning-Based End-to-End Speaker Identification Using Time–Frequency Representation of Speech Signal. Circuits Syst Signal Process 43, 1839–1861 (2024). https://doi.org/10.1007/s00034-023-02542-9
