
Deep Learning-Based End-to-End Speaker Identification Using Time–Frequency Representation of Speech Signal


Abstract

A speech-based speaker identification system is an alternative to conventional contact-based biometric identification systems. Recent works demonstrate growing interest among researchers in this field and highlight the practical usability of speech for speaker identification across various applications. In this work, we address limitations of existing state-of-the-art approaches and demonstrate the suitability of convolutional neural networks for speaker identification. We examine the use of the spectrogram as input to these spatial networks and its robustness in the presence of noise. For faster training (computation) and a reduced memory requirement (storage), we introduce the SpectroNet model for speech-based speaker identification. The proposed system is evaluated on the VoxCeleb1 database and Part 1 of the RSR2015 database. Experimental results show a relative improvement of ~16% (accuracy: 96.21%) with the spectrogram and ~10% (accuracy: 98.92%) with the log Mel spectrogram in identifying the speaker, compared to existing models. The cochleagram yields an identification accuracy of 99.26%. These results demonstrate the applicability of the proposed approach in situations where (i) minimal speech data are available for speaker identification and (ii) the speech data are noisy.
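As a rough illustration of the time-frequency front ends named in the abstract, the sketch below computes a magnitude spectrogram and a log Mel spectrogram with librosa; the window, hop, and Mel-band settings are illustrative assumptions, not the paper's configuration. The cochleagram additionally requires a gammatone (auditory) filterbank, which librosa does not provide, so it is omitted here.

```python
# Hedged sketch (assumed parameters): two of the time-frequency
# representations used as CNN inputs, computed with librosa.
import numpy as np
import librosa

def spectrogram(y, n_fft=512, hop_length=160):
    # Magnitude of the short-time Fourier transform.
    return np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))

def log_mel_spectrogram(y, sr, n_fft=512, hop_length=160, n_mels=64):
    # Mel-filterbank energies on a log (dB) scale.
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    return librosa.power_to_db(mel, ref=np.max)

# "utterance.wav" is a hypothetical input file.
y, sr = librosa.load("utterance.wav", sr=16000)
S = spectrogram(y)              # shape: (1 + n_fft // 2, n_frames)
M = log_mel_spectrogram(y, sr)  # shape: (n_mels, n_frames)
```

Each resulting 2-D array can then be treated as a single-channel image and fed to a convolutional classifier.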


Data Availability

No associated data.

Notes

  1. Available at https://github.com/banalasaritha/Time-Frequency-Representations-of-RSR2015-Database.


Acknowledgements

The authors thank the members of the Speech and Image Processing Laboratory, National Institute of Technology Silchar, for supporting this research. We thank the Editor-in-Chief and the anonymous reviewers for their valuable suggestions.

Author information


Corresponding author

Correspondence to Banala Saritha.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Saritha, B., Laskar, M.A., Kirupakaran, A.M. et al. Deep Learning-Based End-to-End Speaker Identification Using Time–Frequency Representation of Speech Signal. Circuits Syst Signal Process 43, 1839–1861 (2024). https://doi.org/10.1007/s00034-023-02542-9
