Abstract
Speech-based speaker identification offers a contactless alternative to conventional biometric identification systems. Recent work demonstrates growing research interest in this field and highlights the practical usability of speech for speaker identification across various applications. In this work, we address the limitations of existing state-of-the-art approaches and demonstrate the suitability of convolutional neural networks for speaker identification systems. We examine the use of the spectrogram as an input to these spatial networks and its robustness in the presence of noise. To speed up training (computation) and reduce the memory requirement (storage), the SpectroNet model for speech-based speaker identification is introduced. The proposed system is evaluated on the VoxCeleb1 and Part 1 of the RSR2015 databases. Experimental results show a relative improvement of ~16% (accuracy: 96.21%) with the spectrogram and ~10% (accuracy: 98.92%) with the log Mel spectrogram in identifying the speaker, compared to existing models. With the cochleagram as input, the system achieves an identification accuracy of 99.26%. Analysis of the results shows the applicability of the proposed approach in situations where (i) minimal speech data are available for speaker identification and (ii) the speech data are noisy in nature.
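The time-frequency inputs mentioned above can be illustrated with a minimal sketch: computing a log-scaled spectrogram from a waveform, the kind of two-dimensional representation a CNN such as the one described here would consume. The sampling rate, window length, and hop size below are common choices for speech processing and are assumptions for illustration, not parameters taken from the paper.

```python
# Sketch: log-scaled spectrogram as a CNN input (illustrative parameters).
import numpy as np
from scipy.signal import spectrogram

sr = 16000                        # assumed sampling rate (Hz)
t = np.arange(sr) / sr            # 1 s time axis
x = np.sin(2 * np.pi * 440 * t)   # synthetic stand-in for a speech waveform

# 25 ms windows (400 samples) with a 10 ms hop (160 samples) are typical
f, times, Sxx = spectrogram(x, fs=sr, nperseg=400, noverlap=240)

# Log compression stabilises the dynamic range before feeding a network;
# the small constant avoids log(0) in silent frames.
log_spec = 10 * np.log10(Sxx + 1e-10)
print(log_spec.shape)  # (frequency bins, time frames)
```

A log Mel spectrogram would additionally pool the frequency bins through a Mel-spaced filterbank before the log compression; a cochleagram replaces the filterbank with gammatone filters modelling the auditory periphery.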
Data Availability
No associated data.
Acknowledgements
The authors thank the members of the Speech and Image Processing Laboratory, National Institute of Technology Silchar, for supporting this research. We thank the Editor-in-Chief and the anonymous reviewers for their valuable suggestions.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Saritha, B., Laskar, M.A., Kirupakaran, A.M. et al. Deep Learning-Based End-to-End Speaker Identification Using Time–Frequency Representation of Speech Signal. Circuits Syst Signal Process 43, 1839–1861 (2024). https://doi.org/10.1007/s00034-023-02542-9