Abstract
Speaker identification is the task of determining a speaker's identity from sequential speech (time-series) data. Most modern experimental approaches rely on two families of neural networks: convolutional neural networks (CNNs) and deep neural networks (DNNs). This work presents a speaker-identification model based on a jump-connected (skip-connected) one-dimensional convolutional neural network (1-D CNN) with a focus module (FM). In the proposed model, the 1-D convolutional layers integrated with the FM extract speaker characteristics while reducing heterogeneity in the temporal and spatial domains, allowing faster layer processing. In addition, jump connections between the stacked CNN layers mitigate connectivity glitches, and a joint objective combining softmax loss with smooth L1-norm regularization is introduced to improve efficiency. The proposed network was evaluated on the ELSDSR, TIMIT, NIST, 16,000 PCM, and experimental audio datasets. According to the experimental results, the end-to-end CNN improves the equal error rate (EER) for voiceprint identification by 9.02% relative to baseline approaches. In our experiments, the proposed speaker recognition (SR) model, which we refer to as the deep FM-1D CNN, achieved a high recognition accuracy of 99.21%. The observations further demonstrate that the proposed network is more robust than competing models.
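The abstract names three ingredients: 1-D convolution for feature extraction, a jump (skip) connection between layers, and a joint loss combining softmax with a smooth L1 penalty. The following is a minimal NumPy sketch of how those pieces fit together; every function name, shape, and coefficient here is illustrative, not the authors' implementation.

```python
import numpy as np

def conv1d(x, w):
    # 'valid' 1-D convolution of signal x with kernel w
    k = len(w)
    return np.array([np.dot(x[i:i + k], w) for i in range(len(x) - k + 1)])

def softmax(z):
    # numerically stable softmax over class logits
    e = np.exp(z - z.max())
    return e / e.sum()

def smooth_l1(w, beta=1.0):
    # Huber-style smooth L1 penalty: quadratic near zero, linear beyond beta
    a = np.abs(w)
    return np.where(a < beta, 0.5 * a ** 2 / beta, a - 0.5 * beta).sum()

def forward(x, w, proj):
    h = conv1d(x, w)
    # jump (skip) connection: add a cropped copy of the input back in
    h = h + x[:len(h)]
    # project features to class logits and normalize to probabilities
    return softmax(proj @ h)

def joint_loss(probs, y, w, lam=0.01):
    # softmax cross-entropy plus lambda-weighted smooth L1 regularization
    return -np.log(probs[y] + 1e-12) + lam * smooth_l1(w)
```

A real model would stack many such conv/FM blocks and learn `w` and `proj` by gradient descent; the sketch only shows the data flow and the combined regularization that the abstract describes.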
Availability of data and materials
The authors do not have permission to share data.
Funding
Not applicable.
Author information
Contributions
Karthikeyan Velayuthapandian contributed to conceptualisation, methodology/study design, software, validation, formal analysis, investigation, resources, data curation, writing—original draft, writing—review and editing, and visualisation. Suja Priyadharsini Subramoniam contributed to conceptualisation, validation, formal analysis, investigation, resources, writing—review and editing, visualisation, and supervision.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Ethical approval
Not applicable.
About this article
Cite this article
Velayuthapandian, K., Subramoniam, S.P. A focus module-based lightweight end-to-end CNN framework for voiceprint recognition. SIViP 17, 2817–2825 (2023). https://doi.org/10.1007/s11760-023-02500-7