
A focus module-based lightweight end-to-end CNN framework for voiceprint recognition

  • Original Paper
  • Published in: Signal, Image and Video Processing

Abstract

Speaker identification is the task of determining a speaker's identity from a sequence of time-series speech data. Most modern experimental approaches rely on one of two families of neural networks: convolutional neural networks (CNNs) and deep neural networks (DNNs). This work presents a CNN model for speaker identification built on a jump-connected one-dimensional convolutional neural network (1-D CNN) with a focus module (FM). In the proposed model, 1-D convolutional layers integrated with the FM extract speaker characteristics while reducing heterogeneity in the temporal and spatial domains, allowing faster layer processing. Furthermore, layered jump (skip) connections in the CNN mitigate connectivity glitches, and a joint objective combining softmax loss with smooth L1-norm regularisation is introduced to improve efficiency. The proposed network was evaluated on the ELSDSR, TIMIT, NIST, 16,000 PCM, and experimental audio datasets. According to the experimental results, the end-to-end CNN improves the equal error rate (EER) for voiceprint identification by 9.02% over baseline approaches. In experiments, the proposed speaker recognition (SR) model, referred to as the deep FM-1D CNN, achieved a high recognition accuracy of 99.21%. Moreover, the observations demonstrate that the proposed network model is more robust than other models.
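
To make the architectural idea concrete, here is a minimal PyTorch sketch of a 1-D convolutional block whose output is re-weighted by a focus (attention-style) module and merged back through a jump connection. The abstract does not specify the exact layer configuration, so the channel sizes, kernel width, and squeeze ratio below are illustrative assumptions, not the authors' settings.

```python
import torch
import torch.nn as nn

class FocusModule(nn.Module):
    """Channel re-weighting ("focus") gate for 1-D feature maps, in the style
    of squeeze-and-excitation. The squeeze ratio is an assumption."""
    def __init__(self, channels: int, ratio: int = 8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(1)          # (B, C, T) -> (B, C, 1)
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // ratio),
            nn.ReLU(inplace=True),
            nn.Linear(channels // ratio, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.gate(self.pool(x).squeeze(-1))      # per-channel weights in (0, 1)
        return x * w.unsqueeze(-1)                   # emphasise informative channels

class JumpConnectedBlock(nn.Module):
    """1-D conv block with a focus module and a residual ("jump") connection."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2)
        self.bn = nn.BatchNorm1d(channels)
        self.focus = FocusModule(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Jump connection: input is added back after conv + focus re-weighting.
        return self.act(x + self.focus(self.bn(self.conv(x))))

# Example: a batch of 4 utterances, 64 feature channels, 200 frames.
features = torch.randn(4, 64, 200)
out = JumpConnectedBlock(64)(features)               # shape preserved: (4, 64, 200)
```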

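The joint objective and the reported metric can be sketched in the same hedged spirit. The abstract states that softmax loss is combined with smooth L1-norm regularisation and that performance is reported as EER, but it does not fix the details; below, the weight `lam` and the choice of which parameters receive the smooth-L1 penalty are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def joint_loss(logits: torch.Tensor, labels: torch.Tensor,
               weights: torch.Tensor, lam: float = 1e-4) -> torch.Tensor:
    """Softmax cross-entropy plus a smooth-L1 penalty on a weight tensor.
    `lam` and the penalised parameters are illustrative assumptions."""
    ce = F.cross_entropy(logits, labels)             # softmax loss
    reg = F.smooth_l1_loss(weights, torch.zeros_like(weights), reduction="sum")
    return ce + lam * reg

def equal_error_rate(scores: torch.Tensor, labels: torch.Tensor) -> float:
    """EER: the operating point where the false-acceptance rate (impostors
    accepted) equals the false-rejection rate (genuine speakers rejected).
    `labels` are 1 for genuine trials, 0 for impostor trials. Simple O(n^2)
    threshold sweep; adequate for a sketch."""
    genuine, impostor = scores[labels == 1], scores[labels == 0]
    best_gap, eer = float("inf"), 1.0
    for t in torch.sort(scores).values:
        far = (impostor >= t).float().mean().item()  # false acceptance rate
        frr = (genuine < t).float().mean().item()    # false rejection rate
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```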

Availability of data and materials

The authors do not have permission to share data.

Funding

Not applicable.

Author information

Contributions

Karthikeyan Velayuthapandian contributed to conceptualisation, methodology/study design, software, validation, formal analysis, investigation, resources, data curation, writing—original draft, writing—review and editing, and visualisation. Suja Priyadharsini Subramoniam contributed to conceptualisation, validation, formal analysis, investigation, resources, writing—review and editing, visualisation, and supervision.

Corresponding author

Correspondence to Karthikeyan Velayuthapandian.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Ethical approval

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Velayuthapandian, K., Subramoniam, S.P. A focus module-based lightweight end-to-end CNN framework for voiceprint recognition. SIViP 17, 2817–2825 (2023). https://doi.org/10.1007/s11760-023-02500-7
