
Robust Threshold Selection for Environment Specific Voice in Speaker Recognition


Abstract

False accepts and false rejects are the most vulnerable aspects of speaker recognition and speaker authentication. Speaker verification checks whether a specific speaker's speech is present in a collected set of utterances. The target set is populated beforehand with the claimed speaker's voice samples as well as with other speakers' samples; these other speakers act as impostors within the target set. Speaker verification follows a one-to-one comparison procedure and returns a true or false decision about the presence of the claimed speaker in the target list; in effect, it authenticates the claimed speaker against the target set. Speaker identification, by contrast, infers which subset of target speakers a test utterance is associated with. In both kinds of testing, the decision threshold plays an extremely important role. In a real-world scenario, the predicted score must be adjusted relative to the decision threshold to minimize false accepts and false rejects. This paper shows that environment-based thresholds, namely a lower-level threshold and an upper-level threshold, effectively reduce the equal error rate (EER) when testing environment-specific voice signals. To select robust thresholds, the speaker verification system was tested extensively on a large set of language-independent voice samples collected from different environments. Performance is analysed using Detection Error Tradeoff (DET) plots based on the lower-level and upper-level thresholds predicted for each environment, which affect both the false acceptance and the false rejection rates in real-time testing of environment-centric utterances. The proposed methodology reduces the equal error rate for environment-specific voice signals and helps to enhance audio-visual systems used in the forensic domain.
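To make the threshold terminology concrete, the following is a minimal sketch, not the paper's actual implementation, of how verification scores, a swept decision threshold, and the EER relate, together with a hypothetical two-threshold (lower/upper) decision rule. The score distributions, the fixed margins around the EER threshold, and all function names are assumptions introduced for illustration only.

```python
import numpy as np

def error_rates(scores, labels, threshold):
    """False-accept and false-reject rates at one decision threshold.
    labels: 1 = genuine (target) trial, 0 = impostor trial."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    far = np.mean(scores[labels == 0] >= threshold)  # impostors accepted
    frr = np.mean(scores[labels == 1] < threshold)   # targets rejected
    return far, frr

def equal_error_rate(scores, labels):
    """Sweep the observed scores as candidate thresholds and return the
    threshold where FAR and FRR are closest, together with the EER."""
    best_t, best_gap, eer = None, np.inf, None
    for t in np.unique(scores):
        far, frr = error_rates(scores, labels, t)
        gap = abs(far - frr)
        if gap < best_gap:
            best_t, best_gap, eer = t, gap, (far + frr) / 2.0
    return best_t, eer

def two_level_decision(score, lower, upper):
    """Hypothetical two-threshold rule: reject below the lower threshold,
    accept above the upper threshold, and flag the band in between for
    further (e.g. environment-specific) rectification."""
    if score < lower:
        return "reject"
    if score >= upper:
        return "accept"
    return "undecided"

# Toy usage with simulated scores standing in for one acoustic environment.
rng = np.random.default_rng(0)
impostor = rng.normal(0.3, 0.1, 500)   # impostor trial scores
genuine = rng.normal(0.7, 0.1, 500)    # genuine trial scores
scores = np.concatenate([impostor, genuine])
labels = np.concatenate([np.zeros(500, int), np.ones(500, int)])

eer_threshold, eer = equal_error_rate(scores, labels)
lower, upper = eer_threshold - 0.05, eer_threshold + 0.05  # assumed margins
print(f"EER ~ {eer:.3f} at threshold {eer_threshold:.3f}")
print(two_level_decision(0.52, lower, upper))
```

Sweeping the threshold over the observed scores is also how the (FAR, FRR) pairs of a DET plot are generated; in the paper's setting the lower and upper thresholds would be estimated per environment rather than taken as fixed offsets around the EER point as assumed here.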



Data Availability

Enquiries about data availability should be directed to the authors.


Funding

The authors have not disclosed any funding.

Author information


Corresponding author

Correspondence to Soumen Kanrar.

Ethics declarations

Conflict of interest

The author declares that there is no conflict of interest.

Ethical Approval

This manuscript does not contain any study performed with humans or animals.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Kanrar, S. Robust Threshold Selection for Environment Specific Voice in Speaker Recognition. Wireless Pers Commun 126, 3071–3092 (2022). https://doi.org/10.1007/s11277-022-09852-2
