Abstract
False accepts and false rejects are the most vulnerable points of the speaker recognition and speaker authentication process. Speaker verification checks whether a specific speaker's speech signal is present in a collected set of utterances. The target set is populated in advance with the claimed speaker's voice signal as well as voice signals from other speakers; those other speakers act as impostors in the target set. Speaker verification follows a one-to-one comparison procedure and concludes, true or false, whether the claimed speaker exists in the target list; in this sense it authenticates the presence of the claimed speaker in the target set. Speaker recognition is the process of inferring the association of the claimed speaker's utterance with a sub-list of target speakers. In every type of testing, whether speaker verification or speaker identification, the decision threshold plays a crucial role. In a real-world scenario, the predicted score must be further adjusted relative to the decision threshold to minimize false accepts and false rejects. This paper shows that environment-based thresholds, typically a lower-level threshold and an upper-level threshold, effectively reduce the equal error rate (EER) when testing environment-specific voice signals. In the simulation of speaker verification for robust threshold selection, the system was tested extensively on a large set of language-independent voice samples collected from different environments. The performance analysis is conducted using Detection Error Tradeoff (DET) plots based on the predicted lower-level and upper-level thresholds obtained for each specific environment. These thresholds affect both false acceptance and false rejection in real-time testing of environment-centric utterances.
This work proposes a novel methodology to reduce the equal error rate (EER) for environment-specific voice signals. It focuses on environment-based thresholds, typically a lower-level threshold and an upper-level threshold, that reduce the EER during the testing of environment-specific voice signals. The work helps to enhance audio-visual systems used in the forensic domain.
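To make the role of the decision threshold concrete, the following is a minimal sketch of how the false accept rate (FAR), false reject rate (FRR), and an EER estimate can be computed from verification scores. It assumes that higher scores indicate a stronger match and that a trial is accepted when its score is at or above the threshold. The function names and the score values are hypothetical illustrations and are not taken from the paper; the paper's lower-level and upper-level threshold procedure is not reproduced here.

```python
def error_rates(target_scores, impostor_scores, threshold):
    """Return (FAR, FRR) when trials scoring >= threshold are accepted."""
    # False accept: an impostor trial whose score reaches the threshold.
    far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    # False reject: a genuine (target) trial whose score falls below it.
    frr = sum(s < threshold for s in target_scores) / len(target_scores)
    return far, frr


def equal_error_rate(target_scores, impostor_scores):
    """Try every observed score as a candidate threshold and return
    (threshold, EER estimate) at the point where |FAR - FRR| is smallest."""
    best = None
    for t in sorted(set(target_scores) | set(impostor_scores)):
        far, frr = error_rates(target_scores, impostor_scores, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # Report the midpoint of FAR and FRR as the EER estimate.
            best = (gap, t, (far + frr) / 2)
    return best[1], best[2]


# Made-up scores standing in for one recording environment (assumption).
targets = [0.9, 0.8, 0.7, 0.6, 0.4]
impostors = [0.5, 0.3, 0.2, 0.1, 0.05]

threshold, eer = equal_error_rate(targets, impostors)
print(f"threshold={threshold}, EER={eer:.2f}")
```

Running the same computation separately on scores from each environment would give one operating point per environment, which is the kind of per-environment comparison that DET plots summarize.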
Data Availability
Enquiries about data availability should be directed to the authors.
Funding
The authors have not disclosed any funding.
Ethics declarations
Conflict of interest
The author declares that there is no conflict of interest.
Ethical Approval
This manuscript does not contain any study performed with humans or animals.
Cite this article
Kanrar, S. Robust Threshold Selection for Environment Specific Voice in Speaker Recognition. Wireless Pers Commun 126, 3071–3092 (2022). https://doi.org/10.1007/s11277-022-09852-2
DOI: https://doi.org/10.1007/s11277-022-09852-2