Speaker verification under degraded condition: a perceptual study

Abstract

This study analyzes the effect of degradation on human and automatic speaker verification (SV). The perceptual test is conducted with subjects who have prior knowledge of speaker verification. An automatic SV system is developed using Mel-frequency cepstral coefficients (MFCC) and a Gaussian mixture model (GMM). Human and automatic SV performances are compared for clean training data and different degraded test conditions. The perceptual cues that the human subjects used as speaker-specific information are investigated, and their importance under degraded conditions is highlighted. The difference in the nature of the human and automatic SV tasks is investigated in terms of falsely accepted and falsely rejected speech pairs. Speech signals are reconstructed in clean and degraded conditions by highlighting different speaker-specific information and are compared through a perceptual test. Finally, human and automatic speaker verification are discussed, and possible ways of improving automatic SV performance under degraded conditions are suggested.
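The comparison in terms of falsely accepted and falsely rejected speech pairs can be illustrated with a minimal sketch. It assumes each verification trial yields a numeric score and is accepted when that score reaches a threshold. The function name `error_rates`, the threshold, and the scores are hypothetical and chosen only for illustration; they are not the study's data or its scoring procedure.

```python
from typing import List, Tuple


def error_rates(
    scores: List[float], is_target: List[bool], threshold: float
) -> Tuple[float, float]:
    """Return (false acceptance rate, false rejection rate) at a threshold.

    A trial is accepted when its score is >= threshold.
    is_target[i] is True when trial i pairs speech from the same speaker.
    """
    if len(scores) != len(is_target):
        raise ValueError("scores and is_target must have the same length")

    n_target = sum(is_target)
    n_impostor = len(is_target) - n_target
    if n_target == 0 or n_impostor == 0:
        raise ValueError("need at least one target and one impostor trial")

    # Impostor trials that were accepted.
    false_accepts = sum(
        1 for s, t in zip(scores, is_target) if not t and s >= threshold
    )
    # Target trials that were rejected.
    false_rejects = sum(
        1 for s, t in zip(scores, is_target) if t and s < threshold
    )
    return false_accepts / n_impostor, false_rejects / n_target


# Invented scores: four same-speaker trials followed by four impostor trials.
scores = [2.1, 0.8, -0.3, 1.2, 1.5, -1.2, 0.1, -0.7]
is_target = [True, True, True, True, False, False, False, False]

far, frr = error_rates(scores, is_target, threshold=0.5)
print(f"FAR = {far:.2f}, FRR = {frr:.2f}")
```

The same counting applies to listener decisions if each human response is recorded as accept or reject, which is one way the two kinds of error could be set side by side.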

Keywords

Speaker information; Speaker verification; Degraded condition; Human vs automatic

Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  1. Department of Electronics and Electrical Engineering, Indian Institute of Technology Guwahati, Guwahati, India
