International Journal of Speech Technology, Volume 18, Issue 4, pp 633–648

Noise robust speaker verification via the fusion of SNR-independent and SNR-dependent PLDA

  • Xiaomin Pang
  • Man-Wai Mak

Abstract

While i-vectors with probabilistic linear discriminant analysis (PLDA) can achieve state-of-the-art performance in speaker verification, the mismatch caused by acoustic noise remains a key factor affecting system performance. In this paper, a fusion system that combines a multi-condition signal-to-noise ratio (SNR)-independent PLDA model and a mixture of SNR-dependent PLDA models is proposed to make speaker verification systems more noise robust. First, the whole range of SNR over which a verification system is expected to operate is divided into several narrow ranges. Then, a set of SNR-dependent PLDA models, one for each narrow SNR range, is trained. During verification, the SNR of the test utterance determines which of the SNR-dependent PLDA models is used for scoring. To further enhance performance, the SNR-dependent and SNR-independent models are fused using linear and logistic regression fusion. The performance of the fusion system and the SNR-dependent system is evaluated on the NIST 2012 speaker recognition evaluation for both noisy and clean conditions. Results show that a mixture of SNR-dependent PLDA models performs better in both clean and noisy conditions. It was also found that the fusion system is more robust than conventional i-vector/PLDA systems under noisy conditions.
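
The model selection and score fusion steps described in the abstract can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the SNR bin edges, the fusion weights, and the stand-in scoring functions are all hypothetical, since the abstract gives none of these values. A real system would use trained PLDA models and fusion weights fitted on development data.

```python
import bisect
import math

# Hypothetical SNR bin edges in dB; three edges define four narrow SNR ranges.
# The abstract does not state the actual ranges used.
SNR_EDGES_DB = [5.0, 10.0, 15.0]


def select_model_index(snr_db, edges=SNR_EDGES_DB):
    """Return the index of the narrow SNR range that contains snr_db."""
    return bisect.bisect_right(edges, snr_db)


def snr_dependent_score(snr_db, scorers, enrol, test):
    """Score a trial with the model matched to the test utterance's SNR.

    `scorers` holds one callable per SNR range; each is a placeholder
    for a trained SNR-dependent PLDA scoring function.
    """
    scorer = scorers[select_model_index(snr_db)]
    return scorer(enrol, test)


def linear_fusion(s_dep, s_indep, weight=0.5):
    """Weighted sum of SNR-dependent and SNR-independent scores.

    The weight is an assumption; it would normally be tuned on held-out data.
    """
    return weight * s_dep + (1.0 - weight) * s_indep


def logistic_fusion(s_dep, s_indep, w_dep, w_indep, bias):
    """Affine score combination of the form used in logistic regression fusion.

    The weights and bias are assumed to have been fitted beforehand by
    logistic regression on development scores.
    """
    return w_dep * s_dep + w_indep * s_indep + bias


def sigmoid(x):
    """Map a fused score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    """Toy similarity used by the placeholder scorers below."""
    return sum(x * y for x, y in zip(a, b))


# Placeholder scorers, one per SNR range (NOT real PLDA models).
toy_scorers = [lambda e, t, k=k: k * dot(e, t) for k in range(4)]
```

Here a test utterance with an estimated SNR of 12 dB would fall in the third range, so `snr_dependent_score(12.0, toy_scorers, enrol, test)` would call `toy_scorers[2]`. The resulting score could then be combined with the SNR-independent score through either fusion function.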

Keywords

  • Speaker verification
  • i-Vectors
  • Probabilistic LDA
  • NIST 2012 SRE
  • Noise robustness
  • Fusion

Notes

Acknowledgments

This work was in part supported by The Hong Kong Research Grant Council (Grant Nos. PolyU 152117/14E and PolyU 152068/15E) and The Hong Kong Polytechnic University (Grant No. 4-ZZCX).

Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  1. Centre for Signal Processing, Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Kowloon, Hong Kong
