International Journal of Speech Technology, Volume 21, Issue 4, pp 975–985

A method to compensate the influence of speech codec in speaker recognition

  • José R. Calvo de Lara
  • Flavio J. Reyes Diaz
  • Gabriel Hernández Sierra
  • Orlando Jimenez Alcazar
Article

Abstract

Recognizing a person by their voice, or “speaker recognition”, is a biometric specialty increasingly used in electronic commerce, electronic banking transactions, and forensic investigations, among others. Speaker recognition relies on the discriminative information contained in a person's speech, and its main challenge is “session variability”: the variability between different speech samples of the same person used for training and evaluation. When speech is transmitted over the internet, for example, the coding–decoding process (“codec”) causes a loss of this information and reduces the effectiveness of speaker recognition. Some methods have been proposed to mitigate this effect. This work studies how much this information is degraded by some commonly used codec types and proposes a solution of our own to compensate for the session variability introduced by the codec. The influence of several codec types on sample quality was first evaluated with a set of synthesized speech samples. Experiments were then carried out with speech samples from international competitions, retransmitted over two different codecs, to verify the effect on speaker recognition effectiveness. Finally, the variability compensation was applied, improving recognition effectiveness, as measured by the equal error rate, by 20.8% for the G.722 codec and 27.8% for the GSM 6.20 codec.
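The abstract reports its results as equal error rate (EER), the operating point at which the false-acceptance and false-rejection rates coincide. The sketch below shows one common way to estimate it from verification scores. It is illustrative only, not the authors' evaluation code: the function names, the score arrays, and the threshold sweep are assumptions. The abstract does not state whether its percentages are relative or absolute, so `relative_improvement` is also only an assumed way of expressing such a gain.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER by sweeping every observed score as a threshold.

    Returns (eer, threshold) at the point where the false-acceptance
    and false-rejection rates are closest to each other.
    """
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.unique(np.concatenate([target, nontarget]))

    best_gap, eer, best_thr = np.inf, 1.0, thresholds[0]
    for thr in thresholds:
        frr = np.mean(target < thr)      # genuine trials rejected
        far = np.mean(nontarget >= thr)  # impostor trials accepted
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer, best_thr = gap, (far + frr) / 2.0, thr
    return eer, best_thr

def relative_improvement(eer_before, eer_after):
    """Relative EER reduction in percent (assumed reporting convention)."""
    return 100.0 * (eer_before - eer_after) / eer_before

# Hypothetical scores, for illustration only
eer, thr = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
```

With real trial lists, the same function would be run once on scores obtained without compensation and once with compensation, and the two EER values compared.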

Keywords

Speaker recognition · Variability compensation · Linear discriminant analysis · i-Vector representation · Voice codec

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • José R. Calvo de Lara (1)
  • Flavio J. Reyes Diaz (1)
  • Gabriel Hernández Sierra (1)
  • Orlando Jimenez Alcazar (1)

  1. Advanced Technologies Applications Center, Cenatav, La Habana, Cuba