Speech Coding and Packet Loss Effects on Speech and Speaker Recognition

  • Laurent Besacier
Part of the Advances in Pattern Recognition book series (ACVPR)

This chapter addresses the speech coding and packet loss problems that arise in network speech recognition, where speech is transmitted (and, most of the time, coded) from a client terminal to a recognition server. The first part describes some commonly used speech coding standards and presents a packet loss model for evaluating different channel degradation conditions in a controlled fashion. The second part evaluates the influence of different speech and audio codecs on the performance of a continuous speech recognition engine. It is shown that MPEG transcoding degrades speech recognition performance at low bit rates, whereas performance remains acceptable for specialized speech coders such as G723. The same system is also evaluated under different simulated and real packet loss conditions; in that case, the significant degradation of automatic speech recognition (ASR) performance is analyzed. The third part presents an overview of the joint effects of compression and packet loss on speech biometrics. In contrast to the ASR task, it is experimentally demonstrated that the adverse effects of packet loss alone are negligible, while the encoding of speech, particularly at a low bit rate, coupled with packet loss, can reduce speaker recognition accuracy considerably. The fourth part discusses these experimental observations and points to robustness approaches.

Copyright information

© Springer-Verlag London Limited 2008

Authors and Affiliations

  • Laurent Besacier (1)
  1. LIG Laboratory, University J. Fourier, Grenoble, France
