This chapter introduces the basic features of speech recognition over an IP-based network. First, we review typical lossy packet channel models and several speech coders used for voice over IP, under which the performance of a network speech recognition (NSR) system can degrade significantly. Second, we address several techniques for maintaining NSR performance under packet loss. These are classified into client-based and server-based techniques: the former include rate control approaches, forward error correction, and interleaving; the latter include packet loss concealment and ASR-decoder-based concealment. The last part of the chapter explains a new framework for NSR over IP networks. In particular, we present a speech coder optimized for automatic speech recognition (ASR) that provides speech quality comparable to that of the conventional standard speech coders used in IP networks. In addition, we compare the performance of NSR using the ASR-optimized speech coder with that of NSR using a conventional speech coder.

Copyright information

© Springer-Verlag London Limited 2008

Authors and Affiliations

  • Hong Kook Kim (1)
  1. Department of Information and Communications, Gwangju Institute of Science and Technology, Gwangju, Korea
