Network, Distributed and Embedded Speech Recognition: An Overview

  • Zheng-Hua Tan
  • Imre Varga
Part of the Advances in Pattern Recognition book series (ACVPR)

As mobile devices become smaller and more pervasive, the design of efficient user interfaces is rapidly developing into a major issue. The expectation for speech-centric interfaces has stimulated great interest in deploying automatic speech recognition (ASR) on platforms such as mobile phones, PDAs and automobiles. Mobile devices have limited computational power, memory size and battery life, whereas state-of-the-art ASR systems are computationally intensive. To circumvent these restrictions, a great deal of effort has been spent on enabling efficient ASR implementation on embedded platforms, primarily through fixed-point arithmetic and algorithm optimisation for low computational complexity and memory footprint. The restrictions can also be largely bypassed from the architecture side: distributed speech recognition (DSR) splits ASR processing into client-based feature extraction and server-based recognition. The relief of computational burden on mobile devices, however, comes at the cost of network-induced degradation and additional components such as feature quantisation, error recovery and concealment. An alternative to DSR is network speech recognition, which uses a conventional speech coder to transmit speech from client to server. Over the past decade, these areas have undergone substantial development. This chapter gives a comprehensive overview of the areas and discusses the pros and cons of the different approaches. The optimal choice depends on the complexity of the ASR components, the resources available on the device and in the network, and the location of the associated applications.


Copyright information

© Springer-Verlag London Limited 2008

Authors and Affiliations

  • Zheng-Hua Tan (1)
  • Imre Varga (2)
  1. Department of Electronic Systems, Aalborg University, Aalborg, Denmark
  2. Corporate Technology, Siemens AG, Germany
