Software Architectures for Networked Mobile Speech Applications

  • James C. Ferrans
  • Jonathan Engelsma

We examine architectures for mobile speech applications. These use speech engines for synthesizing audio output and for recognizing audio input; a key architectural decision is whether to embed these speech engines on the mobile device or to locate them in the network. While both approaches have advantages, our focus here is on networked speech application architectures. Because user experience with speech is greatly improved when the speech modality is coupled with a visual modality, mobile speech applications will increasingly be multimodal, and speech architectures must therefore support multimodal user interaction. Good architectures must reflect commercial reality and be economical, efficient, robust, reliable, and scalable. They must leverage existing commercial ecosystems where possible, and we contend that speech and multimodal applications must build on both the web model of application development and deployment and the large ecosystem that has grown up around the W3C’s web speech standards.

Copyright information

© Springer-Verlag London Limited 2008

Authors and Affiliations

  • James C. Ferrans, Motorola Labs, USA
  • Jonathan Engelsma, Motorola Labs, USA