Software Architectures for Networked Mobile Speech Applications
We examine architectures for mobile speech applications. These applications use speech engines to synthesize audio output and to recognize audio input; a key architectural decision is whether to embed these speech engines on the mobile device or to locate them in the network. While both approaches have advantages, our focus here is on networked speech application architectures. Because the user experience with speech is greatly improved when the speech modality is coupled with a visual modality, mobile speech applications will increasingly tend to be multimodal, so speech architectures must support multimodal user interaction. Good architectures must reflect commercial reality and be economical, efficient, robust, reliable, and scalable. They must also leverage existing commercial ecosystems where possible, and we contend that speech and multimodal applications must build on both the web model of application development and deployment and the large ecosystem that has grown up around the W3C’s web speech standards.
Keywords
Speech Recognition, User Agent, Internet Engineering Task Force, Speech Recognizer, Voice Server