We examine architectures for mobile speech applications. These applications use speech engines to synthesize audio output and to recognize audio input, so a key architectural decision is whether to embed the engines on the mobile device or to locate them in the network. While both approaches have advantages, our focus here is on networked speech application architectures. Because the user experience of speech improves greatly when the speech modality is coupled with a visual modality, mobile speech applications will increasingly be multimodal, and speech architectures must therefore support multimodal user interaction. Good architectures must reflect commercial reality and be economical, efficient, robust, reliable, and scalable. They must also leverage existing commercial ecosystems where possible. We contend that speech and multimodal applications must build on both the web model of application development and deployment and the large ecosystem that has grown up around the W3C’s web speech standards.
© 2008 Springer-Verlag London Limited
About this chapter
Cite this chapter
Ferrans, J.C., Engelsma, J. (2008). Software Architectures for Networked Mobile Speech Applications. In: Automatic Speech Recognition on Mobile Devices and over Communication Networks. Advances in Pattern Recognition. Springer, London. https://doi.org/10.1007/978-1-84800-143-5_13
DOI: https://doi.org/10.1007/978-1-84800-143-5_13
Publisher Name: Springer, London
Print ISBN: 978-1-84800-142-8
Online ISBN: 978-1-84800-143-5
eBook Packages: Computer Science (R0)