Abstract
This paper describes the World Wide Web Consortium’s (W3C) Multimodal Architecture and Interfaces (MMI Architecture) standard, an architecture and communications protocol that enables a wide variety of independent modalities to be integrated into multimodal applications. By encapsulating the functionalities of modality components and requiring all control information to go through the Interaction Manager, the MMI Architecture simplifies integrating components from multiple sources.
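As a concrete illustration of the control flow the abstract describes, the sketch below builds an MMI StartRequest life-cycle event as an XML message, of the kind an Interaction Manager sends to a modality component. The element and attribute names (`mmi`, `startRequest`, `context`, `source`, `target`, `requestID`) and the namespace follow the MMI Architecture specification; the transport and the component URIs are illustrative assumptions, not part of the standard.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the MMI Architecture specification.
MMI_NS = "http://www.w3.org/2008/04/mmi-arch"

def start_request(context, source, target, request_id):
    """Build an MMI StartRequest life-cycle event as an XML string.

    The field names follow the MMI Architecture spec; the URIs the
    caller passes in are illustrative, not defined by the standard.
    """
    ET.register_namespace("mmi", MMI_NS)
    root = ET.Element(f"{{{MMI_NS}}}mmi", {"version": "1.0"})
    ET.SubElement(root, f"{{{MMI_NS}}}startRequest", {
        "context": context,       # identifies the interaction session
        "source": source,         # address of the Interaction Manager
        "target": target,         # address of the modality component
        "requestID": request_id,  # pairs this request with its StartResponse
    })
    return ET.tostring(root, encoding="unicode")

msg = start_request(
    context="ctx-1",
    source="http://example.com/im",         # hypothetical IM address
    target="http://example.com/speech-mc",  # hypothetical speech MC address
    request_id="req-1",
)
print(msg)
```

Because every such event passes through the Interaction Manager, a modality component never needs to know which other components exist, only how to answer life-cycle events addressed to it.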
Notes
For example, the StartRequest event might be mapped to a “startListening” method used by a modality-specific API.
In this example, we assume that the speech recognition component provides an interpretation of the input, in addition to the literal tokens of input, to allow for the user to express this request in other words, such as “Tell me about today’s weather”, or even “Will I need my umbrella?” However, the architecture supports interpreting the user’s input with a separate natural language understanding MC.
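The mapping these notes describe can be sketched as a thin adapter around a modality-specific API. Everything here is hypothetical — the `SpeechEngine` class, its `start_listening` method, and the canned recognition result stand in for whatever API a real modality component wraps — and the reply is compressed to a single notification rather than the full StartResponse/DoneNotification exchange the specification defines.

```python
class SpeechEngine:
    """Stand-in for a modality-specific speech API (hypothetical)."""

    def start_listening(self):
        # A real engine would capture and recognize audio; here we return
        # a canned result carrying both the literal tokens and an
        # interpretation, as the note describes.
        return {
            "tokens": "will I need my umbrella",
            "interpretation": {"intent": "weather_query", "day": "today"},
        }

class SpeechModalityComponent:
    """Adapter mapping MMI life-cycle events onto the engine's API."""

    def __init__(self, engine):
        self.engine = engine

    def handle_event(self, event_name, request_id):
        if event_name == "StartRequest":
            # StartRequest is mapped to the engine's startListening method.
            result = self.engine.start_listening()
            return {
                "event": "DoneNotification",
                "requestID": request_id,
                "data": result,
            }
        # Other life-cycle events (PrepareRequest, CancelRequest, ...)
        # would be mapped to the engine's API in the same way.
        return {"event": "StatusResponse", "requestID": request_id}

mc = SpeechModalityComponent(SpeechEngine())
reply = mc.handle_event("StartRequest", "req-42")
print(reply["event"])  # DoneNotification
```

If interpretation were instead delegated to a separate natural language understanding MC, the adapter would return only the literal tokens and the Interaction Manager would route them onward.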
Acknowledgments
The W3C Multimodal Architecture and Interfaces and EMMA specifications represent the work of many individuals who have participated in the Multimodal Interaction Working Group. In particular, I would like to acknowledge the work of the following authors of the MMI Architecture and EMMA specifications and related documents: Kazuyuki Ashimura, Jim Barnett, Paolo Baggia, Michael Bodell, Daniel C. Burnett, Jerry Carter, Michael Johnston, Nagesh Kharidi, Ingmar Kliche, Jim Larson, Raj Tumuluri, Brad Porter, Dave Raggett, T. V. Raman, B. Helena Rodriguez, Muthuselvam Selvaraj, Andrew Wahbe, Piotr Wiechno, and Moshe Yudkowsky. Special thanks go to Kazuyuki Ashimura, the W3C Team Contact for the Multimodal Interaction Working Group, for his guidance through the W3C process, and to Jim Barnett, the Editor-in-Chief of the Multimodal Architecture and Interfaces specification.
Dahl, D.A. The W3C multimodal architecture and interfaces standard. J Multimodal User Interfaces 7, 171–182 (2013). https://doi.org/10.1007/s12193-013-0120-5