Extensible Multimodal Annotation for Intelligent Interactive Systems

Chapter in: Multimodal Interaction with W3C Standards

Abstract

Multimodal interactive systems that enable the combination of natural modalities such as speech, touch, and gesture make it easier and more effective for users to interact with applications and services, whether on mobile devices, in smart homes, or in cars. However, building these systems remains a complex and highly specialized task, in part because of the need to integrate multiple disparate and distributed system components. The task is further hindered by proprietary representations for the input and output of different types of modality processing components, such as speech recognizers, gesture recognizers, natural language understanding components, and dialog managers. The W3C EMMA standard addresses this challenge and simplifies multimodal application authoring by providing a common representation language for capturing the interpretation of user inputs and system outputs, along with associated metadata. In this chapter, we describe the EMMA markup language and demonstrate its capabilities through a series of illustrative examples.
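
To make the representation concrete, the following is a minimal sketch of an EMMA 1.0 document for a hypothetical spoken flight query; the utterance, element ids, application semantics (origin, destination), and confidence value are illustrative assumptions rather than examples drawn from the chapter.

    <emma:emma version="1.0"
               xmlns:emma="http://www.w3.org/2003/04/emma">
      <!-- A single interpretation of one spoken input; the id, token
           string, and semantic payload are illustrative assumptions -->
      <emma:interpretation id="int1"
                           emma:medium="acoustic"
                           emma:mode="voice"
                           emma:tokens="flights from boston to denver"
                           emma:confidence="0.82">
        <origin>Boston</origin>
        <destination>Denver</destination>
      </emma:interpretation>
    </emma:emma>

Here the emma:interpretation element wraps the application-specific semantics, while the emma:* attributes carry standardized metadata about the medium, mode, recognized tokens, and the producer's confidence in the interpretation.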


Notes

  1. The W3C recommendation EMMA 1.0 addresses only inputs. Proposals for EMMA 2.0 extend the standard to represent output processing.

  2. The EMMA language does not require the confidence score to be a probability, and there is no expectation or requirement that confidence values be comparable across different producers of EMMA, other than that values closer to 1 indicate higher confidence while values closer to 0 indicate lower confidence.
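
As a sketch of how such confidence annotations appear in practice, the fragment below shows a hypothetical recognizer returning two competing hypotheses inside an emma:one-of container; the utterances, ids, and scores are assumptions, and per note 2 the scores rank the hypotheses without needing to be true probabilities.

    <emma:emma version="1.0"
               xmlns:emma="http://www.w3.org/2003/04/emma">
      <!-- Competing interpretations of one input, ranked by
           emma:confidence; the values order the hypotheses but
           need not be probabilities or sum to 1 -->
      <emma:one-of id="nbest1" emma:medium="acoustic" emma:mode="voice">
        <emma:interpretation id="int1"
                             emma:tokens="flights to boston"
                             emma:confidence="0.75">
          <destination>Boston</destination>
        </emma:interpretation>
        <emma:interpretation id="int2"
                             emma:tokens="flights to austin"
                             emma:confidence="0.40">
          <destination>Austin</destination>
        </emma:interpretation>
      </emma:one-of>
    </emma:emma>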


Acknowledgements

I would like to acknowledge the many contributors to the EMMA standard from around the world including Deborah Dahl, Kazuyuki Ashimura, Paolo Baggia, Roberto Pieraccini, Dan Burnett, Dave Raggett, Stephen Potter, Nagesh Kharidi, Raj Tumuluri, Jerry Carter, Wu Chou, Gerry McCobb, Tim Denney, Max Froumentin, Katrina Halonen, Jin Liu, Massimo Romanelli, T. V. Raman, and Yuan Shao.

Author information


Correspondence to Michael Johnston.


Copyright information

© 2017 Springer International Publishing Switzerland

Cite this chapter

Johnston, M. (2017). Extensible Multimodal Annotation for Intelligent Interactive Systems. In: Dahl, D. (Ed.), Multimodal Interaction with W3C Standards. Springer, Cham. https://doi.org/10.1007/978-3-319-42816-1_3

  • DOI: https://doi.org/10.1007/978-3-319-42816-1_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-42814-7

  • Online ISBN: 978-3-319-42816-1

  • eBook Packages: Engineering (R0)
