Research in Multimedia and Multimodal Parsing and Generation

Chapter in: Integration of Natural Language and Vision Processing

Abstract

This overview introduces the emerging set of techniques for parsing and generating multiple media (e.g., text, graphics, maps, gestures) using multiple sensory modalities (e.g., auditory, visual, tactile). We first briefly introduce these techniques and motivate their value. Next, we describe various computational methods for parsing input from heterogeneous media and modalities (e.g., natural language, gesture, gaze). We then survey complementary techniques for generating coordinated multimedia and multimodal output. Finally, we discuss systems that integrate both parsing and generation to enable multimedia dialogue in the context of intelligent interfaces. The article concludes by outlining fundamental problems that require further research.


Copyright information

© 1995 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Maybury, M.T. (1995). Research in Multimedia and Multimodal Parsing and Generation. In: Mc Kevitt, P. (ed.) Integration of Natural Language and Vision Processing. Springer, Dordrecht. https://doi.org/10.1007/978-94-011-0445-6_6

  • DOI: https://doi.org/10.1007/978-94-011-0445-6_6

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-94-010-4199-7

  • Online ISBN: 978-94-011-0445-6

  • eBook Packages: Springer Book Archive
