Multimedia Systems

, Volume 14, Issue 5, pp 299–323 | Cite as

COSMOROE: a cross-media relations framework for modelling multimedia dialectics

  • Katerina PastraEmail author
Regular Paper


Though everyday interaction is predominantly multimodal, a purpose-developed framework for describing the semantic interplay between verbal and non-verbal communication is still lacking. This lack not only indicates one’s poor understanding of multimodal human behaviour, but also weakens any attempt to model such behaviour computationally. In this article, we present COSMOROE, a corpus-based framework for describing semantic interrelations between images, language and body movements. We argue that in viewing such relations from a message-formation perspective rather than a communicative goal one, one may develop a framework with descriptive power and computational applicability. We test COSMOROE for compliance to these criteria, by using it for annotating a corpus of TV travel programmes; we present all particulars of the annotation process and conclude with a discussion on the usability and scope of such annotated corpora.


Cross-media relations Image–language–movement interaction Multimedia semantics Multimedia dialectics Multimedia discourse 


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.


  1. 1.
    André, E., Rist, T.: The design of illustrated documents as a planning task. In: Maybury, M. (ed.) Intelligent Multimedia Interfaces, pp. 94–116, Chap. 4. AAAI Press/MIT Press, Cambridge, MA (1993)Google Scholar
  2. 2.
    André, E., Rist, T.: Referring to world objects with text and pictures. In: Proceedings of the Computational Linguistics Conference, pp. 530–534 (1994)Google Scholar
  3. 3.
    Barnard K., Duygulu P., Forsyth D., Freitas N., Blei D., Jordan M.: Matching words and pictures. J. Mach. Learn. Res. 3, 1107–1135 (2003)zbMATHCrossRefGoogle Scholar
  4. 4.
    Barras, C., Geoffrois, E., Wu, Z., Liberman, M.: Transcriber: a free tool for segmenting, labeling and transcribing speech. In: Proceedings of the First International Conference on Language Resources and Evaluation, pp. 1373–1376 (1998)Google Scholar
  5. 5.
    Barthes, R.: Image, Music, Text. Flamingo (1984)Google Scholar
  6. 6.
    Bateman, J., Delin, J., Allen, P.: Constraints on layout in multimodal document generation. In: Proceedings of the Workshop on Coherence in Generated Multimedia, First International Natural Language Generation Conference (2000)Google Scholar
  7. 7.
    Bateman, J., Delin, J., Henschel, R.: Multimodality and empiricism: preparing for a corpus-based approach to the study of multimodal meaning-making. In: Perspectives on Multimodality, pp. 65–89. John Benjamins, Amsterdam (2004)Google Scholar
  8. 8.
    Bernsen N.: Why are analogue graphics and natural language both needed in hci? In: Paterno, F. (ed.) Interactive Systems: Design, specification and verification. Focus on Computer Graphics, pp. 235–251. Springer, Berlin (1995)Google Scholar
  9. 9.
    Bordegoni M., Faconti G., Feiner S., Maybury M., Rist T., Ruggieri S., Trahanias P., Wilson M.: A standard reference model for intelligent multimedia presentation systems. Computer Standards Interfaces 18(6/7), 477–496 (1997)CrossRefGoogle Scholar
  10. 10.
    Carletta J.: Assessing agreement on classification tasks: the kappa statistic. Comput. Linguist. 22(2), 249–254 (1996)Google Scholar
  11. 11.
    Carlson, L., Marcu, D., Okurowski, M.: Building a discourse-tagged corpus in the framework of rhetorical structure theory. In: Current Directions in Discourse and Dialogue, pp. 85–112. Kluwer, Dordrecht (2003)Google Scholar
  12. 12.
    de Carolis, B., Pelachaud, C., Poggi, I.: Verbal and nonverbal discourse planning, proceedings of fourth international conference on autonomous agents. In: Proceedings of the Workshop on Achieving Human-Like Behaviour in Interactive Animated Agents, Fourth International Conference on Autonomous Agents (2000)Google Scholar
  13. 13.
    Cassell, J.: A framework for gesture generation and interpretation. In: Computer Vision in Human–Machine Interaction, Chap. 11. Cambridge University Press, London (1998)Google Scholar
  14. 14.
    Chen, L., Liu, Y., Harper, M., Maia, E., McRoy, S.: Evaluating factors impacting the accuracy of forced alignments in a multimodal corpus. In: Proceedings of the 4th Language Resources and Evaluation Conference (2004)Google Scholar
  15. 15.
    Corio, M., Lapalme, G.: Integrated generation of graphics and text: a corpus study. In: Proceedings of the Association of Computational Linguistics Workshop on Content Visualisation and Intermedia Representation, pp. 63–68 (1998)Google Scholar
  16. 16.
    Corio, M., Lapalme, G.: Generation of texts for information graphics. In: Proceedings of the European Workshop on Natural Languge Generation, pp. 49–58 (1999)Google Scholar
  17. 17.
    Crewson P.: Fundamental of clinical research for radiologists: reader agreement studies. Am. J. Roentgenol. 184, 1391–1397 (2005)Google Scholar
  18. 18.
    Dasiopoulou, S., Papastathis, V., Mezaris, V., Kompatsiaris, I., Strintzis, M.: An ontology framework for knowledge-assisted semantic video analysis and annotation. In: Proceedings of the International Workshop on Knowledge Markup and Semantic Annotation (2004)Google Scholar
  19. 19.
    Everingham, M., Gool, L.V., Williams, C., Zisserman, A.: Pascal visual object classes challenge results. World Wide Web ( (2005)
  20. 20.
    Fasciano, M., Lapalme, G.: Intentions in the co-ordinated generation of graphics and text from tabular data. Knowl. Inform. Syst. 2(3) (2000)Google Scholar
  21. 21.
    Feiner, S., McKeown, K.: Automating the generation of co-ordinated multimedia explanations. In: Maybury, M. (ed.) Intelligent Multimedia Interfaces, pp. 117–138, chap. 5. AAAI Press/MIT Press, Cambridge, MA (1993)Google Scholar
  22. 22.
    Fellbaum,C. (ed.):WordNet:An Electronic Lexical Database. The MIT Press, Cambridge, MA (1998)zbMATHGoogle Scholar
  23. 23.
    Green, N.: An empirical study of multimedia argumentation. In: Proceedings of the International Conference on Computational Sciences-Part I, pp. 1009–1018. Springer, Berlin (2001)Google Scholar
  24. 24.
    Gut, U., Looks, K., Thies, A., Trippel, T., Gibbon, D.: Cogest conversational gesture transcription system. Tech. rep., University of Bielefeld (2002)Google Scholar
  25. 25.
    Jackendoff R.: Consciousness and the Computational Mind. MIT Press, Cambridge (1987)Google Scholar
  26. 26.
    Kendon A.: Gesture: Visible Action as Utterance. Cambridge University Press, London (2004)Google Scholar
  27. 27.
    Kipp, M.: Gesture generation by imitation—from human behavior to computer character animation. Boca Raton, Florida: (2004)Google Scholar
  28. 28.
    Kipp, M.: Spatiotemporal coding in anvil. In: Proceedings of the 6th Language Resources and Evaluation Conference (2008)Google Scholar
  29. 29.
    Lin, C., Tseng, B., Smith, J.: Video collaborative annotation forum: Establishing ground-truth labels on large multimedia datasets. TRECVID Proceedings (2003)Google Scholar
  30. 30.
    Lindley, C., Davis, J., Nack, F., Rutledge, L.: The application of rhetorical structure theory to interactive news program generation from digital archives. Technical Report INS-R0101, Centrum voor Wiskunde en Informatica (2001)Google Scholar
  31. 31.
    Magno-Caldognetto, E., Poggio, I., Cosi, P., Cavicchio, F., Merola, G.: Multimedia score—an anvil-based annotation scheme for multimodal audio-video analysis. In: Proceedings of the LREC Workshop on Multimodal Corpora: Models of Human Behaviour for the Specification and Evaluation Of Multimodal Input And Output Interfaces, pp. 29–33 (2004)Google Scholar
  32. 32.
    Mann W., Thompson S.: Rhetorical structure theory: description and construction of text structures. In: Kempen, G.(eds) Natural Language Generation: New results in Artificial Intelligence, Psychology and Linguistics, pp. 85–95. Nijhoff, Dodrecht (1987)Google Scholar
  33. 33.
    Marsh E., Domas-White M.: A taxonomy of relationships between image and text. J. Document. 59(6), 647–672 (2003)CrossRefGoogle Scholar
  34. 34.
    Martin, J., Grimard, S., Alexandri, K.: On the annotation of multimodal behavior and computation of cooperation between modalities. In: Proceedings of the International Conference on Autonomous Agents workshop on Representing, Annotating, Evaluating Non-verbal and Verbal Communicative Acts to Achieve Contextual Embodied Agents, pp. 1–7 (2001)Google Scholar
  35. 35.
    Martin, J., Julia, L., Cheyer, A.: A theoretical framework for multimodal user studies. In: Proceedings of the Second International Conference on Cooperative Multimodal Communication, pp. 104–110 (1998)Google Scholar
  36. 36.
    Martin, J., Kipp, M.: Annotating and measuring multimodal behaviour—tycoon metrics in the anvil tool. In: Proceedings of the Language Resources and Evaluation Conference 2002, pp. 31–35 (2002)Google Scholar
  37. 37.
    Martinec R., Salway A.: A system for image–text relations in new (and old) media. Vis. Commun. 4(3), 339–374 (2005)Google Scholar
  38. 38.
    Maybury, M. (ed.): Intelligent Multimedia Interfaces. AAAI Press/MIT Press, Cambridge, MA (1993)Google Scholar
  39. 39.
    Maybury, M.,Wahlster,W. (eds.): Intelligent User Interfaces. Morgan Kaufmann Publishers, San Francisco, CA (1998)Google Scholar
  40. 40.
    McNeil D.: Gesture and Thought. The University of Chicago Press, Chicago, IL (2005)Google Scholar
  41. 41.
    Minsky, M.: The Society of Mind. Simon and Schuster Inc., NY, USA (1986)Google Scholar
  42. 42.
    Moore J., Paris C.: Planning text for advisory dialogues: capturing intentional and rhetorical information. Comput. Linguist. 19(4), 651–695 (1993)Google Scholar
  43. 43.
    Moore J., Pollack M.: Problem for RST: the need for multi-level discourse analysis. Comput. Linguist. 18(4), 537–544 (1992)Google Scholar
  44. 44.
    Nicholas, N.: Parameters for rhetorical structure theory ontology. In: University of Melbourne Working Papers in Linguistics, vol. 15, pp. 77–93. University of Melbourne, Melbourne (1995)Google Scholar
  45. 45.
    Pastra, K.: The language of caricature: language and drawing interaction. Final year project, Department of Greek Philology and Linguistics, University of Athens (1999) (in Greek)Google Scholar
  46. 46.
    Pastra, K.: Viewing vision–language integration as a double-grounding case. In: Proceedings of the AAAI Fall Symposium on Achieving Human-Level Intelligence through Integrated Systems and Research, pp. 62–67 (2004)Google Scholar
  47. 47.
    Pastra, K.: Vision–language integration: a double-grounding case. Ph.D. thesis, University of Sheffield (2005)Google Scholar
  48. 48.
    Pastra, K.: Beyond multimedia integration: corpora and annotations for cross-media decision mechanisms. In: Proceedings of the 5th Language Resources and Evaluation Conference, pp. 499–504 (2006)Google Scholar
  49. 49.
    Pastra, K., Piperidis, S.: Video search: new challenges in the pervasive digital video era. J. Virtual Reality Broadcast. 3(11) (2006)Google Scholar
  50. 50.
    Pastra K., Saggion H., Wilks Y.: Intelligent indexing of crime-scene photographs. IEEE Intell. Syst. 18(1), 55–61 (2003)CrossRefGoogle Scholar
  51. 51.
    Pastra, K., Wilks, Y.: Vision–language integration in AI: a reality check. In: Proceedings of the 16th European Conference in Artificial Intelligence, pp. 937–941 (2004)Google Scholar
  52. 52.
    Radev, D.: A common theory of information fusion from multiple text sources. step one: cross document structure. In: Proceedings of the 1st SIGdial Workshop on Discourse and Dialogue, pp. 74–83 (2000)Google Scholar
  53. 53.
    Rocchi, C., Zancanaro, M.: Generation of video documentaries from discourse structures. In: Proceedings of the 9th European Workshop on Natural Language Generation (EWNLG 9) (2003)Google Scholar
  54. 54.
    Sanders T., Spooren W., Noordman L.: Toward a taxonomy of coherence relations. Discourse Process. 15, 1–35 (1992)CrossRefGoogle Scholar
  55. 55.
    Simou, N., Tzouvaras, V., Avrithis, Y., Stamou, G., Kollias, S.: A visual descriptor ontology for multimedia reasoning. In: Proceedings of the workshop on Image Analysis for Multimedia Interactive Services (WIAMIS) (2005)Google Scholar
  56. 56.
    Srikanth, M., Varner, J., Bowden, M., Moldovan, D.: Exploiting ontologies for authomatic image annotation. In: Proceedings of the ACM Special Interest Group in Information Retrieval (SIGIR), pp. 552–558 (2005)Google Scholar
  57. 57.
    Taboada M., Mann W.: Rhetorical structure theory: looking back and moving ahead. Discourse Stud. 8(3), 423–459 (2006)CrossRefGoogle Scholar
  58. 58.
    Wachsmuth, S., Stevenson, S., Dickinson, S.: Towards a framework for learning structured shape models from text-annotated images. In: Proceedings of the HLT-NAACL Workshop on Learning Word Meaning from non-linguistic Data (2003)Google Scholar
  59. 59.
    Whittaker, S., Walker, M.: Toward a theory of multi-modal interaction. In: Proceedings of the National Conference on Artificial Intelligence Workshop on Multi-modal Interaction (1991)Google Scholar

Copyright information

© Springer-Verlag 2008

Authors and Affiliations

  1. 1.Department of Language Technology ApplicationsInstitute for Language and Speech ProcessingAthensGreece

Personalised recommendations