Multimedia Tools and Applications, Volume 70, Issue 2, pp 1277–1308

CASAM: collaborative human-machine annotation of multimedia

  • Robert J. Hendley
  • Russell Beale
  • Chris P. Bowers
  • Christos Georgousopoulos
  • Charalampos Vassiliou
  • Sergios Petridis
  • Ralf Moeller
  • Eric Karstens
  • Dimitris Spiliotopoulos


The CASAM multimedia annotation system implements a model of cooperative annotation between a human annotator and automated components, with the aim that they work asynchronously but together. The system focuses on the areas where automated recognition and reasoning are most effective, while the user works in the areas where human skills are required. The system’s reasoning is influenced by the annotations provided by the user and, similarly, the user can see the system’s work and modify and, implicitly, direct it. CASAM interacts with the user by providing a window onto the current state of annotation, and by generating requests for information that is important for the final annotation or that constrains its reasoning. The user can modify the annotation, respond to requests and add their own annotations. The objective is that the human annotator’s time is used more effectively and that the resulting annotation is both of higher quality and produced more quickly. This is especially important when the annotator has very limited time in which to annotate the document. In this paper we describe our prototype system. We expand upon the techniques used for automatically analysing the multimedia document, for reasoning over the annotations generated and for generating an effective interaction with the end-user. We also present the results of evaluations undertaken with media professionals in order to validate the approach and to gather feedback to drive further research.


Keywords: Annotation · Synergistic · Collaborative · Human · Artificial intelligence · Ontology · Video



This work was supported by the European Commission and partly funded through project FP7-217061. We would like to thank all members of the CASAM project team who contributed to the results of this work, and all the users who gave their time and comments.



Copyright information

© The Authors 2013

Authors and Affiliations

  • Robert J. Hendley (1)
  • Russell Beale (1)
  • Chris P. Bowers (1)
  • Christos Georgousopoulos (2)
  • Charalampos Vassiliou (2)
  • Sergios Petridis (3)
  • Ralf Moeller (4)
  • Eric Karstens (5)
  • Dimitris Spiliotopoulos (6)

  1. School of Computer Science, University of Birmingham, Birmingham, UK
  2. INTRASOFT International S.A., Peania, Greece
  3. Institute of Informatics and Telecommunications, NCSR, Athens, Greece
  4. Software Technology and Systems Institute, TUHH, Hamburg, Germany
  5. European Journalism Centre, Maastricht, The Netherlands
  6. Athens Technology Center S.A., Athens, Greece
