Multimedia Tools and Applications, Volume 70, Issue 1, pp 413–432

An innovative web-based collaborative platform for video annotation

  • Isaak Kavasidis
  • Simone Palazzo
  • Roberto Di Salvo
  • Daniela Giordano
  • Concetto Spampinato

Abstract

Large-scale labeled datasets are of key importance for the development of automatic video analysis tools: on the one hand, they enable the training of multi-class classifiers; on the other hand, they support the evaluation of algorithms. This is widely recognized by the multimedia and computer vision communities, as witnessed by the growing number of available datasets; however, the field still lacks annotation tools that meet users' needs, since generating high-quality ground truth data demands considerable human concentration. Moreover, it is not feasible to collect large video ground truths, covering as many scenarios and object categories as possible, through the effort of isolated research groups alone. In this paper we present a collaborative web-based platform for video ground truth annotation. It features an easy and intuitive user interface that supports straightforward video annotation and instant sharing and integration of the generated ground truths, not only reducing the effort and time required but also improving the quality of the annotations. The tool has been online for the last four months and, to date, we have collected about 70,000 annotations. A comparative performance evaluation has also shown that our system outperforms existing state-of-the-art methods in terms of annotation time, annotation quality, and system usability.

Keywords

Ground truth data · Video labeling · Object detection · Object tracking · Image segmentation


Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • Isaak Kavasidis (1)
  • Simone Palazzo (1)
  • Roberto Di Salvo (1)
  • Daniela Giordano (1)
  • Concetto Spampinato (1)

  1. Department of Electrical, Electronics and Computer Engineering, University of Catania, Catania, Italy
