On the use of commonsense ontology for multimedia event recounting

  • Chun-Chet Tan
  • Chong-Wah Ngo
Regular Paper


Narrating in text the observed evidence that explains why a video clip was retrieved for an event remains a highly challenging problem. This paper explores the use of a commonsense ontology, namely ConceptNet, to generate short descriptions that recount the audio–visual evidence. The ontology serves as a knowledge engine that provides event-relevant common sense, expressed as concepts and their relationships, for semantic understanding, context-based concept screening, and sentence synthesis. A principled way of exploiting the ontology, from extracting the event-relevant semantic network to forming syntactic parse trees, is outlined and discussed. Experimental results on two benchmark datasets (TRECVID MED and MediaEval) show the effectiveness of our approach. The findings offer insights into the usability of common sense for multimedia search, including the feasibility of inferring relevant concepts for event detection and the quality of the generated sentences in meeting human expectations.


Event detection · Event recounting · Ontology



Copyright information

© Springer-Verlag London 2015

Authors and Affiliations

  1. Department of Computer Science, City University of Hong Kong, Hong Kong, China