Local Invariant Feature Tracks for High-Level Video Feature Extraction

  • Vasileios Mezaris
  • Anastasios Dimou
  • Ioannis Kompatsiaris
Part of the Lecture Notes in Electrical Engineering book series (LNEE, volume 158)


This work proposes the use of feature tracks for the detection of high-level features (concepts) in video. Extending previous work on local interest point detection and description in images, feature tracks are defined as sets of local interest points that are found in different frames of a video shot and exhibit spatio-temporal and visual continuity, thus defining a trajectory in the 2D+Time space. These tracks jointly capture the spatial attributes of 2D local regions and their corresponding long-term motion. Extracting the feature tracks, then selecting and representing an appropriate subset of them, allows a Bag-of-Spatiotemporal-Words model to be generated for the shot, which helps capture the dynamics of the video content. Experimental evaluation of the proposed approach on two challenging datasets (TRECVID 2007, TRECVID 2010) shows that the selection, representation and use of such feature tracks improve on the results of traditional keyframe-based concept detection techniques.
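
The idea of linking interest points into tracks and summarising a shot as a histogram of track "words" can be illustrated with a minimal sketch. This is not the authors' algorithm: the greedy nearest-neighbour linking, the thresholds `max_shift` and `max_desc_dist`, the use of a mean descriptor per track, and the function names `link_tracks` and `bag_of_tracks` are all assumptions made for illustration, and the codebook is taken as given.

```python
import numpy as np


def link_tracks(frames, max_shift=5.0, max_desc_dist=0.5):
    """Greedily link interest points across consecutive frames (illustrative only).

    frames: list of (positions, descriptors), with shapes (N, 2) and (N, D).
    Returns a list of tracks, each a list of (frame_index, point_index).
    """
    tracks = []   # every track found so far
    active = []   # indices of tracks that were extended in the previous frame
    for t, (pos, desc) in enumerate(frames):
        used = set()
        next_active = []
        for ti in active:
            pt, pi = tracks[ti][-1]
            prev_pos, prev_desc = frames[pt][0][pi], frames[pt][1][pi]
            best, best_d = None, max_desc_dist
            for j in range(len(pos)):
                if j in used:
                    continue
                # Spatio-temporal continuity: the point must not move too far.
                if np.linalg.norm(pos[j] - prev_pos) > max_shift:
                    continue
                # Visual continuity: the descriptor must stay similar.
                d = np.linalg.norm(desc[j] - prev_desc)
                if d <= best_d:
                    best, best_d = j, d
            if best is not None:
                used.add(best)
                tracks[ti].append((t, best))
                next_active.append(ti)
        # Unmatched points start new tracks.
        for j in range(len(pos)):
            if j not in used:
                tracks.append([(t, j)])
                next_active.append(len(tracks) - 1)
        active = next_active
    return tracks


def bag_of_tracks(frames, tracks, codebook, min_length=2):
    """Normalised histogram of codewords over tracks of at least min_length points.

    Each track is represented by its mean descriptor (an assumed stand-in,
    not the descriptor used in the chapter).
    """
    hist = np.zeros(len(codebook))
    for track in tracks:
        if len(track) < min_length:
            continue
        mean_desc = np.mean([frames[t][1][i] for t, i in track], axis=0)
        word = np.argmin(np.linalg.norm(codebook - mean_desc, axis=1))
        hist[word] += 1
    total = hist.sum()
    return hist / total if total > 0 else hist
```

A fuller treatment would also encode each track's motion in its representation, since the tracks are meant to capture long-term motion as well as appearance; the sketch above uses appearance only.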


Feature tracks; Video concept detection; Trajectory; LIFT descriptor; Bag-of-Spatiotemporal-Words



This work was supported by the European Commission under contract FP7-248984 GLOCAL.



Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • Vasileios Mezaris (1)
  • Anastasios Dimou (1)
  • Ioannis Kompatsiaris (1)
  1. Centre for Research and Technology Hellas, Informatics and Telematics Institute, Thermi, Greece
