Multimedia Tools and Applications, Volume 30, Issue 3, pp 229–253

Video retrieval of near-duplicates using κ-nearest neighbor retrieval of spatio-temporal descriptors

  • Daniel DeMenthon
  • David Doermann


This paper describes a novel methodology for implementing video search functions such as retrieval of near-duplicate videos and recognition of actions in surveillance video. Videos are divided into half-second clips whose stacked frames produce 3D space-time volumes of pixels. Pixel regions with consistent color and motion properties are extracted from these 3D volumes by a threshold-free hierarchical space-time segmentation technique. Each region is then described by a high-dimensional point whose components represent the position, orientation and, when possible, color of the region. In the indexing phase for a video database, these points are assigned labels that specify their video clip of origin. The labeled points for all the clips are stored in a single binary tree for efficient κ-nearest neighbor retrieval. The retrieval phase uses video segments as queries. Half-second clips of these queries are again segmented by space-time segmentation to produce sets of points, and for each point the labels of its nearest neighbors are retrieved. The labels that receive the largest numbers of votes correspond to the database clips that are most similar to the query video segment. We illustrate this approach for video indexing and retrieval and for action recognition. First, we describe retrieval experiments for dynamic logos, and for video queries that differ from the indexed broadcasts by the addition of large overlays. Then we describe experiments in which office actions (such as pulling and closing drawers, taking and storing items, picking up and putting down a phone) are recognized. Color information is ignored to ensure that action recognition is independent of people's appearance. A distinct advantage of this approach for action recognition is that it requires no detection or recognition of body parts.
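The label-voting step described in the abstract can be illustrated with a minimal sketch. It makes several assumptions that are not in the paper: the descriptors are toy 3-component points rather than real space-time region descriptors, the distance is Euclidean, and a brute-force scan stands in for the binary search tree the authors mention. The names `build_index`, `knn_labels`, and `rank_clips` are hypothetical and chosen only for this illustration.

```python
from collections import Counter
import heapq
import math


def build_index(clips):
    """Flatten {clip_label: [descriptor, ...]} into (descriptor, label) pairs."""
    return [(tuple(p), label) for label, points in clips.items() for p in points]


def knn_labels(index, query_point, k):
    """Labels of the k indexed points closest to query_point.

    Brute-force Euclidean scan (an assumption); a tree index would return
    the same neighbors faster on a large database.
    """
    nearest = heapq.nsmallest(
        k, index, key=lambda entry: math.dist(entry[0], query_point)
    )
    return [label for _, label in nearest]


def rank_clips(index, query_points, k=2):
    """Each query point votes for the clip labels of its k nearest neighbors."""
    votes = Counter()
    for q in query_points:
        votes.update(knn_labels(index, q, k))
    return votes.most_common()  # clips sorted by decreasing vote count


# Toy database: two clips, three descriptor points each (made-up values).
database = {
    "clip_A": [(0.0, 0.0, 0.0), (0.1, 0.0, 0.1), (0.0, 0.1, 0.0)],
    "clip_B": [(5.0, 5.0, 5.0), (5.1, 5.0, 4.9), (4.9, 5.1, 5.0)],
}
index = build_index(database)

# Toy query: two points near clip_A's descriptors, one near clip_B's.
query = [(0.05, 0.0, 0.05), (0.0, 0.05, 0.0), (5.0, 5.0, 5.0)]
ranking = rank_clips(index, query, k=2)
print(ranking)
```

In this toy run the clip whose descriptors lie closest to most query points collects the most votes and is ranked first. A query that differs from an indexed clip only by an overlay would, under the same reasoning, still match on the regions the overlay leaves untouched.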


Keywords: Content-based indexing and retrieval; Video retrieval of near-duplicates; Action recognition; Space-time segmentation; Spatio-temporal descriptors; Object motion



Support of this research by the Department of Defense under Contract MDA 9049-6C-1250 is gratefully acknowledged. The authors would also like to thank the anonymous reviewers for their helpful comments and suggestions.



Copyright information

© Springer Science + Business Media, LLC 2006

Authors and Affiliations

  1. Language and Media Processing (LAMP), University of Maryland Institute for Advanced Computer Studies (UMIACS), College Park, USA
