International Journal of Computer Vision, Volume 83, Issue 1, pp 57–71

Seeing the Objects Behind the Dots: Recognition in Videos from a Moving Camera

Article

Abstract

Category-level object recognition, segmentation, and tracking in videos become highly challenging when applied to sequences from a hand-held camera that features extensive motion and zooming. An additional challenge is to develop a fully automatic video analysis system that works without manual initialization of a tracker or other human intervention, both during training and during recognition, despite background clutter and other distracting objects. Moreover, our working hypothesis states that category-level recognition is possible based only on an erratic, flickering pattern of interest point locations, without extracting additional features. Compositions of these points are tracked individually by estimating a parametric motion model. Groups of compositions segment a video frame into the various objects that are present and into background clutter. Objects can then be recognized and tracked based on the motion of their compositions and on the shape they form. Finally, the combination of this flow-based representation with an appearance-based one is investigated. Besides evaluating the approach on a challenging video categorization database with significant camera motion and clutter, we also demonstrate that it generalizes to action recognition in a natural way.
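
To make the idea of tracking a composition of interest points with a parametric motion model more concrete, the following is a minimal sketch only. The abstract does not state which parametric model is used, so the affine form, the least-squares fit, and all names here (`fit_affine_motion`, `apply_affine`, `solve_3x3`) are illustrative assumptions rather than the authors' implementation. It fits x' = a*x + b*y + c, y' = d*x + e*y + f to point correspondences between two frames.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Affine = Tuple[Tuple[float, float, float], Tuple[float, float, float]]


def solve_3x3(m: List[List[float]], rhs: List[float]) -> List[float]:
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    a = [row[:] + [r] for row, r in zip(m, rhs)]  # augmented copy, input untouched
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("degenerate configuration (e.g. collinear points)")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = a[i][n] - sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / a[i][i]
    return x


def fit_affine_motion(prev_pts: Sequence[Point], next_pts: Sequence[Point]) -> Affine:
    """Least-squares affine fit mapping prev_pts onto next_pts (assumed model)."""
    if len(prev_pts) != len(next_pts) or len(prev_pts) < 3:
        raise ValueError("need at least three point correspondences")
    # Accumulate the normal equations (A^T A) p = A^T b with rows [x, y, 1].
    ata = [[0.0] * 3 for _ in range(3)]
    atx = [0.0] * 3
    aty = [0.0] * 3
    for (x, y), (xn, yn) in zip(prev_pts, next_pts):
        row = (x, y, 1.0)
        for i in range(3):
            for j in range(3):
                ata[i][j] += row[i] * row[j]
            atx[i] += row[i] * xn
            aty[i] += row[i] * yn
    a, b, c = solve_3x3(ata, atx)
    d, e, f = solve_3x3(ata, aty)
    return (a, b, c), (d, e, f)


def apply_affine(params: Affine, pt: Point) -> Point:
    """Predict where a point moves under the fitted affine motion."""
    (a, b, c), (d, e, f) = params
    x, y = pt
    return (a * x + b * y + c, d * x + e * y + f)


if __name__ == "__main__":
    # Hypothetical composition: four points that zoom by 2 and shift by (3, -1).
    prev = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    nxt = [(2 * x + 3, 2 * y - 1) for x, y in prev]
    model = fit_affine_motion(prev, nxt)
    print(model, apply_affine(model, (0.5, 0.5)))
```

A real system would additionally have to cope with interest points that appear and disappear between frames, which this sketch sidesteps by assuming the correspondences are already given.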

Keywords

Object recognition, Segmentation, Tracking, Video analysis, Compositionality, Visual learning

Supplementary material

11263_2009_211_MOESM1_ESM.mpg (1.1 MB)
11263_2009_211_MOESM2_ESM.mpg (1.9 MB)
11263_2009_211_MOESM3_ESM.mpg (1.6 MB)
11263_2009_211_MOESM4_ESM.mpg (3.9 MB)

Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • Björn Ommer (1)
  • Theodor Mader (2)
  • Joachim M. Buhmann (2)
  1. Department of EECS, University of California, Berkeley, USA
  2. Department of Computer Science, ETH Zurich, Zurich, Switzerland