Object Level Grouping for Video Shots

  • Josef Sivic
  • Frederik Schaffalitzky
  • Andrew Zisserman
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3022)


We describe a method for automatically associating image patches from frames of a movie shot into object-level groups. The method employs both the appearance and motion of the patches.

There are two areas of innovation: first, affine invariant regions are used to repair short gaps in individual tracks and also to join sets of tracks across occlusions (where many tracks are lost simultaneously); second, a robust affine factorization method is developed which is able to cope with motion degeneracy. This factorization is used to associate tracks into object-level groups.

The outcome is that separate parts of an object that are never visible simultaneously in a single frame are associated together, for example the front and back of a car, or the front and side of a face. In turn, this enables object-level matching and recognition throughout a video.

We illustrate the method for a number of shots from the feature film ‘Groundhog Day’.
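
To make the grouping idea concrete, here is a minimal sketch of plain affine factorization by SVD, used to test whether a track is consistent with an object's motion. It is not the authors' robust method: it assumes every track is visible in every frame, contains no outlier handling, and does not address motion degeneracy. The function names, the rank of 3, and the synthetic data are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def affine_factorization(W, rank=3):
    """Factor a complete measurement matrix W of shape (2F, P).

    Rows hold the x and y image coordinates for each of F frames,
    columns hold P tracks. Returns motion M (2F, rank),
    structure S (rank, P) and per-row translation t (2F, 1).
    Assumption: no missing entries and no outliers.
    """
    t = W.mean(axis=1, keepdims=True)           # centroid of the tracks per row
    U, s, Vt = np.linalg.svd(W - t, full_matrices=False)
    M = U[:, :rank] * s[:rank]                  # affine motion (up to a 3x3 ambiguity)
    S = Vt[:rank]                               # affine structure
    return M, S, t

def track_residual(w, M, t):
    """RMS reprojection error of one track w (length 2F) under motion M.

    The best 3D point for the track is found by linear least squares;
    a small residual suggests the track moves with the same object.
    """
    d = np.asarray(w, dtype=float).reshape(-1, 1) - t
    X, *_ = np.linalg.lstsq(M, d, rcond=None)
    r = d - M @ X
    n_frames = d.shape[0] // 2
    return float(np.sqrt((r ** 2).sum() / n_frames))

# Synthetic example (hypothetical data): one rigid object seen by affine cameras.
rng = np.random.default_rng(0)
F, P = 10, 20
A = rng.normal(size=(2 * F, 3))                 # stacked 2x3 affine cameras
b = rng.normal(size=(2 * F, 1))                 # stacked image translations
X_obj = rng.normal(size=(3, P))                 # 3D points on the object
W = A @ X_obj + b

M, S, t = affine_factorization(W)

# A new track on the same object versus a track with unrelated motion.
same_object = (A @ rng.normal(size=(3, 1)) + b).ravel()
other_motion = rng.normal(size=2 * F)
print(track_residual(same_object, M, t), track_residual(other_motion, M, t))
```

In a sketch like this, a track would be assigned to the group when its residual falls below a chosen pixel threshold. Real shots would also need handling for tracks that are missing in some frames and for mismatched tracks, which is where a robust factorization comes in.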


Keywords: Invariant Region, Object Level, Video Shot, Query Region, Reprojection Error
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Josef Sivic (1)
  • Frederik Schaffalitzky (1)
  • Andrew Zisserman (1)
  1. Robotics Research Group, Department of Engineering Science, University of Oxford
