International Journal of Computer Vision, Volume 67, Issue 2, pp 189–210

Object Level Grouping for Video Shots


  • Josef Sivic
    • Department of Engineering Science, University of Oxford
  • Frederik Schaffalitzky
    • Department of Engineering Science, University of Oxford
  • Andrew Zisserman
    • Department of Engineering Science, University of Oxford

DOI: 10.1007/s11263-005-4264-y

Cite this article as:
Sivic, J., Schaffalitzky, F. & Zisserman, A. Int J Comput Vision (2006) 67: 189. doi:10.1007/s11263-005-4264-y


We describe a method for automatically obtaining object representations suitable for retrieval from generic video shots. The object representation consists of an association of frame regions. These regions provide exemplars of the object’s possible visual appearances.

Two ideas are developed: (i) associating regions within a single shot to represent a deforming object; (ii) associating regions from the multiple visual aspects of a 3D object, thereby implicitly representing 3D structure. For the association we exploit temporal continuity (tracking) and wide baseline matching of affine covariant regions.

The implementation has three areas of novelty. First, we describe a method to repair short gaps in tracks. Second, we show how to join tracks across occlusions (where many tracks terminate simultaneously). Third, we develop an affine factorization method that copes with motion degeneracy.

We obtain tracks that last throughout the shot, without requiring a 3D reconstruction. The factorization method is used to associate tracks into object-level groups, with common motion. The outcome is that separate parts of an object that are not simultaneously visible (such as the front and back of a car, or the front and side of a face) are associated together. In turn this enables object-level matching and recognition throughout a video.
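The idea of grouping tracks by common motion can be illustrated with the textbook affine rank constraint: under an affine camera, the centred measurement matrix of points on one rigidly moving object has rank at most 3. The sketch below is only that basic constraint, not the robust, degeneracy-aware factorization developed in the article; it assumes complete, outlier-free tracks, and the names `affine_rank_residual` and `synthetic_tracks` are hypothetical helpers introduced here for illustration.

```python
import numpy as np


def affine_rank_residual(W, rank=3):
    """Fraction of energy outside the best rank-`rank` fit of the centred matrix.

    W has shape (2F, P): two rows (x, y) per frame, one column per track.
    Assumes every track is observed in every frame (no missing data).
    """
    W = np.asarray(W, dtype=float)
    # Subtracting the per-row mean removes the translation of each frame.
    centred = W - W.mean(axis=1, keepdims=True)
    s = np.linalg.svd(centred, compute_uv=False)
    total = np.sum(s ** 2)
    if total == 0.0:
        return 0.0
    return float(np.sum(s[rank:] ** 2) / total)


def synthetic_tracks(points3d, n_frames, rng):
    """Project 3D points of shape (3, P) with a random affine camera per frame."""
    rows = []
    for _ in range(n_frames):
        M = rng.normal(size=(2, 3))  # affine projection for this frame
        t = rng.normal(size=(2, 1))  # image translation for this frame
        rows.append(M @ points3d + t)
    return np.vstack(rows)


rng = np.random.default_rng(0)
obj_a = synthetic_tracks(rng.normal(size=(3, 12)), n_frames=10, rng=rng)
obj_b = synthetic_tracks(rng.normal(size=(3, 12)), n_frames=10, rng=rng)

# Tracks from one rigid object fit a rank-3 model almost exactly;
# mixing two independently moving objects leaves a clear residual.
single = affine_rank_residual(obj_a)
mixed = affine_rank_residual(np.hstack([obj_a, obj_b]))
print(f"single object residual: {single:.2e}")
print(f"two objects residual:   {mixed:.2e}")
```

A low residual suggests that a set of tracks is consistent with a single rigid affine motion. Real footage would additionally need handling of missing entries, outliers, and degenerate motion, which this sketch does not attempt.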

We illustrate the method on the feature film “Groundhog Day.” Examples are given for the retrieval of deforming objects (heads, walking people) and rigid objects (vehicles, locations).


3D object retrieval in videos, tracking affine covariant regions, independent motion segmentation, robust affine factorization

Copyright information

© Springer Science + Business Media, Inc. 2006