International Journal of Computer Vision, Volume 67, Issue 2, pp 189-210

Object Level Grouping for Video Shots

  • Josef Sivic, Department of Engineering Science, University of Oxford
  • Frederik Schaffalitzky, Department of Engineering Science, University of Oxford
  • Andrew Zisserman, Department of Engineering Science, University of Oxford


We describe a method for automatically obtaining object representations suitable for retrieval from generic video shots. The object representation consists of an association of frame regions. These regions provide exemplars of the object’s possible visual appearances.

Two ideas are developed: (i) associating regions within a single shot to represent a deforming object; (ii) associating regions from the multiple visual aspects of a 3D object, thereby implicitly representing 3D structure. For the association we exploit temporal continuity (tracking) and wide baseline matching of affine covariant regions.

In the implementation there are three areas of novelty: First, we describe a method to repair short gaps in tracks. Second, we show how to join tracks across occlusions (where many tracks terminate simultaneously). Third, we develop an affine factorization method that copes with motion degeneracy.

We obtain tracks that last throughout the shot, without requiring a 3D reconstruction. The factorization method is used to associate tracks into object-level groups, with common motion. The outcome is that separate parts of an object that are not simultaneously visible (such as the front and back of a car, or the front and side of a face) are associated together. In turn this enables object-level matching and recognition throughout a video.
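To make the grouping idea concrete, here is a minimal sketch of plain rank-3 affine factorization (in the style of Tomasi and Kanade) on synthetic tracks. It is not the robust, degeneracy-aware method the abstract refers to. The function names, the synthetic data, and the use of the reconstruction residual as a "common motion" score are all assumptions made for illustration. The sketch relies on the fact that tracks from one rigid object seen by affine cameras give a centered measurement matrix of rank at most 3, whereas tracks from two independently moving objects generally do not.

```python
import numpy as np


def affine_factorize(W, rank=3):
    """Factor a 2F x P track matrix W into motion (2F x rank) and shape (rank x P).

    Hypothetical helper: rows of W hold the x and y image coordinates of
    P tracked points over F frames, with no missing entries.
    """
    # Subtracting each row's mean removes the per-frame image translation.
    t = W.mean(axis=1, keepdims=True)
    Wc = W - t
    U, s, Vt = np.linalg.svd(Wc, full_matrices=False)
    M = U[:, :rank] * s[:rank]   # stacked affine cameras (up to a 3x3 ambiguity)
    S = Vt[:rank]                # 3D shape (up to the same ambiguity)
    # Energy left outside the rank-3 model; near zero for a single rigid motion.
    residual = np.linalg.norm(Wc - M @ S)
    return M, S, t, residual


def synthetic_tracks(num_frames=6, num_points=12, seed=0):
    """Project random 3D points with random affine cameras (illustrative data only)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(3, num_points))
    rows = []
    for _ in range(num_frames):
        A = rng.normal(size=(2, 3))   # affine camera for this frame
        b = rng.normal(size=(2, 1))   # image translation for this frame
        rows.append(A @ X + b)
    return np.vstack(rows)


# Tracks from one rigid object: the rank-3 model fits almost exactly.
W_one = synthetic_tracks(seed=0)
M, S, t, res_one = affine_factorize(W_one)

# Tracks from two objects with independent motions, stacked side by side:
# a single rank-3 model leaves a clearly non-zero residual.
W_two = np.hstack([synthetic_tracks(seed=1), synthetic_tracks(seed=2)])
_, _, _, res_two = affine_factorize(W_two)

print(f"single object residual: {res_one:.2e}")
print(f"two objects residual:   {res_two:.2e}")
```

In this sketch a low residual suggests that a set of tracks can be explained by one common motion. Real tracks are noisy, incomplete, and sometimes degenerate (for example near-planar structure or little rotation), which is why a robust variant is needed in practice.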

We illustrate the method on the feature film “Groundhog Day.” Examples are given for the retrieval of deforming objects (heads, walking people) and rigid objects (vehicles, locations).


Keywords: 3D object retrieval in videos, tracking affine covariant regions, independent motion segmentation, robust affine factorization