Movie/Script: Alignment and Parsing of Video and Text Transcription

  • Timothee Cour
  • Chris Jordan
  • Eleni Miltsakaki
  • Ben Taskar
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5305)


Movies and TV are a rich source of diverse and complex video of people, objects, actions and locales “in the wild”. Harvesting automatically labeled sequences of actions from video would enable the creation of large-scale and highly varied datasets. To enable such collection, we focus on the task of recovering scene structure in movies and TV series for object tracking and action retrieval. We present a weakly supervised algorithm that uses the screenplay and closed captions to parse a movie into a hierarchy of shots and scenes. Scene boundaries in the movie are aligned with screenplay scene labels, and shots are reordered into a sequence of long continuous tracks, or threads, which allow for more accurate tracking of people, actions and objects. Scene segmentation, alignment, and shot threading are formulated as inference in a unified generative model. We also introduce a novel hierarchical dynamic programming algorithm that handles alignment and jump-limited reorderings in linear time. We report quantitative and qualitative results on movie alignment and parsing, and use the recovered structure to improve character naming and retrieval of common actions in several episodes of popular TV series.
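
As a rough illustration of the alignment idea only, the sketch below aligns screenplay dialogue lines to closed-caption lines with plain dynamic time warping over a word-overlap distance. It is not the hierarchical algorithm described in the abstract: it runs in quadratic time and does not handle jump-limited reorderings or shot threading. The function names `jaccard_distance` and `align_monotonic`, the choice of Jaccard distance, and the sample strings are all assumptions made for this example.

```python
from typing import List, Tuple


def jaccard_distance(a: str, b: str) -> float:
    """Word-set distance in [0, 1]; 0 means identical lowercase word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 0.0
    return 1.0 - len(wa & wb) / len(wa | wb)


def align_monotonic(script: List[str], captions: List[str]) -> List[Tuple[int, int]]:
    """Return a monotonic warping path of (script_index, caption_index) pairs.

    Hypothetical helper: standard dynamic time warping, O(len(script) * len(captions)).
    """
    n, m = len(script), len(captions)
    if n == 0 or m == 0:
        return []
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i script lines
    # with the first j caption lines.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = jaccard_distance(script[i - 1], captions[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the end, always stepping to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


if __name__ == "__main__":
    # Made-up sample lines; one screenplay line is split across two captions.
    script = ["hello there buffy", "we need to talk", "the library at midnight"]
    captions = ["Hello there Buffy", "we need", "to talk", "the library at midnight"]
    for s_idx, c_idx in align_monotonic(script, captions):
        print(f"script[{s_idx}] <-> captions[{c_idx}]")
```

Because closed captions carry timestamps and screenplay lines carry scene labels, a path like this could in principle be used to transfer scene labels onto the video timeline, which is the role the abstract assigns to screenplay and caption alignment.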


Travel Salesman Problem, Hamiltonian Path, Face Track, Scene Segmentation, Action Retrieval
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Supplementary material

978-3-540-88693-8_12_MOESM1_ESM.mpg (9.8 MB)



Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Timothee Cour (1)
  • Chris Jordan (1)
  • Eleni Miltsakaki (1)
  • Ben Taskar (1)

  1. University of Pennsylvania, Philadelphia, PA, USA
