Aligning plot synopses to videos for story-based retrieval

  • Makarand Tapaswi
  • Martin Bäuml
  • Rainer Stiefelhagen
Regular Paper


We propose a method to facilitate search through the storyline of TV series episodes. To this end, we use human-written, crowdsourced descriptions of the story conveyed in the video, known as plot synopses. We obtain such synopses from websites such as Wikipedia and propose several methods to align each sentence of the plot synopsis to shots in the video. This transforms the semantic story-based video retrieval problem into a much simpler text-based search: we return the set of shots aligned to the matching sentences as the video snippet corresponding to the query. The alignment is performed by first computing a similarity score between every shot and every sentence using cues such as character identities and keyword matches between plot synopses and subtitles. We then formulate the alignment as an optimization problem and solve it efficiently using dynamic programming. We evaluate our methods on the fifth season of the TV series Buffy the Vampire Slayer and show encouraging results for both the alignment and the retrieval of story events.
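To illustrate the kind of dynamic program the abstract refers to, here is a minimal sketch in Python. It is not the authors' formulation: it assumes a precomputed similarity matrix `sim` (sentence by shot), assumes that each shot is assigned to exactly one sentence, that assignments follow story order, and that every sentence receives at least one shot. The function names `align_shots_to_sentences` and `shots_for_sentence` are hypothetical.

```python
import math


def align_shots_to_sentences(sim):
    """Assign each shot to one sentence, maximizing total similarity.

    sim[i][j] is an (assumed, precomputed) similarity between sentence i
    and shot j. The assignment is monotone: moving forward through the
    shots, the sentence index either stays the same or advances by one.
    Returns a list with one sentence index per shot.
    """
    n, m = len(sim), len(sim[0])
    if m < n:
        raise ValueError("need at least as many shots as sentences")

    neg = -math.inf
    # score[i][j]: best total for shots 0..j with shot j assigned to sentence i
    score = [[neg] * m for _ in range(n)]
    # advanced[i][j]: True if shot j-1 was assigned to sentence i-1
    advanced = [[False] * m for _ in range(n)]
    score[0][0] = sim[0][0]

    for j in range(1, m):
        # sentence i is only reachable at shot j if i <= j
        for i in range(min(j, n - 1) + 1):
            stay = score[i][j - 1]
            move = score[i - 1][j - 1] if i > 0 else neg
            if move > stay:
                score[i][j] = move + sim[i][j]
                advanced[i][j] = True
            else:
                score[i][j] = stay + sim[i][j]

    # Backtrace from the last shot, which must belong to the last sentence.
    assignment = [0] * m
    i = n - 1
    for j in range(m - 1, -1, -1):
        assignment[j] = i
        if advanced[i][j]:
            i -= 1
    return assignment


def shots_for_sentence(assignment, sentence_index):
    """Return the shot indices aligned to one sentence (the retrieved snippet)."""
    return [j for j, s in enumerate(assignment) if s == sentence_index]


if __name__ == "__main__":
    # Toy similarity matrix: 3 sentences, 5 shots (made-up values).
    sim = [
        [0.9, 0.8, 0.1, 0.0, 0.1],
        [0.1, 0.2, 0.7, 0.1, 0.0],
        [0.0, 0.1, 0.2, 0.9, 0.8],
    ]
    assignment = align_shots_to_sentences(sim)
    print(assignment)
    print(shots_for_sentence(assignment, 2))
```

Under these assumptions, retrieval reduces to finding the synopsis sentence that matches a text query and returning `shots_for_sentence` for that sentence. The table has one cell per sentence and shot pair, so the run time grows with the product of the two counts.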


Story-based retrieval, Text-video alignment, Plot synopsis, TV series



This work was funded by the Deutsche Forschungsgemeinschaft (DFG — German Research Foundation) under contract no. STI-598/2-1. The views expressed herein are the authors’ responsibility and do not necessarily reflect those of DFG.



Copyright information

© Springer-Verlag London 2014

Authors and Affiliations

  • Makarand Tapaswi (1)
  • Martin Bäuml (1)
  • Rainer Stiefelhagen (1)
  1. Computer Vision for Human Computer Interaction Lab, Karlsruhe Institute of Technology, Karlsruhe, Germany
