Abstract
We propose a method to facilitate search through the storyline of TV series episodes. To this end, we use human-written, crowdsourced descriptions of the story conveyed in the video, known as plot synopses. We obtain such synopses from websites such as Wikipedia and propose various methods to align each sentence of the plot to shots in the video. Thus, the semantic story-based video retrieval problem is transformed into a much simpler text-based search. Finally, we return the set of shots aligned to the matching sentences as the video snippet corresponding to the query. The alignment is performed by first computing a similarity score between every shot and sentence through cues such as character identities and keyword matches between plot synopses and subtitles. We then formulate the alignment as an optimization problem and solve it efficiently using dynamic programming. We evaluate our methods on the fifth season of the TV series Buffy the Vampire Slayer and show encouraging results for both the alignment and the retrieval of story events.
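To make the alignment step concrete, the following is a minimal sketch of a monotonic dynamic-programming alignment, not the authors' exact formulation. It assumes a precomputed similarity matrix `sim[i][j]` between sentence `i` and shot `j`, assigns every shot to exactly one sentence, keeps sentence order consistent with shot order, and gives each sentence at least one shot. The function names `align_shots_to_sentences` and `shots_for_sentence` are hypothetical and chosen only for illustration.

```python
from typing import List


def align_shots_to_sentences(sim: List[List[float]]) -> List[int]:
    """Return, for each shot, the index of the sentence it is aligned to.

    sim[i][j] is an assumed precomputed similarity between sentence i and
    shot j. The alignment maximizes the summed similarity under a monotonic
    constraint: sentence indices never decrease as shots advance.
    """
    n_sent, n_shot = len(sim), len(sim[0])
    if n_shot < n_sent:
        raise ValueError("need at least one shot per sentence")

    neg_inf = float("-inf")
    # score[i][j]: best total similarity with shot j assigned to sentence i.
    score = [[neg_inf] * n_shot for _ in range(n_sent)]
    # advanced[i][j]: True if shot j-1 belonged to sentence i-1 on the best path.
    advanced = [[False] * n_shot for _ in range(n_sent)]

    score[0][0] = sim[0][0]
    for j in range(1, n_shot):
        score[0][j] = score[0][j - 1] + sim[0][j]

    for i in range(1, n_sent):
        for j in range(i, n_shot):
            stay = score[i][j - 1]         # previous shot in the same sentence
            move = score[i - 1][j - 1]     # previous shot in the previous sentence
            advanced[i][j] = move > stay
            score[i][j] = sim[i][j] + max(stay, move)

    # Backtrack from the last shot, which must belong to the last sentence.
    assignment = [0] * n_shot
    i = n_sent - 1
    for j in range(n_shot - 1, -1, -1):
        assignment[j] = i
        if advanced[i][j]:
            i -= 1
    return assignment


def shots_for_sentence(assignment: List[int], sentence: int) -> List[int]:
    """Retrieval step: shots aligned to a sentence that matched the text query."""
    return [j for j, s in enumerate(assignment) if s == sentence]


# Toy example: 3 sentences, 5 shots.
sim = [
    [0.9, 0.8, 0.1, 0.0, 0.1],
    [0.1, 0.2, 0.9, 0.7, 0.2],
    [0.0, 0.1, 0.2, 0.3, 0.8],
]
assignment = align_shots_to_sentences(sim)  # [0, 0, 1, 1, 2]
```

Once the assignment is known, answering a query reduces to finding the matching sentence with a text search and calling `shots_for_sentence` on its index.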
Notes
For \(z \sim 100\), \(N_S \sim 40\) and \(N_T \sim 700\) DTW3 takes a couple of minutes to solve with our unoptimized Matlab implementation.
Acknowledgments
This work was funded by the Deutsche Forschungsgemeinschaft (DFG — German Research Foundation) under contract no. STI-598/2-1. The views expressed herein are the authors’ responsibility and do not necessarily reflect those of DFG.
Cite this article
Tapaswi, M., Bäuml, M. & Stiefelhagen, R. Aligning plot synopses to videos for story-based retrieval. Int J Multimed Info Retr 4, 3–16 (2015). https://doi.org/10.1007/s13735-014-0065-9
DOI: https://doi.org/10.1007/s13735-014-0065-9