M-VAD Names: a dataset for video captioning with naming

  • Stefano Pini
  • Marcella Cornia
  • Federico Bolelli
  • Lorenzo Baraldi
  • Rita Cucchiara
Article

Abstract

Current movie captioning architectures cannot mention characters by their proper names, replacing them instead with a generic “someone” tag. The lack of movie description datasets with visual annotations of characters plays a relevant role in this shortcoming. Recently, we proposed to extend the M-VAD dataset by introducing such information. In this paper, we present an improved version of the dataset, namely M-VAD Names, and its semi-automatic annotation procedure. The resulting dataset contains 63k visual tracks and 34k textual mentions, all associated with character identities. To showcase the features of the dataset and quantify the complexity of the naming task, we investigate multimodal architectures that replace the “someone” tags with proper character names in existing video captions. The evaluation is further extended by testing this approach on videos outside the M-VAD Names dataset.
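To make the naming task concrete, here is a minimal sketch, in Python with NumPy, of the kind of “someone” replacement it involves: each face track in a clip is matched to a known character by embedding similarity, and the matched names fill the “someone” slots in the caption. This is not the multimodal architecture investigated in the paper; the function, the nearest-neighbour matching rule, and the toy embeddings are all illustrative assumptions.

    import numpy as np

    def name_someones(caption, track_embeddings, character_bank):
        """Illustrative baseline: replace each "someone" in the caption
        with the character whose reference embedding is most similar
        (cosine similarity) to a face track detected in the clip.

        caption          : str containing "someone" tags
        track_embeddings : (n_tracks, d) array, one embedding per face track
        character_bank   : dict mapping character name -> (d,) embedding
        """
        names = list(character_bank)
        refs = np.stack([character_bank[n] for n in names])          # (n_chars, d)
        # L2-normalise so the dot product equals cosine similarity.
        t = track_embeddings / np.linalg.norm(track_embeddings, axis=1, keepdims=True)
        r = refs / np.linalg.norm(refs, axis=1, keepdims=True)
        best = (t @ r.T).argmax(axis=1)      # most similar character per track
        out, i = [], 0
        for word in caption.split():
            if word.lower().startswith("someone") and i < len(best):
                out.append(names[best[i]])   # fill the slot with a name
                i += 1
            else:
                out.append(word)
        return " ".join(out)

    # Toy usage with random 128-d embeddings for two hypothetical characters.
    rng = np.random.default_rng(0)
    bank = {"Rick": rng.normal(size=128), "Ilsa": rng.normal(size=128)}
    tracks = np.stack([bank["Ilsa"] + 0.1 * rng.normal(size=128)])   # one noisy track
    print(name_someones("someone walks into the bar", tracks, bank))
    # -> "Ilsa walks into the bar"

A real system must also decide which visual track corresponds to which textual mention when several characters appear in the same clip, which is part of what makes the task non-trivial.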

Keywords

Video captioning · Naming · Dataset · Deep learning

Acknowledgements

We acknowledge Carmen Sabia and Luca Bergamini for supporting us during the annotation of the M-VAD Names dataset. We also gratefully acknowledge Facebook AI Research and Panasonic Corporation for the donation of the GPUs used in this work.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Engineering “Enzo Ferrari”, University of Modena and Reggio Emilia, Modena, Italy