
M-VAD names: a dataset for video captioning with naming

Abstract

Current movie captioning architectures are unable to mention characters by their proper names, replacing them with a generic “someone” tag. The lack of movie description datasets with visual annotations of characters surely plays a relevant role in this shortcoming. Recently, we proposed to extend the M-VAD dataset by introducing such information. In this paper, we present an improved version of the dataset, namely M-VAD Names, and its semi-automatic annotation procedure. The resulting dataset contains 63k visual tracks and 34k textual mentions, all associated with character identities. To showcase the features of the dataset and quantify the complexity of the naming task, we investigate multimodal architectures that replace the “someone” tags with proper character names in existing video captions. The evaluation is further extended by testing this application on videos outside of the M-VAD Names dataset.
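To make the naming task concrete, the following is a minimal, hypothetical sketch (not the architecture proposed in the paper): each “someone” tag in a caption is paired with the face-track embedding of the corresponding visual track, and the tag is replaced by the character whose prototype embedding is most similar. The function names, the use of per-character mean prototypes, and cosine similarity over L2-normalized descriptors are all illustrative assumptions.

```python
import numpy as np

def name_someones(caption, track_embeddings, char_prototypes):
    """Replace each '<someone>' tag with the best-matching character name.

    caption:          string with one '<someone>' tag per visual track,
                      in the same order as track_embeddings.
    track_embeddings: list of L2-normalized face-track descriptors.
    char_prototypes:  dict mapping character name to an L2-normalized
                      prototype (e.g. mean face descriptor) -- a
                      simplifying assumption for this sketch.
    """
    names = list(char_prototypes)
    protos = np.stack([char_prototypes[n] for n in names])  # (C, D)
    out = caption
    for emb in track_embeddings:
        # Dot product equals cosine similarity for unit-norm vectors.
        scores = protos @ emb
        best = names[int(np.argmax(scores))]
        # Replace only the first remaining tag, left to right.
        out = out.replace("<someone>", best, 1)
    return out
```

A tag-to-identity assignment like this resolves each mention independently; a full system would also need to handle captions whose tracks and mentions are not in one-to-one correspondence, for instance via bipartite matching.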


Figs. 1–6

Notes

  1. The proposed dataset is publicly available at https://github.com/aimagelab/mvad-names-dataset.

  2. https://spacy.io


Acknowledgements

We acknowledge Carmen Sabia and Luca Bergamini for supporting us during the annotation of the M-VAD Names dataset. We also gratefully acknowledge Facebook AI Research and Panasonic Corporation for the donation of the GPUs used in this work.

Author information

Correspondence to Marcella Cornia.



About this article


Cite this article

Pini, S., Cornia, M., Bolelli, F. et al. M-VAD names: a dataset for video captioning with naming. Multimed Tools Appl 78, 14007–14027 (2019). https://doi.org/10.1007/s11042-018-7040-z


Keywords

  • Video captioning
  • Naming
  • Dataset
  • Deep learning