Towards Video Captioning with Naming: A Novel Dataset and a Multi-modal Approach

  • Stefano Pini
  • Marcella Cornia
  • Lorenzo Baraldi
  • Rita Cucchiara
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10485)


Current approaches to movie description cannot refer to characters by their proper names and can only indicate people with a generic “someone” tag. In this paper, we present two contributions towards video description architectures with naming capabilities. First, we collect and release an extension of the popular Montreal Video Annotation Dataset in which the visual appearance of each character is linked both through time and to textual mentions in captions. We annotate, in a semi-automatic manner, a total of 53k face tracks and 29k textual mentions on 92 movies. Second, to underline and quantify the challenges of generating captions with names, we present different multi-modal approaches to solve the problem on already generated captions.


Keywords: Video captioning, Naming, Datasets, Deep learning



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Stefano Pini (1)
  • Marcella Cornia (1), email author
  • Lorenzo Baraldi (1)
  • Rita Cucchiara (1)
  1. Dipartimento di Ingegneria “Enzo Ferrari”, Università degli Studi di Modena e Reggio Emilia, Modena, Italy
