Improving the Quality of Video-to-Language Models by Optimizing Annotation of the Training Material

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10704)


Automatic video captioning is one of the ultimate challenges of Natural Language Processing, spurred by the omnipresence of video and the release of large-scale annotated video benchmarks. However, the specificity and quality of these captions vary considerably, which adversely affects the quality of the trained captioning models. In this work, we address this issue by proposing automatic strategies for optimizing the annotations of video material: removing annotations that are not semantically relevant and generating new, more informative captions. We evaluate our approach on the MSR-VTT challenge with a state-of-the-art deep learning video-to-language model. Our code is available at
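The paper itself details the annotation-optimization strategies; as a rough illustration of the first one (removing captions that are not semantically relevant to the rest of a video's annotations), here is a minimal sketch. It is a hypothetical stand-in: it scores each caption by its mean similarity to the other captions of the same video using a simple bag-of-words cosine, whereas the actual system relies on a proper semantic sentence-similarity measure, and the threshold value is arbitrary.

```python
from collections import Counter
import math

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def filter_captions(captions, threshold=0.22):
    """Drop captions whose mean similarity to the other captions of the
    same video falls below a threshold (hypothetical value)."""
    bows = [Counter(c.lower().split()) for c in captions]
    kept = []
    for i, cap in enumerate(captions):
        others = [b for j, b in enumerate(bows) if j != i]
        if not others:  # a single caption is trivially kept
            kept.append(cap)
            continue
        mean_sim = sum(cosine(bows[i], b) for b in others) / len(others)
        if mean_sim >= threshold:
            kept.append(cap)
    return kept

caps = [
    "a man is playing a guitar on stage",
    "a person plays guitar for an audience",
    "cute kittens sleeping in a basket",  # semantically unrelated outlier
]
print(filter_captions(caps))  # drops the kitten caption
```

In the actual pipeline, the bag-of-words cosine would be replaced by a sentence-level semantic similarity model, but the filtering logic (score each caption against its peers, discard low scorers) is the same shape.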


Video-to-language · Video captioning · Video understanding · Text annotation optimization · Semantic sentence similarity



This work is partly supported by the Spanish Ministry of Economy and Competitiveness under the Ramón y Cajal fellowships, and the Kristina project funded by the European Union Horizon 2020 research and innovation programme under grant agreement No 645012. The Titan X GPU used for this research was donated by the NVIDIA Corporation.



Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  1. Department of Information and Communication Technologies, Pompeu Fabra University, Barcelona, Spain
  2. Catalan Institute for Research and Advanced Studies (ICREA), Barcelona, Spain
