Skip to main content

MSVD-Turkish: a comprehensive multimodal video dataset for integrated vision and language research in Turkish

Abstract

Automatic generation of video descriptions in natural language, also called video captioning, aims to understand the visual content of the video and produce a natural language sentence depicting the objects and actions in the scene. This challenging integrated vision and language problem, however, has been predominantly addressed for English. The lack of data and the linguistic properties of other languages limit the success of existing approaches for such languages. In this paper we target Turkish, a morphologically rich and agglutinative language that has very different properties compared to English. To do so, we create the first large-scale video captioning dataset for this language by carefully translating the English descriptions of the videos in the MSVD (Microsoft Research Video Description Corpus) dataset into Turkish. In addition to enabling research in video captioning in Turkish, the parallel English–Turkish descriptions also enable the study of the role of video context in (multimodal) machine translation. In our experiments, we build models for both video captioning and multimodal machine translation and investigate the effect of different word segmentation approaches and different neural architectures to better address the properties of Turkish. We hope that the MSVD-Turkish dataset and the results reported in this work will lead to better video captioning and multimodal machine translation models for Turkish and other morphology rich and agglutinative languages.

This is a preview of subscription content, access via your institution.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8

Notes

  1. Note that the original MSVD dataset also contains annotations obtained for many different languages. The number of these multilingual descriptions is, however, very low compared to the number of original English descriptions. Moreover, these descriptions were not shared with the community.

  2. We make our dataset publicly available at https://hucvl.github.io/MSVD-Turkish/.

  3. Pre-processing consists of lowercasing, length filtering with minimum token count set to 5, punctuation removal and deduplication.

  4. It should be noted we reused the hyper-parameters from the NMT experiments and did not conduct a hyper-parameter search for our MMT models.

References

  • Aafaq N, Mian A, Liu W, Gilani SZ, Shah M (2019) Video description: a survey of methods, datasets and evaluation metrics and description. ACM Comput Surv 52(6):1–37

    Article  Google Scholar 

  • Akin AA, Akin MD (2007) Zemberek, an open source NLP framework for Turkic languages

  • Bahdanau D, Cho K, Bengio Y (2015) Neural machine translation by jointly learning to align and translate. In: Proc. International Conference on Learning Representations (ICLR), San Diego, California, USA, arXiv:1409.0473

  • Barbu A, Bridge A, Burchill Z, Coroian D, Dickinson S, Fidler S, Michaux A, Mussman S, Narayanaswamy S, Salvi D, Schmidt L, Shangguan J, Siskind JM, Waggoner J, Wang S, Wei J, Yin Y, Zhang Z (2012) Video in sentences out. In: Proc. 28th Conference on Uncertainty in Artificial Intelligence (UAI2012), Catalina Island, California, USA, arXiv:1204.2742,

  • Barrault L, Bougares F, Specia L, Lala C, Elliott D, Frank S (2018) Findings of the third shared task on multimodal machine translation. In: Proc. Third Conference on Machine Translation, Volume 2: Shared Task Papers, Association for Computational Linguistics, Brussels, Belgium, pp 308–327, http://www.aclweb.org/anthology/W18-6402

  • Caglayan O (2019) Multimodal Machine Translation. Theses, Université du Maine, https://tel.archives-ouvertes.fr/tel-02309868

  • Caglayan O, Aransa W, Bardet A, García-Martínez M, Bougares F, Barrault L, Masana M, Herranz L, van de Weijer J (2017a) LIUM-CVC submissions for WMT17 multimodal translation task. In: Proc. Second Conference on Machine Translation, Volume 2: Shared Task Papers, Association for Computational Linguistics, Copenhagen, Denmark, pp 432–439, http://www.aclweb.org/anthology/W17-4746

  • Caglayan O, Aransa W, Wang Y, Masana M, García-Martínez M, Bougares F, Barrault L, van de Weijer J (2016a) Does multimodality help human and machine for translation and image captioning? In: Proc. First Conference on Machine Translation, Association for Computational Linguistics, Berlin, Germany, pp 627–633, http://www.aclweb.org/anthology/W/W16/W16-2358

  • Caglayan O, Barrault L, Bougares F (2016b) Multimodal attention for neural machine translation. Computing Research Repository arXiv:1609.03976

  • Caglayan O, García-Martínez M, Bardet A, Aransa W, Bougares F, Barrault L (2017b) NMTPY: A flexible toolkit for advanced neural machine translation systems. Prague Bull Math Linguistics 109:15–28. https://doi.org/10.1515/pralin-2017-0035, https://ufal.mff.cuni.cz/pbml/109/art-caglayan-et-al.pdf

  • Calixto I, Elliott D, Frank S (2016) DCU-UvA multimodal MT system report. In: Proceedings of the First Conference on Machine Translation, Association for Computational Linguistics, Berlin, Germany, pp 634–638, http://www.aclweb.org/anthology/W/W16/W16-2359

  • Calixto I, Liu Q (2017) Incorporating global visual features into attention-based neural machine translation. In: Proc. Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Copenhagen, Denmark, pp 992–1003, https://www.aclweb.org/anthology/D17-1105

  • Chen D, Dolan W (2011) Collecting highly parallel data for paraphrase evaluation. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL): Human Language Technologies, Association for Computational Linguistics, Portland, Oregon, USA, pp 190–200, https://www.aclweb.org/anthology/P11-1020

  • Chen Y, Wang S, Zhang W, Huang Q (2018) Less is more: Picking informative frames for video captioning. In: Proc. European Conference on Computer Vision (ECCV), Munich, Germany, pp 367–384

  • Cho K, van Merrienboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using rnn encoder–decoder for statistical machine translation. In: Proc. Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Doha, Qatar, pp 1724–1734, http://www.aclweb.org/anthology/D14-1179

  • Das P, Xu C, Doell R, Corso J (2013) A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching. In: Proc. IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), Portland, Oregon, USA, pp 2634–2641

  • Denkowski M, Lavie A (2014) Meteor universal: Language specific translation evaluation for any target language. In: Proc. 9th Workshop on Statistical Machine Translation, Association for Computational Linguistics, Baltimore, Maryland, USA, pp 376–380, https://doi.org/10.3115/v1/W14-3348,

  • Donahue J, Hendricks LA, Guadarrama S, Rohrbach M, Venugopalan S, Saenko K, Darrell T (2015) Long-term recurrent convolutional networks for visual recognition and description. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, Massachusetts, USA, arXiv:1411.4389

  • Elliott D, Frank S, Barrault L, Bougares F, Specia L (2017) Findings of the second shared task on multimodal machine translation and multilingual image description. In: Proc. 2nd Conference on Machine Translation, Volume 2: Shared Task Papers, Association for Computational Linguistics, Copenhagen, Denmark, pp 215–233, http://www.aclweb.org/anthology/W17-4718

  • Elliott D, Frank S, Sima’an K, Specia L (2016) Multi30k: Multilingual english-german image descriptions. In: Proc. 5th Workshop on Vision and Language, Association for Computational Linguistics, Berlin, Germany, pp 70–74, http://anthology.aclweb.org/W16-3210

  • Elliott D, Kádár À (2017) Imagination improves multimodal translation. In: Proc. 8th International Joint Conference on Natural Language Processing (IJCNLP) (Volume 1: Long Papers), Asian Federation of Natural Language Processing, Taipei, Taiwan, pp 130–141, http://aclweb.org/anthology/I17-1014

  • Gella S, Lewis M, Rohrbach M (2018) A dataset for telling the stories of social media videos. In: Proc. Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Brussels, Belgium, pp 968–974, https://doi.org/10.18653/v1/D18-1117,

  • Guadarrama S, Krishnamoorthy N, Malkarnenkar G, Venugopalan S, Mooney R, Darrell T, Saenko K (2013) Youtube2text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In: Proc. IEEE International Conference on Computer Vision (ICCV), Sydney, Australia, pp 2712–2719, http://www.cs.utexas.edu/users/ai-labpub-view.php?PubID=127409

  • Hakeem A, Sheikh Y, Shah M (2004) \(case^e\): A hierarchical event representation for the analysis of videos. Proc. Association for the Advancement of Artificial Intelligence (AAAI), San Jose, California, USA, pp 263–268

  • Hanckmann P, Schutte K, Burghouts GJ (2012) Automated textual descriptions for a wide range of video events with 48 human actions. In: Proc. European Conference on Computer Vision (ECCV), Firenze, Italy

  • He K, Xiangyu Z, Shaoqing R, Sun J (2016) Deep residual learning for image recognition. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, Nevada, USA, pp 770–778, https://doi.org/10.1109/CVPR.2016.90

  • Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780

    Article  Google Scholar 

  • Huang PY, Liu F, Shiang SR, Oh J, Dyer C (2016) Attention-based multimodal neural machine translation. In: Proc. First Conference on Machine Translation, Association for Computational Linguistics, Berlin, Germany, pp 639–645, http://www.aclweb.org/anthology/W/W16/W16-2360

  • Inan H, Khosravi K, Socher R (2016) Tying word vectors and word classifiers: A loss framework for language modeling. CoRR arXiv:1611.01462,

  • Kingma D, Ba J (2014) Adam: A method for stochastic optimization. CoRR arXiv:1412.6980

  • Kojima A, Tamura T, Fukunaga K (2012) Natural language description of human activities from video images based on concept hierarchy of actions. Int J Comput Vision (IJCV) 50(2):171–184

    Article  Google Scholar 

  • Krippendorff K (1970) Estimating the reliability, systematic error and random error of interval data. Educ Psychol Measur 30(1):61–70. https://doi.org/10.1177/001316447003000105

    Article  Google Scholar 

  • Krishna R, Hata K, Ren F, Li F, Niebles JC (2017) Dense-captioning events in videos. arXiv:1705.00754,

  • Krishnamoorthy N, Malkarnenkar G, Mooney R, Saenko K, Guadarrama S (2013) Generating natural-language video descriptions using text-mined knowledge. In: Proc. Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL): Human Language Technologies, Association for Computational Linguistics, Atlanta, Georgia, USA, pp 10–19, https://www.aclweb.org/anthology/W13-1302

  • Kudo T, Richardson J (2018) Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In: Proc. Conference on Empirical Methods in Natural Language Processing (EMNLP): System Demonstrations, Association for Computational Linguistics, Brussels, Belgium, pp 66–71, http://www.aclweb.org/anthology/D18-2012

  • Libovický J, Helcl J (2017) Attention strategies for multi-source sequence-to-sequence learning. In: Proc. 55th Annual Meeting of the Association for Computational Linguistics (ACL) (Volume 2: Short Papers), Association for Computational Linguistics, Vancouver, Canada, pp 196–202, https://doi.org/10.18653/v1/P17-2031

  • Lin CY (2004) ROUGE: A package for automatic evaluation of summaries. Proc. Annual Meeting of the Association for Computational Linguistics (ACL), Barcelona, Spain, pp 74–81

  • Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft COCO: Common objects in context. In: Proc. European Conference on Computer Vision (ECCV), Springer, Zurich, Switzerland, pp 740–755

  • Li Y, Song Y, Cao L, Tetreault JR, Goldberg L, Jaimes A, Luo J (2016) TGIF: A new dataset and benchmark on animated GIF description. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, Nevada, USA, arXiv:1604.02748,

  • Ott M, Edunov S, Baevski A, Fan A, Gross S, Ng N, Grangier D, Auli M (2019) fairseq: A Fast, Extensible Toolkit for Sequence Modeling. In: Proc. Conference of the North American Chapter of the Association for Computational Linguistics (NAACL): Human Language Technologies, Minneapolis, Minnesota, USA

  • Papineni K, Roukos S, Ward T, Zhu WJ (2002) Bleu: A method for automatic evaluation of machine translation. In: Proc. 40th Annual Meeting on Association for Computational Linguistics, Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, pp 311–318, https://doi.org/10.3115/1073083.1073135

  • Plummer BA, Wang L, Cervantes CM, Caicedo JC, Hockenmaier J, Lazebnik S (2015) Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In: Proc. IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, pp 2641–2649, https://doi.org/10.1109/ICCV.2015.303

  • Post M (2018) A call for clarity in reporting BLEU scores. In: Proceedings of the Third Conference on Machine Translation: Research Papers, Association for Computational Linguistics, Brussels, Belgium, pp 186–191, https://doi.org/10.18653/v1/W18-6319

  • Press O, Wolf L (2017) Using the output embedding to improve language models. In: Proc. 15th Conference of the European Chapter of the Association for Computational Linguistics, Valencia, Spain, arXiv:1608.05859,

  • Qi P, Zhang Y, Zhang Y, Bolton J, Manning CD (2020) Stanza: A Python natural language processing toolkit for many human languages. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, https://nlp.stanford.edu/pubs/qi2020stanza.pdf

  • Regneri M, Rohrbach M, Wetzel D, Thater S, Schiele B, Pinkal M (2013) Grounding action descriptions in videos. Trans Assoc Comput Linguist 1:25–36

    Article  Google Scholar 

  • Rohrbach M, Amin S, Andriluka M, Schiele B (2012) A database for fine grained activity detection of cooking activities. In: Proc. IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society, Providence, Rhode Island, USA, pp 1194–1201

  • Rohrbach M, Qiu W, Titov I, Thater S, Pinkal M, Schiele B (2013) Translating video content to natural language descriptions. In: Proc. IEEE International Conference on Computer Vision (ICCV), IEEE Computer Society, Sydney, Australia, pp 433–440

  • Rohrbach A, Rohrbach M, Tandon N, Schiele B (2015) A dataset for movie description. In: Proc. IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), Boston, Massachusetts, USA, pp 3202–3212, https://doi.org/10.1109/CVPR.2015.7298940

  • Rohrbach A, Torabi A, Rohrbach M, Tandon N, Pal C, Larochelle H, Courville A, Schiele B (2017) Movie description. Int J Comput Vision 123(1):94–120. https://doi.org/10.1007/s11263-016-0987-1

    Article  Google Scholar 

  • Senina A, Rohrbach M, Qiu W, Friedrich A, Amin S, Andriluka M, Pinkal M, Schiele B (2014) Coherent multi-sentence video description with variable level of detail. arXiv:1403.6173

  • Sennrich R, Haddow B, Birch A (2016) Neural machine translation of rare words with subword units. In: Proc. 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Berlin, Germany, pp 1715–1725, http://www.aclweb.org/anthology/P16-1162

  • Sennrich R, Firat O, Cho K, Birch-Mayne A, Haddow B, Hitschler J, Junczys-Dowmunt M, Läubli S, Miceli Barone A, Mokry J, Nadejde M (2017) Nematus: a toolkit for neural machine translation. In: Proceedings of the EACL 2017 Software Demonstrations, Association for Computational Linguistics (ACL), Valencia, Spain, pp 65–68

  • Sigurdsson GA, Varol G, Wang X, Farhadi A, Laptev I, Gupta A (2016) Hollywood in homes: Crowdsourcing data collection for activity understanding. arXiv:1604.01753,

  • Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, cite arxiv:1409.1556

  • Specia L, Frank S, Sima’an K, Elliott D (2016) A shared task on multimodal machine translation and crosslingual image description. In: Proc. of the First Conference on Machine Translation, Association for Computational Linguistics, Berlin, Germany, pp 543–553, http://www.aclweb.org/anthology/W/W16/W16-2346

  • Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: A simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958

    MathSciNet  MATH  Google Scholar 

  • Srivastava N, Mansimov E, Salakhutdinov R (2015) Unsupervised learning of video representations using lstms. arXiv:1502.04681,

  • Sulubacak U, Caglayan O, Grönroos SA, Rouhe A, Elliott D, Specia L, Tiedemann J (2019) Multimodal machine translation through visuals and speech. arXiv:191112798

  • Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. In: Proc. 27th International Conference on Neural Information Processing Systems (NeurIPS), MIT Press, Montreal, Canada, pp 3104–3112, http://dl.acm.org/citation.cfm?id=2969033.2969173

  • Thomason J, Venugopalan S, Guadarrama S, Saenko K, Mooney R (2014) Integrating language and vision to generate natural language descriptions of videos in the wild. In: Proc. International Conference on Computational Linguistics (COLING): Technical Papers, Dublin, Ireland, pp 1218–1227, https://www.aclweb.org/anthology/C14-1115

  • Torabi A, Pal CJ, Larochelle H, Courville AC (2015) Using descriptive video services to create a large data source for video annotation research. arXiv:1503.01070,

  • Unal ME, Citamak B, Yagcioglu S, Erdem A, Erdem E, Cinbis NI, Cakici R (2016) TasvirEt: A benchmark dataset for automatic Turkish description generation from images. In: Proc. 24th Signal Processing and Communication Application Conference (SIU), pp 1977–1980, https://doi.org/10.1109/SIU.2016.7496155

  • Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Lu, Polosukhin I (2017) Attention is all you need. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in Neural Information Processing Systems 30, Curran Associates, Inc., pp 5998–6008, http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf

  • Vaswani A, Bengio S, Brevdo E, Chollet F, Gomez AN, Gouws S, Jones L, Kaiser L, Kalchbrenner N, Parmar N, Sepassi R, Shazeer N, Uszkoreit J (2018) Tensor2tensor for neural machine translation. arXiv:1803.07416

  • Vedantam R, Lawrence Zitnick C, Parikh D (2015) Cider: Consensus-based image description evaluation. In: Proc. IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), Boston, Massachusetts, USA, pp 4566–4575

  • Venugopalan S, Rohrbach M, Donahue J, Mooney R, Darrell T, Saenko K (2015) Sequence to Sequence-Video to Text. In: Proc. IEEE International Conference on Computer Vision (ICCV), Santiago, Chile

  • Wang X, Chen W, Wu J, Wang Y, Wang WY (2018) Video captioning via hierarchical reinforcement learning. In: Proc. IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, Utah, United States, arXiv:1711.11135

  • Wang X, Wu J, Chen J, Li L, Wang YF, Wang WY (2019) Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. In: Proc. IEEE International Conference on Computer Vision (ICCV), Seoul, Korea

  • Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhudinov R, Zemel R, Bengio Y (2015) Show, attend and tell: Neural image caption generation with visual attention. In: Proc. 32nd International Conference on Machine Learning (ICML), JMLR Workshop and Conference Proceedings, Lille, France, pp 2048–2057, http://jmlr.org/proceedings/papers/v37/xuc15.pdf

  • Xu J, Mei T, Yao T, Rui Y (2016) MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. In: Proc. IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, Nevada, USA

  • Yao L, Torabi A, Cho K, Ballas N, Pal C, Larochelle H, Courville A (2015) Describing videos by exploiting temporal structure. In: Proc. IEEE International Conference on Computer Vision (ICCV), Santiago, Chile

  • Yoshikawa Y, Shigeto Y, Takeuchi A (2017) STAIR captions: Constructing a large-scale japanese image caption dataset. In: Proc. 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada, arXiv:1705.00823,

  • Yu H, Wang J, Huang Z, Yang Y, Xu W (2016) Video paragraph captioning using hierarchical recurrent neural networks. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, Nevada, USA, arXiv:1510.07712,

  • Zeng K, Chen T, Niebles JC, Sun M (2016) Title generation for user generated videos. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, Nevada, USA, arXiv:1608.07068,

  • Zhou L, Kalantidis Y, Chen X, Corso JJ, Rohrbach M (2019) Grounded video description. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, California, USA, arXiv:1812.06587

  • Zhou L, Xu C, Corso JJ (2017) Procnets: Learning to segment procedures in untrimmed and unconstrained videos. arXiv:1703.09788

Download references

Acknowledgements

This work was supported in part by GEBIP 2018 Award of the Turkish Academy of Sciences to E. Erdem, BAGEP 2021 Award of the Science Academy to A. Erdem, and the MMVC project funded by TUBITAK and the British Council via the Newton Fund Institutional Links grant programme (Grant ID 219E054 and 352343575). Lucia Specia, Pranava Madhyastha and Ozan Caglayan also received support from MultiMT (H2020 ERC Starting Grant No. 678017).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Erkut Erdem.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Citamak, B., Caglayan, O., Kuyu, M. et al. MSVD-Turkish: a comprehensive multimodal video dataset for integrated vision and language research in Turkish. Machine Translation 35, 265–288 (2021). https://doi.org/10.1007/s10590-021-09276-y

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10590-021-09276-y

Keywords

  • Video description dataset
  • Turkish
  • Video captioning
  • Video understanding
  • Neural machine translation
  • Multimodal machine translation