Multi-guiding long short-term memory for video captioning

  • Ning Xu
  • An-An Liu
  • Weizhi Nie
  • Yuting Su
Special Issue Paper


Recent research has increasingly used recurrent neural networks (RNNs) as the decoder in the video captioning task. However, the generated sentence often seems to “lose track” of the video content due to the fixed language rule. Although existing methods try to “guide” the decoder and keep it “on track”, they mainly rely on a single-modal feature, which does not match the multi-modal (visual and semantic) and complementary (local and global) nature of the video captioning task. To this end, we propose the multi-guiding long short-term memory (mg-LSTM), an extension of the LSTM network for video captioning. We add global information (i.e., detected attributes) and local information (i.e., appearance features) extracted from the video as extra inputs to each LSTM cell, with the aim of collaboratively guiding the model towards solutions that are more tightly coupled to the video content. In particular, the appearance and attribute features are first used to produce local and global guiders, respectively. We then propose a novel cell-wise ensemble, in which the weight matrix of each LSTM cell is extended to a set of attribute-dependent and attention-dependent weight matrices, through which the guiders steer the optimization of each cell over time. Extensive experiments on three benchmark datasets (i.e., MSVD, MSR-VTT, and MPII-MD) show that our method achieves results competitive with the state of the art. Additional ablation studies are conducted on variants of the proposed mg-LSTM.
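The cell-wise ensemble idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no equations, so the function names (`attend`, `mix_weights`, `guided_gate`), the use of soft attention to form the local guider, and the choice to mix each set of weight matrices as a guider-weighted sum are all assumptions made for illustration. For simplicity the sketch also assumes that each guider entry directly weights one matrix in its set, and it computes a single gate rather than a full LSTM cell.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(frame_feats, scores):
    """Hypothetical local guider: attention-weighted average of
    per-frame appearance features (one feature vector per frame)."""
    alphas = softmax(scores)
    dim = len(frame_feats[0])
    return [sum(a * f[d] for a, f in zip(alphas, frame_feats))
            for d in range(dim)]


def mix_weights(weight_set, guider):
    """Assumed ensemble rule: effective matrix = sum_k guider[k] * weight_set[k]."""
    rows, cols = len(weight_set[0]), len(weight_set[0][0])
    return [[sum(g * w[r][c] for g, w in zip(guider, weight_set))
             for c in range(cols)]
            for r in range(rows)]


def matvec(matrix, vec):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def guided_gate(x, attr_weights, attr_probs, attn_weights, local_guider):
    """One sigmoid gate whose weights depend on both guiders.

    attr_probs plays the role of the global guider (detected attributes);
    local_guider plays the role of the attended appearance feature.
    """
    w_attr = mix_weights(attr_weights, attr_probs)
    w_attn = mix_weights(attn_weights, local_guider)
    pre = [a + b for a, b in zip(matvec(w_attr, x), matvec(w_attn, x))]
    return [1.0 / (1.0 + math.exp(-p)) for p in pre]


# Toy usage with made-up numbers: two frames, two attributes, 2-d input.
frames = [[1.0, 0.0], [0.0, 1.0]]
local = attend(frames, [0.0, 0.0])
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
gate = guided_gate([1.0, -1.0], [identity, zeros], [1.0, 0.0],
                   [identity, identity], local)
print(gate)
```

In this toy setup the gate is a different function of the same input whenever the detected attributes or the attended frames change, which is the sense in which the guiders keep the decoder coupled to the video content.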



This work was supported in part by the National Natural Science Foundation of China (61772359, 61472275, 61502337).



Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. School of Electrical and Information Engineering, Tianjin University, Tianjin, China
