World Wide Web, Volume 22, Issue 2, pp 735–749

Exploiting long-term temporal dynamics for video captioning

  • Yuyu Guo
  • Jingqiu Zhang
  • Lianli Gao
Part of the following topical collections:
  1. Special Issue on Deep vs. Shallow: Learning for Emerging Web-scale Data Computing and Applications


Automatically describing videos with natural language is a fundamental challenge for computer vision and natural language processing. Recent progress on this problem has followed two steps: 1) employing 2-D and/or 3-D Convolutional Neural Networks (CNNs) (e.g. VGG, ResNet or C3D) to extract spatial and/or temporal features that encode video content; and 2) applying Recurrent Neural Networks (RNNs) to generate sentences describing the events in a video. Temporal attention-based models have made much progress by weighting the importance of each video frame. However, for a long video, especially one consisting of a set of sub-events, we should discover and leverage the importance of each sub-shot instead of each frame. In this paper, we propose a novel approach, namely temporal and spatial LSTM (TS-LSTM), which systematically exploits spatial and temporal dynamics within video sequences. In TS-LSTM, a temporal pooling LSTM (TP-LSTM) is designed to incorporate both spatial and temporal information to extract long-term temporal dynamics within video sub-shots, and a stacked LSTM is introduced to generate the sequence of words describing the video. Experimental results on two public video captioning benchmarks indicate that our TS-LSTM outperforms state-of-the-art methods.
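The core encoding idea in the abstract — run an LSTM over the frames of each sub-shot and pool its hidden states into one sub-shot representation — can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`lstm_step`, `encode_subshots`), the fixed sub-shot length `shot_len`, and the choice of mean pooling are all illustrative assumptions; in practice the frame features would come from a pretrained CNN such as VGG, ResNet or C3D.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step; the four gates are stacked as [i, f, o, g] in z."""
    z = W @ x + U @ h + b
    H = h.shape[0]
    i, f, o = sigmoid(z[:H]), sigmoid(z[H:2 * H]), sigmoid(z[2 * H:3 * H])
    g = np.tanh(z[3 * H:])
    c = f * c + i * g          # update cell state
    h = o * np.tanh(c)         # emit hidden state
    return h, c

def encode_subshots(frame_feats, shot_len, W, U, b, hidden):
    """Split frames into fixed-length sub-shots, run the LSTM over each
    sub-shot, and mean-pool its hidden states into one representation."""
    reps = []
    for s in range(0, len(frame_feats), shot_len):
        h, c = np.zeros(hidden), np.zeros(hidden)
        hs = []
        for x in frame_feats[s:s + shot_len]:
            h, c = lstm_step(x, h, c, W, U, b)
            hs.append(h)
        reps.append(np.mean(hs, axis=0))   # temporal pooling per sub-shot
    return np.stack(reps)
```

The resulting sub-shot representations would then be fed to a decoder (in the paper, a stacked LSTM) that generates the caption word by word.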


Keywords: RNNs · Video captioning · Long-term temporal dynamics



This work is supported by the Fundamental Research Funds for the Central Universities (Grant No. ZYGX2016J085), the National Natural Science Foundation of China (Grant No. 61772116, No. 61502080, No. 61632007) and the 111 Project (Grant No. B17008).



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Future Media Center and School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China
