World Wide Web, Volume 22, Issue 2, pp. 621–636

Residual attention-based LSTM for video captioning

  • Xiangpeng Li
  • Zhilong Zhou
  • Lijiang Chen
  • Lianli Gao
Article
Part of the following topical collections:
  1. Special Issue on Deep vs. Shallow: Learning for Emerging Web-scale Data Computing and Applications

Abstract

Recently, great success has been achieved in video captioning by frameworks with hierarchical LSTMs, such as stacked LSTM networks. However, once deeper LSTM layers are able to start converging, a degradation problem is exposed: as the number of LSTM layers increases, accuracy saturates and then degrades rapidly, much as in standard deep convolutional networks such as VGG. In this paper, we propose a novel attention-based framework, namely Residual Attention-based LSTM (Res-ATT), which not only takes advantage of an existing attention mechanism but also considers the importance of sentence-internal information, which is usually lost in the transmission process. Our key novelty is that we show how to integrate residual mapping into a hierarchical LSTM network to solve the degradation problem. More specifically, our novel hierarchical architecture builds on two LSTM layers, and residual mapping is introduced to avoid losing information about previously generated words (i.e., both content information and relationship information). Experimental results on two mainstream datasets, MSVD and MSR-VTT, show that our framework outperforms state-of-the-art approaches. Furthermore, our automatically generated sentences provide more detailed information and describe a video more precisely.
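
To make the residual idea in the abstract concrete, the following is a minimal, illustrative PyTorch sketch of a two-layer LSTM decoder with soft temporal attention, in which the previous word embedding is added back residually onto the first layer's output before it feeds the second layer. This is not the authors' implementation: the class name ResidualAttentionDecoder, the layer sizes, the soft-attention form, and the exact placement of the residual addition are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualAttentionDecoder(nn.Module):
    """Illustrative two-layer LSTM decoder with temporal attention and a
    residual connection that re-injects the previous word embedding, so
    word-level information is not lost between layers (hypothetical sketch)."""

    def __init__(self, vocab_size, embed_dim=512, feat_dim=1536, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # First LSTM layer consumes the previous word embedding.
        self.lstm1 = nn.LSTMCell(embed_dim, hidden_dim)
        # Second LSTM layer consumes the attended video context plus the
        # residual stream coming from the first layer.
        self.lstm2 = nn.LSTMCell(feat_dim + hidden_dim, hidden_dim)
        # Soft attention over per-frame video features.
        self.att_feat = nn.Linear(feat_dim, hidden_dim)
        self.att_hidden = nn.Linear(hidden_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def attend(self, feats, h):
        # feats: (batch, n_frames, feat_dim), h: (batch, hidden_dim)
        scores = self.att_score(
            torch.tanh(self.att_feat(feats) + self.att_hidden(h).unsqueeze(1)))
        alpha = F.softmax(scores, dim=1)        # attention weights over frames
        return (alpha * feats).sum(dim=1)       # (batch, feat_dim) context vector

    def forward(self, feats, prev_word, state1, state2):
        emb = self.embed(prev_word)             # (batch, embed_dim)
        h1, c1 = self.lstm1(emb, state1)
        # Residual mapping: add the word embedding back onto the first
        # layer's output (assumes embed_dim == hidden_dim).
        res = h1 + emb
        ctx = self.attend(feats, res)            # attended video context
        h2, c2 = self.lstm2(torch.cat([ctx, res], dim=1), state2)
        logits = self.out(h2)
        return logits, (h1, c1), (h2, c2)
```

A single decoding step under these assumptions might look as follows, with 4 videos, 28 frames of pooled CNN features per video, and a begin-of-sentence token id of 0:

```python
decoder = ResidualAttentionDecoder(vocab_size=10000)
feats = torch.randn(4, 28, 1536)
prev_word = torch.zeros(4, dtype=torch.long)
state1 = state2 = (torch.zeros(4, 512), torch.zeros(4, 512))
logits, state1, state2 = decoder(feats, prev_word, state1, state2)
```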

Keywords

LSTM · Attention mechanism · Residual thought · Video captioning

Acknowledgments

This work is supported by the Fundamental Research Funds for the Central Universities (Grant No. ZYGX2016J085), the National Natural Science Foundation of China (Grant No. 61772116, No. 61502080, No. 61632007) and the 111 Project (Grant No. B17008).


Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China
  2. Beijing Afanti Technology Co., Ltd., Beijing, China
