
Refocused Attention: Long Short-Term Rewards Guided Video Captioning

  • Jiarong Dong
  • Ke Gao
  • Xiaokai Chen
  • Juan Cao

Abstract

The adaptive cooperation of the visual model and the language model is essential for video captioning. However, because end-to-end training provides no proper guidance at each time step, over-dependence on the language model often renders the attention-based visual model ineffective, which we call the 'Attention Defocus' problem in this paper. Based on the observation that the recognition precision of entity words reflects the effectiveness of the visual model, we propose a novel strategy, refocused attention, that optimizes the training and cooperation of the visual model and the language model by applying targeted guidance at the appropriate time steps. The strategy consists of short-term-reward guided local entity recognition and long-term-reward guided global relation understanding, neither of which requires any external training data. Moreover, a framework with hierarchical visual representations and hierarchical attention is established to fully exploit the strength of the proposed learning strategy. Extensive experiments demonstrate that the guidance strategy, together with the optimized structure, outperforms state-of-the-art video captioning methods, with relative improvements of 7.7% in BLEU-4 and 5.0% in CIDEr-D on the MSVD dataset, even without multi-modal features.
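As a concrete illustration of the reward guidance described above, the minimal Python sketch below mixes a per-step entity-word reward (a proxy for the short-term, entity-recognition signal) with a sentence-level caption-metric reward (standing in for a long-term, CIDEr-D-style signal) into discounted returns that could weight a policy-gradient update. The function names, the mixing weight lambda_short, the discount gamma, and the toy overlap scorer are illustrative assumptions, not the authors' exact formulation.

def short_term_reward(word, reference_words, entity_vocab):
    """Per-step reward: +1 if a generated entity word also appears in the
    reference caption, -1 if it is an entity word absent from the reference,
    0 for non-entity words (an assumed proxy for entity-recognition precision)."""
    if word not in entity_vocab:
        return 0.0
    return 1.0 if word in reference_words else -1.0


def mixed_returns(generated, reference, entity_vocab, scorer,
                  lambda_short=0.5, gamma=0.95):
    """Mix per-step (short-term) and sentence-level (long-term) rewards into
    discounted returns, one per time step, usable as weights for a
    REINFORCE-style policy-gradient update."""
    ref_words = set(reference)
    # Short-term reward at every step; long-term reward credited at the final step.
    rewards = [lambda_short * short_term_reward(w, ref_words, entity_vocab)
               for w in generated]
    rewards[-1] += (1.0 - lambda_short) * scorer(generated, reference)
    # Discounted reward-to-go for each time step.
    returns, running = [0.0] * len(generated), 0.0
    for t in reversed(range(len(generated))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    # Toy unigram-overlap scorer standing in for a CIDEr-D-like metric (assumption).
    overlap = lambda cand, ref: len(set(cand) & set(ref)) / max(len(set(ref)), 1)
    entity_vocab = {"man", "guitar", "dog", "piano"}
    generated = ["a", "man", "is", "playing", "piano"]
    reference = ["a", "man", "is", "playing", "a", "guitar"]
    print(mixed_returns(generated, reference, entity_vocab, overlap))

In practice, each return would multiply the log-probability of the corresponding sampled word, possibly against a greedy-decoding baseline as in self-critical sequence training; that wiring is omitted here.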

Keywords

Video captioning · Hierarchical attention · Reinforcement learning · Reward


Acknowledgements

This work was supported by the National Key Research and Development Program (2017YFC0820601), the Beijing Science and Technology Project (Z171100000117010), the National Natural Science Foundation of China (61571424, U1703261), and the Beijing Municipal Natural Science Foundation and Beijing Education Committee cooperation project (No. KZ201810005002).


Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
  2. University of Chinese Academy of Sciences, Beijing, China
