Capturing Temporal Structures for Video Captioning by Spatio-temporal Contexts and Channel Attention Mechanism

Guo, Dashan; Li, Wei; Fang, Xiangzhong

doi:10.1007/s11063-017-9591-9

Capturing Temporal Structures for Video Captioning by Spatio-temporal Contexts and Channel Attention Mechanism

Published: 25 January 2017

Volume 46, pages 313–328, (2017)
Cite this article

Neural Processing Letters Aims and scope Submit manuscript

633 Accesses
6 Citations
3 Altmetric
Explore all metrics

Abstract

To generate a natural language description for videos, there has been tremendous interest in developing deep neural networks with the integration of temporal structures in different categories. Considering the spatial and temporal domains inherent in video frames, we contend that the video dynamics and the spatio-temporal contexts are both important for captioning, which correspond to two different temporal structures. However, while the video dynamics is well investigated, the spatio-temporal contexts have not been given sufficient attention. In this paper, we take both structures into account and propose a novel recurrent convolution model for captioning. Firstly, for a comprehensive and detailed representation, we propose to aggregate the local and global spatio-temporal contexts in the recurrent convolution networks. Secondly, to capture much subtler temporal dynamics, the channel attention mechanism is introduced and it helps to understand the involvement of the frame feature maps with the captioning process. Finally, a qualitative comparison with several variants of our model demonstrates the effectiveness of incorporating these two structures. Moreover, experiments on YouTube2Text dataset have shown that the proposed method achieves competitive performance to other state-of-the-art methods.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A Systematic Survey on CAPTCHA Recognition: Types, Creation and Breaking Techniques

Article 14 June 2021

Video summarization using deep learning techniques: a detailed analysis and investigation

Article 15 March 2023

TextConvoNet: a convolutional neural network based architecture for text classification

Article 22 October 2022

References

Azorin-Lopez J, Saval-Calvo M, Fuster-Guillo A, Garcia-Rodriguez J (2016) A novel prediction method for early recognition of global human behaviour in image sequences. Neural Process Lett 43(2):363–387
Article Google Scholar
Ballas N, Yao L, Pal C, Courville A (2015) Delving deeper into convolutional networks for learning video representations. arXiv preprint arXiv:1511.06432
Chen DL, Dolan WB (2011) Collecting highly parallel data for paraphrase evaluation. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, Vol 1. Association for Computational Linguistics, pp 190–200
Chen X, Fang H, Lin TY, Vedantam R, Gupta S, Dollar P, Zitnick CL (2015) Microsoft coco captions: data collection and evaluation server. arXiv preprint arXiv:1504.00325
Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078
Chung J, Gulcehre C, Cho K, Bengio Y (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555
Denkowski M, Lavie A (2014) Meteor universal: Language specific translation evaluation for any target language. In: Proceedings of the EACL 2014 workshop on statistical machine translation, vol 6
Donahue J, Anne Hendricks L, Guadarrama S, Rohrbach M, Venugopalan S, Saenko K, Darrell T (2015) Long-term recurrent convolutional networks for visual recognition and description. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2625–2634
Fernando B, Gould S (2016) Learning end-to-end video classification with rank-pooling. In: Proceedings of the 33rd international conference on machine learning, vol 48. JMLR: W&CP, New York
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
Article Google Scholar
Hong C, Yu J, Wan J, Tao D, Wang M (2015) Multimodal deep autoencoder for human pose recovery. IEEE Trans Image Process 24(12):5659–5670
Article MathSciNet Google Scholar
Hong C, Chen X, Wang X, Tang C (2016) Hypergraph regularized autoencoder for image-based 3d human pose recovery. Signal Process 124:132–140
Article Google Scholar
Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Fei-Fei L (2014) Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1725–1732
Oneata D, Verbeek J, Schmid C (2013) Action and event recognition with fisher vectors on a compact feature set. In: Proceedings of the IEEE international conference on computer vision, pp 1817–1824
Pan P, Xu Z, Yang Y, Wu F, Zhuang Y (2015) Hierarchical recurrent neural encoder for video representation with application to captioning. arXiv preprint arXiv:1511.03476
Papineni K, Roukos S, Ward T, Zhu WJ (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting on association for computational linguistics, Association for Computational Linguistics, pp 311–318
Peng X, Zou C, Qiao Y, Peng Q (2014) Action recognition with stacked fisher vectors. In: European conference on computer vision, Springer, Berlin, pp 581–595
Rekabdar B, Nicolescu M, Nicolescu M, Saffar MT, Kelley R (2016) A scale and translation invariant approach for early classification of spatio-temporal patterns using spiking neural networks. Neural Process Lett 43(2):327–343
Article Google Scholar
Simonyan K, Zisserman A (2014a) Two-stream convolutional networks for action recognition in videos. In: Advances in neural information processing systems, pp 568–576
Simonyan K, Zisserman A (2014b) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
Srivastava N, Hinton GE, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958
MathSciNet MATH Google Scholar
Srivastava N, Mansimov E, Salakhutdinov R (2015) Unsupervised learning of video representations using LSTMs. In: Proceedings of the 32nd international conference on machine learning, vol 37. JMLR: W&CP, Lille
Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1–9
Team TTD, Al-Rfou R, Alain G, Almahairi A, Angermueller C, Bahdanau D, Ballas N, Bastien F, Bayer J, Belikov A, et al. (2016) Theano: a python framework for fast computation of mathematical expressions. arXiv preprint arXiv:1605.02688
Tran D, Bourdev L, Fergus R, Torresani L, Paluri M (2014) C3d: generic features for video analysis. CoRR, abs/14120767 2:7
Vedantam R, Lawrence Zitnick C, Parikh D (2015) Cider: Consensus-based image description evaluation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4566–4575
Venugopalan S, Rohrbach M, Donahue J, Mooney R, Darrell T, Saenko K (2015) Sequence to sequence—video to text. In: Proceedings of the IEEE international conference on computer vision, pp 4534–4542
Venugopalan S, Xu H, Donahue J, Rohrbach M, Mooney R, Saenko K (2014) Translating videos to natural language using deep recurrent neural networks. arXiv preprint arXiv:1412.4729
Wang H, Schmid C (2013) Action recognition with improved trajectories. In: Proceedings of the IEEE international conference on computer vision, pp 3551–3558
Xingjian S, Chen Z, Wang H, Yeung DY, Wong WK, Woo WC (2015) Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In: Advances in neural information processing systems, pp 802–881
Yao L, Torabi A, Cho K, Ballas N, Pal C, Larochelle H, Courville A (2015) Describing videos by exploiting temporal structure. In: Proceedings of the IEEE international conference on computer vision, pp 4507–4515
Yeung S, Russakovsky O, Jin N, Andriluka M, Mori G, Fei-Fei L (2015) Every moment counts: Dense detailed labeling of actions in complex videos. arXiv preprint arXiv:1507.05738
Yu J, Yang X, Gao F, Tao D (2016) Deep multimodal distance metric learning using click constraints for image ranking. IEEE Trans Cybern. doi:10.1109/TCYB.2016.2591583
Zeiler MD (2012) Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701

Download references

Author information

Authors and Affiliations

The Institute of Image Communication and Information Processing, Shanghai Jiao Tong University, Shanghai, 200240, China
Dashan Guo, Wei Li & Xiangzhong Fang

Authors

Dashan Guo
View author publications
You can also search for this author in PubMed Google Scholar
Wei Li
View author publications
You can also search for this author in PubMed Google Scholar
Xiangzhong Fang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Dashan Guo.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Guo, D., Li, W. & Fang, X. Capturing Temporal Structures for Video Captioning by Spatio-temporal Contexts and Channel Attention Mechanism. Neural Process Lett 46, 313–328 (2017). https://doi.org/10.1007/s11063-017-9591-9

Download citation

Published: 25 January 2017
Issue Date: August 2017
DOI: https://doi.org/10.1007/s11063-017-9591-9

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Capturing Temporal Structures for Video Captioning by Spatio-temporal Contexts and Channel Attention Mechanism

Abstract

Access this article

Similar content being viewed by others

A Systematic Survey on CAPTCHA Recognition: Types, Creation and Breaking Techniques

Video summarization using deep learning techniques: a detailed analysis and investigation

TextConvoNet: a convolutional neural network based architecture for text classification

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Capturing Temporal Structures for Video Captioning by Spatio-temporal Contexts and Channel Attention Mechanism

Abstract

Access this article

Similar content being viewed by others

A Systematic Survey on CAPTCHA Recognition: Types, Creation and Breaking Techniques

Video summarization using deep learning techniques: a detailed analysis and investigation

TextConvoNet: a convolutional neural network based architecture for text classification

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation