Abstract
There has been tremendous interest in developing deep neural networks that integrate different kinds of temporal structure to generate natural language descriptions of videos. Considering the spatial and temporal domains inherent in video frames, we contend that video dynamics and spatio-temporal contexts are both important for captioning, and that they correspond to two different temporal structures. However, while video dynamics have been well investigated, spatio-temporal contexts have not received sufficient attention. In this paper, we take both structures into account and propose a novel recurrent convolution model for captioning. First, to obtain a comprehensive and detailed representation, we aggregate the local and global spatio-temporal contexts in recurrent convolution networks. Second, to capture subtler temporal dynamics, we introduce a channel attention mechanism, which also helps to explain how the frame feature maps are involved in the captioning process. Finally, a qualitative comparison with several variants of our model demonstrates the effectiveness of incorporating these two structures. Moreover, experiments on the YouTube2Text dataset show that the proposed method achieves performance competitive with other state-of-the-art methods.
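To make the idea of channel attention concrete, the following is a minimal NumPy sketch. It is not the formulation used in the paper, which the abstract does not specify. It assumes a common additive-attention form: each channel of a frame's feature maps is summarized by global average pooling, scored against a decoder hidden state, and the resulting softmax weights rescale the channels. All names (`channel_attention`, `W_v`, `W_h`, `b`, `w`) and all shapes are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def channel_attention(feature_maps, hidden, W_v, W_h, b, w):
    """Reweight the channels of one frame's feature maps (illustrative only).

    feature_maps: (C, H, W) convolutional feature maps of one frame
    hidden:       (D,) decoder hidden state (assumed conditioning signal)
    W_v: (K,), W_h: (K, D), b: (K,), w: (K,) hypothetical parameters
    """
    # One scalar descriptor per channel via global average pooling.
    pooled = feature_maps.mean(axis=(1, 2))                   # (C,)
    # Additive attention: combine channel descriptor and hidden state.
    proj = np.tanh(pooled[:, None] * W_v[None, :] + W_h @ hidden + b)  # (C, K)
    scores = proj @ w                                          # (C,)
    alpha = softmax(scores)                                    # weights sum to 1
    # Scale every channel by its attention weight.
    attended = feature_maps * alpha[:, None, None]             # (C, H, W)
    return attended, alpha


# Toy example with random inputs and randomly initialized parameters.
rng = np.random.default_rng(0)
C, height, width, D, K = 8, 7, 7, 16, 32
feature_maps = rng.standard_normal((C, height, width))
hidden = rng.standard_normal(D)
W_v = rng.standard_normal(K)
W_h = rng.standard_normal((K, D))
b = np.zeros(K)
w = rng.standard_normal(K)

attended, alpha = channel_attention(feature_maps, hidden, W_v, W_h, b, w)
```

In this sketch the weights `alpha` can be inspected per decoding step, which is one way such a mechanism could indicate which feature maps contribute to a generated word.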
Cite this article
Guo, D., Li, W. & Fang, X. Capturing Temporal Structures for Video Captioning by Spatio-temporal Contexts and Channel Attention Mechanism. Neural Process Lett 46, 313–328 (2017). https://doi.org/10.1007/s11063-017-9591-9
DOI: https://doi.org/10.1007/s11063-017-9591-9