Understanding temporal structure for video captioning

Abstract

Recent research in convolutional and recurrent neural networks has fueled incredible advances in video understanding. We propose a video captioning framework that achieves the performance and quality necessary for deployment in distributed surveillance systems. Our method combines an efficient hierarchical architecture with novel attention mechanisms at both the local and global levels. By shifting focus to different spatiotemporal locations, attention mechanisms correlate sequential outputs with activation maps, offering a clever way to adaptively combine multiple frames and locations of a video. Because soft attention mixing weights are learned via back-propagation, the number of weights, and thus the number of input frames, must be known in advance. To remove this restriction, our video understanding framework combines continuous attention mechanisms over a family of Gaussian distributions. Our efficient multistream hierarchical model combines a recurrent architecture with a soft hierarchy layer using both equally spaced and dynamically localized boundary cuts. In contrast to costly volumetric attention approaches, we use video attributes to steer temporal attention. Our fully learnable end-to-end approach helps predict salient temporal regions of actions and objects in the video. We demonstrate state-of-the-art captioning results on the popular MSVD, MSR-VTT and M-VAD video datasets and compare several variants of the algorithm suitable for real-time applications. By adjusting the frame rate, we show that a single computer can generate effective video captions for 100 simultaneous cameras. We additionally perform studies to show how bit-rate compression affects captioning results.
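
To make the continuous-attention idea in the abstract concrete, the sketch below mixes frame features with weights formed from a small family of Gaussians evaluated over normalized frame positions; because the kernels live on a normalized temporal axis rather than as a fixed set of per-frame weights, the number of input frames does not need to be known in advance. This is a minimal illustrative sketch only: the function names, the use of NumPy, and the fixed centers and widths are assumptions, not the authors' implementation (in the paper's setting the Gaussian parameters would be predicted by the network).

```python
# Minimal sketch of continuous temporal attention over a family of Gaussians.
# All names and parameter values are illustrative assumptions.
import numpy as np

def gaussian_attention_weights(num_frames, centers, widths):
    """Build soft attention weights from K Gaussians over normalized frame positions.

    centers, widths: arrays of shape (K,); centers in [0, 1], widths > 0.
    """
    t = np.linspace(0.0, 1.0, num_frames)                # normalized frame positions
    # Evaluate each Gaussian at every frame position -> shape (K, num_frames)
    logits = -0.5 * ((t[None, :] - centers[:, None]) / widths[:, None]) ** 2
    kernels = np.exp(logits)
    weights = kernels.sum(axis=0)                         # mix the K kernels
    return weights / weights.sum()                        # normalize to a distribution

def attend(frame_features, centers, widths):
    """Collapse (num_frames, feat_dim) frame features into one attended vector."""
    w = gaussian_attention_weights(frame_features.shape[0], centers, widths)
    return w @ frame_features

# Example: 40 frames of 2048-d CNN features, with two Gaussians attending to
# the start and the middle of the clip. The same code works for any frame count.
feats = np.random.randn(40, 2048).astype(np.float32)
context = attend(feats, centers=np.array([0.1, 0.5]), widths=np.array([0.05, 0.1]))
print(context.shape)  # (2048,)
```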

Author information

Corresponding author

Correspondence to Shagan Sah.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Sah, S., Nguyen, T. & Ptucha, R. Understanding temporal structure for video captioning. Pattern Anal Applic 23, 147–159 (2020). https://doi.org/10.1007/s10044-018-00770-3
