Understanding temporal structure for video captioning

  • Shagan Sah
  • Thang Nguyen
  • Ray Ptucha
Theoretical advances


Recent research in convolutional and recurrent neural networks has fueled remarkable advances in video understanding. We propose a video captioning framework that achieves the performance and quality necessary for deployment in distributed surveillance systems. Our method combines an efficient hierarchical architecture with novel attention mechanisms at both the local and global levels. By shifting focus to different spatiotemporal locations, attention mechanisms correlate sequential outputs with activation maps, offering an effective way to adaptively combine multiple frames and locations of video. Because soft attention mixing weights are learned via back-propagation, the number of weights, and hence the number of input frames, must be known in advance. To remove this restriction, our video understanding framework employs continuous attention mechanisms over a family of Gaussian distributions. Our efficient multistream hierarchical model combines a recurrent architecture with a soft hierarchy layer using both equally spaced and dynamically localized boundary cuts. As opposed to costly volumetric attention approaches, we use video attributes to steer temporal attention. Our fully learnable end-to-end approach helps predict salient temporal regions of actions and objects in the video. We demonstrate state-of-the-art captioning results on the popular MSVD, MSR-VTT and M-VAD video datasets and compare several variants of the algorithm suitable for real-time applications. By adjusting the frame rate, we show that a single computer can generate effective video captions for 100 simultaneous cameras. We additionally perform studies to show how bit rate compression modifies captioning results.
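The contrast drawn above between discrete soft attention (which fixes the number of mixing weights at training time) and continuous attention over Gaussian distributions (which does not) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names are invented, and a single Gaussian stands in for the family of Gaussians the paper describes.

```python
import numpy as np

def soft_attention(features, scores):
    """Discrete soft attention: mixing weights are a softmax over learned
    per-frame scores, so the number of frames T is baked into the model
    at training time. features: (T, D), scores: (T,)."""
    w = np.exp(scores - scores.max())   # numerically stable softmax
    w /= w.sum()                        # (T,) weights summing to 1
    return w @ features                 # weighted sum over frame features

def gaussian_attention(features, mu, sigma):
    """Continuous attention: weights come from evaluating a Gaussian at
    normalized frame positions, so only (mu, sigma) are learned and T may
    vary at inference time. features: (T, D); mu, sigma in [0, 1]."""
    T = features.shape[0]
    pos = np.linspace(0.0, 1.0, T)      # frame positions normalized to [0, 1]
    w = np.exp(-0.5 * ((pos - mu) / sigma) ** 2)
    w /= w.sum()                        # normalize to a valid mixture
    return w @ features
```

Because `gaussian_attention` is parameterized by the distribution rather than by one weight per frame, the same learned parameters apply to clips of any length, which is what removes the fixed-frame-count restriction.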


Keywords: Video captioning · Deep learning · Attention models · Hierarchical neural networks




Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2019

Authors and Affiliations

  1. Chester F. Carlson Center for Imaging Science, Rochester Institute of Technology, Rochester, USA
  2. Machine Intelligence Lab, Computer Engineering Department, Rochester Institute of Technology, Rochester, USA
