Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12349)


Automatically generating sentences to describe events and temporally localizing sentences in a video are two important tasks that bridge language and videos. Recent techniques leverage the multimodal nature of videos by using off-the-shelf features to represent videos, but interactions between modalities are rarely explored. Inspired by evidence of cross-modal interactions in the human brain, we propose a novel method for learning pairwise modality interactions that better exploits the complementary information in each pair of modalities in videos and thus improves performance on both tasks. We model modality interaction at both the sequence and channel levels in a pairwise fashion, and the pairwise interaction also provides some explainability for the predictions of the target tasks. We demonstrate the effectiveness of our method and validate specific design choices through extensive ablation studies. Our method achieves state-of-the-art performance on four standard benchmark datasets: MSVD and MSR-VTT (event captioning task), and Charades-STA and ActivityNet Captions (temporal sentence localization task).
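To make the idea of pairwise interaction at the sequence and channel levels more concrete, the following is a minimal illustrative sketch in plain Python. It is not the method proposed in the paper: the abstract does not specify the interaction functions, so scaled dot-product attention (sequence level) and a sigmoid gate over time-averaged channels (channel level) are assumed here as stand-ins. All function names, modality names, and feature values are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_interaction(x, y):
    """Assumed sequence-level interaction: each time step of x attends
    over all time steps of y via scaled dot-product attention."""
    d = len(x[0])
    out = []
    for xi in x:
        scores = [sum(a * b for a, b in zip(xi, yj)) / math.sqrt(d) for yj in y]
        weights = softmax(scores)
        out.append([sum(w * yj[c] for w, yj in zip(weights, y)) for c in range(d)])
    return out


def channel_interaction(x, y):
    """Assumed channel-level interaction: scale each channel of x by a
    sigmoid gate computed from y's time-averaged channel value."""
    d = len(y[0])
    y_mean = [sum(step[c] for step in y) / len(y) for c in range(d)]
    gate = [1.0 / (1.0 + math.exp(-m)) for m in y_mean]
    return [[v * g for v, g in zip(step, gate)] for step in x]


def pairwise_interactions(modalities):
    """Apply both interactions to every ordered pair of distinct modalities."""
    results = {}
    for name_a, feats_a in modalities.items():
        for name_b, feats_b in modalities.items():
            if name_a != name_b:
                results[(name_a, name_b)] = {
                    "sequence": sequence_interaction(feats_a, feats_b),
                    "channel": channel_interaction(feats_a, feats_b),
                }
    return results


# Hypothetical toy features: each modality is a list of time steps,
# each time step a list of channel values (same channel count for all).
modalities = {
    "appearance": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    "audio": [[0.5, 0.5], [2.0, -1.0]],
}
interactions = pairwise_interactions(modalities)
```

Keeping one result per ordered pair of modalities is what would allow the contribution of each pair to be inspected separately, which is the kind of explainability the abstract alludes to.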


Keywords: Temporal sentence localization, Event captioning in videos, Modality interaction



Shaoxiang Chen is partially supported by the Tencent Elite Internship program.




Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Shanghai Key Lab of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai, China
  2. Tencent AI Lab, Bellevue, USA
