Cross-Modal and Hierarchical Modeling of Video and Text

  • Bowen Zhang
  • Hexiang Hu
  • Fei Sha
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11217)


Visual data and text data are composed of information at multiple granularities. A video can describe a complex scene composed of multiple clips or shots, each of which depicts a semantically coherent event or action. Similarly, a paragraph may contain sentences with different topics, which collectively convey a coherent message or story. In this paper, we investigate modeling techniques for such hierarchical sequential data where there are correspondences across multiple modalities. Specifically, we introduce hierarchical sequence embedding (hse), a generic model for embedding sequential data of different modalities into hierarchically semantic spaces, with either explicit or implicit correspondence information. We perform empirical studies on large-scale video and paragraph retrieval datasets and demonstrate the superior performance of the proposed methods. Furthermore, we examine the effectiveness of our learned embeddings when applied to downstream tasks, showing their utility in zero-shot action recognition and video captioning.
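The hierarchical embedding idea above can be illustrated with a minimal sketch: frame features within each clip are pooled into clip embeddings, which are in turn pooled into a video embedding, yielding representations at both levels of the hierarchy. The paper uses recurrent encoders; this sketch substitutes simple mean-pooling with learned projections (`W_low`, `W_high` are hypothetical parameters, not the authors' actual model) purely to show the two-level structure.

```python
import numpy as np

def encode_segment(frames, W_low):
    """Low-level encoder: pool frame features within one clip,
    then project into the clip-level embedding space."""
    return np.tanh(np.mean(frames, axis=0) @ W_low)

def encode_video(clips, W_low, W_high):
    """High-level encoder: embed each clip, then pool the clip
    embeddings into a single video-level embedding.

    Returns (video_embedding, clip_embeddings); the paper's model
    similarly exposes both levels so that video-paragraph and
    clip-sentence correspondences can be supervised."""
    clip_embs = np.stack([encode_segment(c, W_low) for c in clips])
    video_emb = np.tanh(np.mean(clip_embs, axis=0) @ W_high)
    return video_emb, clip_embs

def cosine_sim(a, b):
    """Similarity used for cross-modal retrieval between embeddings."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
```

A paragraph encoder would mirror this structure (word features → sentence embeddings → paragraph embedding), with retrieval performed by ranking `cosine_sim` between the video- and paragraph-level embeddings.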


Hierarchical sequence embedding · Video text retrieval · Video description generation · Action recognition · Zero-shot transfer



We appreciate the feedback from the reviewers. This work is partially supported by NSF IIS-1065243, 1451412, 1513966/1632803/1833137, 1208500, CCF-1139148, a Google Research Award, an Alfred P. Sloan Research Fellowship, gifts from Facebook and Netflix, and ARO# W911NF-12-1-0241 and W911NF-15-1-0484.

Supplementary material

Supplementary material 1: 474201_1_En_23_MOESM1_ESM.pdf (PDF, 9.4 MB)



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Department of Computer Science, University of Southern California, Los Angeles, USA
  2. Netflix, 5808 Sunset Blvd, Los Angeles, USA
