Learning Joint Representations of Videos and Sentences with Web Image Search

  • Mayu Otani
  • Yuta Nakashima
  • Esa Rahtu
  • Janne Heikkilä
  • Naokazu Yokoya
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9913)


Our objective is video retrieval based on natural language queries. In addition, we consider the analogous problems of retrieving sentences and generating descriptions given an input video. Recent work has addressed these problems by embedding visual and textual inputs into a common space in which semantic similarity correlates with distance. We also adopt the embedding approach and make the following contributions. First, we utilize web image search in the sentence embedding process to disambiguate fine-grained visual concepts. Second, we propose embedding models for sentence, image, and video inputs whose parameters are learned simultaneously. Finally, we show how the proposed model can be applied to description generation. Overall, we observe a clear improvement over state-of-the-art methods in the video and sentence retrieval tasks. In description generation, the performance is comparable to the current state of the art, even though our embeddings were trained for the retrieval tasks.
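
To make the retrieval setting concrete, the following is a minimal sketch of how ranking in a common embedding space might look once sentence and video embeddings are available. It is not the authors' model: the three-dimensional vectors, the video identifiers, the helper names `cosine_similarity` and `rank_by_similarity`, and the choice of cosine similarity as the scoring function are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Return candidate ids ordered from most to least similar to the query."""
    scored = [(cosine_similarity(query, vec), cid) for cid, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cid for _, cid in scored]


# Hypothetical embeddings; a trained model would produce these vectors.
video_embeddings = {
    "video_cooking": [0.9, 0.1, 0.0],
    "video_soccer": [0.0, 0.2, 0.9],
    "video_guitar": [0.1, 0.9, 0.1],
}
# Hypothetical embedding of a query such as "a person is slicing vegetables".
sentence_embedding = [0.8, 0.2, 0.1]

ranking = rank_by_similarity(sentence_embedding, video_embeddings)
print(ranking)
```

Sentence retrieval for a given video would follow the same pattern with the roles swapped: the video embedding becomes the query and the candidate set holds sentence embeddings.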


Keywords: Video retrieval, Sentence retrieval, Multimodal embedding, Neural network, Image search, Representation learning



This work is partly supported by JSPS KAKENHI No. 16K16086.

Supplementary material

Supplementary material 1: 431902_1_En_46_MOESM1_ESM.pdf (PDF, 3435 KB)



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Mayu Otani (1)
  • Yuta Nakashima (1)
  • Esa Rahtu (2)
  • Janne Heikkilä (2)
  • Naokazu Yokoya (1)

  1. Graduate School of Information Science, Nara Institute of Science and Technology, Ikoma, Japan
  2. Center for Machine Vision and Signal Analysis, University of Oulu, Oulu, Finland
