
Structured Matching for Phrase Localization

  • Mingzhe Wang
  • Mahmoud Azab
  • Noriyuki Kojima
  • Rada Mihalcea
  • Jia Deng
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9912)

Abstract

In this paper we introduce a new approach to phrase localization: grounding phrases in sentences to image regions. We propose a structured matching of phrases and regions that encourages the semantic relations between phrases to agree with the visual relations between regions. We formulate structured matching as a discrete optimization problem and relax it to a linear program. We use neural networks to embed regions and phrases into vectors, which then define the similarities (matching weights) between regions and phrases. We integrate structured matching with neural networks to enable end-to-end training. Experiments on Flickr30K Entities demonstrate the empirical effectiveness of our approach.
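As a rough illustration (not the authors' implementation), the sketch below shows what the relaxed matching step described above might look like: phrase and region embeddings (here random placeholders standing in for the neural-network outputs) define the similarity weights, and the requirement that each phrase be grounded to one region is relaxed to a linear program. The structured constraints coupling phrase relations with region relations, which are the paper's central contribution, are omitted for brevity; the function name and all data are hypothetical.

```python
# Minimal sketch of relaxed phrase-to-region matching (assumed setup, not the paper's code).
import numpy as np
from scipy.optimize import linprog

def match_phrases_to_regions(phrase_emb, region_emb):
    """Relaxed bipartite matching: each phrase is assigned one region.

    phrase_emb: (P, d) array of L2-normalized phrase embeddings
    region_emb: (R, d) array of L2-normalized region embeddings
    Returns a (P, R) relaxed assignment matrix with values in [0, 1].
    """
    P, R = phrase_emb.shape[0], region_emb.shape[0]
    # Matching weights = cosine similarities (embeddings are pre-normalized).
    sim = phrase_emb @ region_emb.T                      # (P, R)
    # Maximize total similarity <=> minimize its negative (linprog minimizes).
    c = -sim.ravel()
    # Equality constraints: each phrase's assignment variables sum to 1.
    A_eq = np.zeros((P, P * R))
    for i in range(P):
        A_eq[i, i * R:(i + 1) * R] = 1.0
    b_eq = np.ones(P)
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0.0, 1.0), method="highs")
    return res.x.reshape(P, R)

# Toy usage: 3 phrases, 5 candidate regions, 64-dimensional embeddings.
rng = np.random.default_rng(0)
phrases = rng.normal(size=(3, 64)); phrases /= np.linalg.norm(phrases, axis=1, keepdims=True)
regions = rng.normal(size=(5, 64)); regions /= np.linalg.norm(regions, axis=1, keepdims=True)
assignment = match_phrases_to_regions(phrases, regions)
print(assignment.argmax(axis=1))   # best region index per phrase
```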

Keywords

Vision · Language


Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Mingzhe Wang (1)
  • Mahmoud Azab (1)
  • Noriyuki Kojima (1)
  • Rada Mihalcea (1)
  • Jia Deng (1)
  1. Computer Science and Engineering, University of Michigan, Ann Arbor, USA
