
Propagating Over Phrase Relations for One-Stage Visual Grounding

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12364)

Abstract

Phrase-level visual grounding aims to locate, in an image, the visual regions referred to by multiple noun phrases in a given sentence. Its challenge comes not only from large variations in visual content and unrestricted phrase descriptions, but also from the need to resolve referrals unambiguously, which requires reasoning over the relations among phrases. In this paper, we propose a linguistic structure guided propagation network for one-stage phrase grounding. It explicitly explores the linguistic structure of the sentence and performs relational propagation among noun phrases under the guidance of the linguistic relations between them. Specifically, we first construct a linguistic graph parsed from the sentence and then capture multimodal feature maps for all the phrasal nodes independently. The node features are then propagated over the edges with a tailor-designed relational propagation module and ultimately integrated for final prediction. Experiments on the Flickr30K Entities dataset show that our model outperforms state-of-the-art methods and demonstrate the effectiveness of propagating among phrases with linguistic relations (source code will be available at https://github.com/sibeiyang/lspn).
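
The pipeline sketched in the abstract (parse a linguistic graph from the sentence, build a multimodal feature map per phrasal node, propagate features over relation edges, then predict) can be illustrated with a small PyTorch sketch. The module below is only an assumed, simplified form of relation-guided message passing between phrasal nodes; the layer names (`msg`, `gate`, `update`), tensor shapes, and the sigmoid gating scheme are hypothetical choices for illustration and are not the authors' released implementation.

```python
import torch
import torch.nn as nn


class RelationalPropagation(nn.Module):
    """Relation-guided message passing over the phrasal nodes of a linguistic
    graph. Layer choices and shapes are illustrative assumptions, not the
    authors' released code."""

    def __init__(self, dim=256, rel_dim=300):
        super().__init__()
        self.msg = nn.Conv2d(dim, dim, kernel_size=1)          # transform a neighbour's feature map
        self.gate = nn.Linear(rel_dim, dim)                    # relation-conditioned channel gate
        self.update = nn.Conv2d(2 * dim, dim, kernel_size=1)   # fuse own and aggregated features

    def forward(self, node_feats, edges, rel_embs):
        # node_feats: (N, dim, H, W) multimodal feature map per noun phrase
        # edges:      list of (src, dst) node index pairs from the parsed linguistic graph
        # rel_embs:   (E, rel_dim) embedding of the relation phrase on each edge
        msgs = [torch.zeros_like(node_feats[0]) for _ in range(node_feats.size(0))]
        for (src, dst), rel in zip(edges, rel_embs):
            g = torch.sigmoid(self.gate(rel)).view(-1, 1, 1)        # (dim, 1, 1) per-channel gate
            m = self.msg(node_feats[src].unsqueeze(0)).squeeze(0)   # (dim, H, W) message from neighbour
            msgs[dst] = msgs[dst] + g * m
        agg = torch.stack(msgs)                                     # (N, dim, H, W)
        return torch.relu(self.update(torch.cat([node_feats, agg], dim=1)))


# Toy usage: three phrases and two relation edges (e.g. "man holding frisbee", "dog near man").
feats = torch.randn(3, 256, 32, 32)
edges = [(0, 1), (2, 0)]
rels = torch.randn(2, 300)                         # e.g. word embeddings of "holding" and "near"
out = RelationalPropagation()(feats, edges, rels)  # (3, 256, 32, 32)
```

The gated sum mirrors the idea that how much a neighbouring phrase's evidence should influence a node depends on the linguistic relation connecting them; the exact fusion used in the paper may differ.
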

Keywords

One-stage phrase grounding · Linguistic graph · Relational propagation · Visual grounding

Notes

Acknowledgment

This work is partially supported by the Guangdong Basic and Applied Basic Research Foundation under Grant No. 2020B1515020048 and the National Natural Science Foundation of China under Grant No. U1811463.


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. The University of Hong Kong, Pokfulam, Hong Kong
  2. Sun Yat-sen University, Guangzhou, China
