MSRC: multimodal spatial regression with semantic context for phrase grounding


Given a textual description of an image, phrase grounding localizes the objects in the image that are referred to by query phrases in the description. State-of-the-art methods treat phrase grounding as a ranking problem and address it by retrieving a set of proposals according to the query’s semantics; these methods are limited by the performance of independent proposal generation systems and ignore useful cues from context in the description. In this paper, we propose a novel multimodal spatial regression with semantic context (MSRC) system, which not only predicts the location of the ground truth based on proposal bounding boxes, but also refines the prediction results by penalizing similarities between different queries from the same sentence. MSRC has two advantages. First, it sidesteps the performance upper bound imposed by independent proposal generation systems by adopting a regression mechanism. Second, it not only encodes the semantics of a query phrase, but also considers its relation to context (i.e., other queries from the same sentence) via a context refinement network. Experiments show that MSRC achieves a significant improvement in accuracy on two popular datasets, Flickr30K Entities and Refer-it Game, with increases of 6.64% and 5.28% over the state of the art, respectively.
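
The two ideas named in the abstract, regressing a box from a proposal and penalizing similar predictions for different queries of the same sentence, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the offset parameterization `(tx, ty, tw, th)` is the common R-CNN-style one, box overlap (IoU) stands in for the "similarity" between queries, and the function names `decode_box`, `iou`, and `context_penalty` are hypothetical.

```python
import math


def decode_box(proposal, offsets):
    """Apply regression offsets (tx, ty, tw, th) to a proposal (x1, y1, x2, y2).

    Assumed R-CNN-style parameterization, not confirmed as the MSRC formulation.
    """
    x1, y1, x2, y2 = proposal
    tx, ty, tw, th = offsets
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    # Shift the center in units of the proposal size; scale the size in log space.
    new_cx, new_cy = cx + tx * w, cy + ty * h
    new_w, new_h = w * math.exp(tw), h * math.exp(th)
    return (new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
            new_cx + 0.5 * new_w, new_cy + 0.5 * new_h)


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def context_penalty(boxes):
    """Sum of pairwise IoU over boxes predicted for different queries of one sentence.

    Assumption: overlap is used here as a simple proxy for query similarity.
    """
    total = 0.0
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            total += iou(boxes[i], boxes[j])
    return total


if __name__ == "__main__":
    proposal = (10, 10, 50, 50)
    refined = decode_box(proposal, (0.1, 0.0, 0.0, 0.0))  # shift right by 10% of width
    other = decode_box((60, 60, 90, 90), (0.0, 0.0, 0.0, 0.0))
    print(refined, other, context_penalty([refined, other]))
```

Because the regressed box is not restricted to the proposal set, a prediction of this kind can in principle land closer to the ground truth than any single proposal. A penalty of this shape grows when two different phrases of one sentence are mapped to nearly the same region, which is the intuition behind using context to refine predictions.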







This paper is based, in part, on research sponsored by the Air Force Research Laboratory and the Defense Advanced Research Projects Agency under Agreement No. FA8750-16-2-0204. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Research Laboratory and the Defense Advanced Research Projects Agency or the U.S. Government.

Author information



Corresponding author

Correspondence to Kan Chen.


Cite this article

Chen, K., Kovvuri, R., Gao, J. et al. MSRC: multimodal spatial regression with semantic context for phrase grounding. Int J Multimed Info Retr 7, 17–28 (2018).



  • Phrase grounding
  • Spatial regression
  • Multimodal
  • Context