MSRC: multimodal spatial regression with semantic context for phrase grounding

  • Kan ChenEmail author
  • Rama Kovvuri
  • Jiyang Gao
  • Ram Nevatia
Regular Paper


Given a textual description of an image, phrase grounding localizes objects in the image referred by query phrases in the description. State-of-the-art methods treat phrase grounding as a ranking problem and address it by retrieving a set of proposals according to the query’s semantics, which are limited by the performance of independent proposal generation systems and ignore useful cues from context in the description. In this paper, we propose a novel multimodal spatial regression with semantic context (MSRC) system which not only predicts the location of ground truth based on proposal bounding boxes, but also refines prediction results by penalizing similarities of different queries coming from same sentences. There are two advantages of MSRC: First, it sidesteps the performance upper bound from independent proposal generation systems by adopting regression mechanism. Second, MSRC not only encodes the semantics of a query phrase, but also considers its relation with context (i.e., other queries from the same sentence) via a context refinement network. Experiments show MSRC system achieves a significant improvement in accuracy on two popular datasets: Flickr30K Entities and Refer-it Game, with 6.64 and 5.28% increase over the state of the arts, respectively.


Phrase grounding Spatial regression Multimodal context 



This paper is based, in part, on research sponsored by the Air Force Research Laboratory and the Defense Advanced Research Projects Agency under Agreement No. FA8750-16-2-0204. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Research Laboratory and the Defense Advanced Research Projects Agency or the U.S. Government.


  1. 1.
    Andrej K, Li FF (2015) Deep visual-semantic alignments for generating image descriptions. In: CVPRGoogle Scholar
  2. 2.
    Antol S, Agrawal A, Lu J, Mitchell M, Batra D, Lawrence ZC, Parikh D (2015) Vqa: visual question answering. In: ICCVGoogle Scholar
  3. 3.
    Chen K, Wang J, Chen LC, Gao H, Xu W, Nevatia R (2016) ABC-CNN: an attention based convolutional neural network for visual question answering. In: CVPR WorkshopGoogle Scholar
  4. 4.
    Chen K, Bui T, Fang C, Wang Z, Nevatia R (2017) AMC: attention guided multi-modal correlation learning for image search. In: CVPRGoogle Scholar
  5. 5.
    Chen K, Kovvuri R, Nevatia R (2017) Query-guided regression network with context policy for phrase grounding. In: ICCVGoogle Scholar
  6. 6.
    Deng J, Dong W, Socher R, Li LJ, Li K, Li FF (2009) Imagenet: a large-scale hierarchical image database. In: CVPRGoogle Scholar
  7. 7.
    Everingham M, Van Gool L, Williams CKI, Winn J, Zisserman A (2010) The PASCAL Visual Object Classes Challenge. In: IJCVGoogle Scholar
  8. 8.
    Fang H, Gupta S, Iandola F, Srivastava RK, Deng L, Dollár P, Gao J, He X, Mitchell M, Platt JC, et al (2015) From captions to visual concepts and back. In: CVPRGoogle Scholar
  9. 9.
    Fukui A, Park DH, Yang D, Rohrbach A, Darrell T, Rohrbach M (2016) Multimodal compact bilinear pooling for visual question answering and visual grounding. In: EMNLPGoogle Scholar
  10. 10.
    Girshick R (2015) Fast r-cnn. In: ICCVGoogle Scholar
  11. 11.
    Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In: AistatsGoogle Scholar
  12. 12.
    Gordo A, Almazán J, Revaud J, Larlus D (2016) Deep image retrieval: learning global representations for image search. In: ECCVGoogle Scholar
  13. 13.
    He K, Zhang X, Ren S, Sun J (2015) Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In: CVPRGoogle Scholar
  14. 14.
    Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9:1735–1780CrossRefGoogle Scholar
  15. 15.
    Hu R, Xu H, Rohrbach M, Feng J, Saenko K, Darrell T (2016) Natural language object retrieval. In: CVPRGoogle Scholar
  16. 16.
    Justin J, Andrej K, Li FF (2016) Densecap: fully convolutional localization networks for dense captioning. In: CVPRGoogle Scholar
  17. 17.
    Kazemzadeh S, Ordonez V, Matten M, Berg TL (2014) Referit game: referring to objects in photographs of natural scenes. In: EMNLPGoogle Scholar
  18. 18.
    Kantorov V, Oquab M, Cho M, Laptev I (2016) Contextlocnet: context-aware deep network models for weakly supervised localization. In: ECCVGoogle Scholar
  19. 19.
    Karpathy A, Joulin A, Li FF (2014) Deep fragment embeddings for bidirectional image sentence mapping. In: NIPSGoogle Scholar
  20. 20.
    Kingma D, Ba J (2015) Adam: a method for stochastic optimization. In: ICLRGoogle Scholar
  21. 21.
    Krishnamurthy J, Kollar T (2013) Jointly learning to parse and perceive: Connecting natural language to the physical world. In: TACLGoogle Scholar
  22. 22.
    Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, Berg AC (2016) SSD: single shot multibox detector. In: ECCVGoogle Scholar
  23. 23.
    Matuszek C, FitzGerald N, Zettlemoyer L, Bo L, Fox D (2012) A joint model of language and perception for grounded attribute learning. In: ICMLGoogle Scholar
  24. 24.
    Nagaraja VK, Morariu VI, Davis LS (2016) Modeling context between objects for referring expression understanding. In: ECCVGoogle Scholar
  25. 25.
    Plummer BA, Wang L, Cervantes CM, Caicedo JC, Hockenmaier J, Lazebnik S (2016) Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In: IJCVGoogle Scholar
  26. 26.
    Radenović F, Tolias G, Chum O (2016) CNN image retrieval learns from bow: unsupervised fine-tuning with hard examples. In: ECCVGoogle Scholar
  27. 27.
    Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: unified real-time object detection. In: CVPRGoogle Scholar
  28. 28.
    Ren S, He K, Girshick R, Sun J (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In: NIPSGoogle Scholar
  29. 29.
    Rohrbach A, Rohrbach M, Hu R, Darrell T, Schiele B (2016) Grounding of textual phrases in images by reconstruction. In: ECCVGoogle Scholar
  30. 30.
    Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. In: CoRRGoogle Scholar
  31. 31.
    Uijlings JR, Van De Sande KE, Gevers T, Smeulders AW (2013) Selective search for object recognition. In: IJCVGoogle Scholar
  32. 32.
    Wang M, Azab M, Kojima N, Mihalcea R, Deng J (2016) Structured matching for phrase localization. In: ECCVGoogle Scholar
  33. 33.
    Yu L, Poirson P, Yang S, Berg AC, Berg TL (2016) Modeling context in referring expressions. In: ECCVGoogle Scholar
  34. 34.
    Zitnick CL, Dollár P (2014) Edge boxes: locating object proposals from edges. In: ECCVGoogle Scholar

Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2017

Authors and Affiliations

  1. 1.Institute for Robotics and Intelligent SystemsUniversity of Southern CaliforniaLos AngelesUSA

Personalised recommendations