Graph R-CNN for Scene Graph Generation

  • Jianwei Yang
  • Jiasen Lu
  • Stefan Lee
  • Dhruv Batra
  • Devi Parikh
Conference paper, part of the Lecture Notes in Computer Science book series (LNCS, volume 11205)


We propose a novel scene graph generation model called Graph R-CNN, which is both effective and efficient at detecting objects and their relations in images. Our model contains a Relation Proposal Network (RePN) that efficiently deals with the quadratic number of potential relations between objects in an image. We also propose an attentional Graph Convolutional Network (aGCN) that effectively captures contextual information between objects and relations. Finally, we introduce a new evaluation metric that is more holistic and realistic than existing metrics. We report state-of-the-art performance on scene graph generation as evaluated using both existing and our proposed metrics.
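The attentional graph convolution mentioned above learns per-edge weights instead of using a fixed adjacency. As a rough illustration of that idea (not the authors' implementation; the function `agcn_layer` and the attention parameter `Wa` are invented for this sketch), one attention-weighted aggregation step over a proposed relation graph might look like:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def agcn_layer(H, A, W, Wa):
    """One attentional graph-convolution step (illustrative sketch).

    H  : (n, d)   node features (e.g. object/relation representations)
    A  : (n, n)   binary adjacency from the relation proposals
    W  : (d, d2)  feature-transform weights
    Wa : (2*d,)   attention weights scoring each connected (i, j) pair
    """
    n, _ = H.shape
    # Score every proposed edge with a simple learned attention function;
    # non-edges get -inf so they receive zero attention after softmax.
    scores = np.full((n, n), -np.inf)
    for i in range(n):
        for j in range(n):
            if A[i, j]:
                scores[i, j] = np.concatenate([H[i], H[j]]) @ Wa
    alpha = softmax(scores, axis=1)        # attention over each node's neighbours
    return np.maximum(alpha @ H @ W, 0.0)  # aggregate, transform, ReLU
```

The key difference from a plain GCN is that `alpha` is computed from the node features themselves, so the network can down-weight unreliable relation proposals rather than treating all edges equally.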


Keywords: Graph R-CNN · Scene graph generation · Relation proposal network · Attentional graph convolutional network



This work was supported in part by NSF, AFRL, DARPA, Siemens, Google, Amazon, ONR YIPs and ONR Grants N00014-16-1-{2713,2793}.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Georgia Institute of Technology, Atlanta, USA
  2. Facebook AI Research, Menlo Park, USA
