Factorizable Net: An Efficient Subgraph-Based Framework for Scene Graph Generation

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11205)


Generating scene graphs to describe the object interactions inside an image has gained increasing interest in recent years. However, most previous methods use complicated structures with slow inference speed or rely on external data, which limits the use of the model in real-life scenarios. To improve the efficiency of scene graph generation, we propose a subgraph-based connection graph to concisely represent the scene graph during inference. A bottom-up clustering method is first used to factorize the entire graph into subgraphs, where each subgraph contains several objects and a subset of their relationships. By replacing the numerous relationship representations of the scene graph with fewer subgraph and object features, the computation in the intermediate stage is significantly reduced. In addition, spatial information is maintained by the subgraph features, which is leveraged by our proposed Spatial-weighted Message Passing (SMP) structure and Spatial-sensitive Relation Inference (SRI) module to facilitate relationship recognition. On the recent Visual Relationship Detection and Visual Genome datasets, our method outperforms the state-of-the-art method in both accuracy and speed. Code has been made publicly available (
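As a rough illustration of the factorization idea, the sketch below groups candidate object pairs into subgraphs. It is not the authors' implementation: the abstract does not state the clustering criterion, so merging pairs whose union boxes overlap above an IoU threshold is an assumption made here, and the names `union_box`, `iou`, `factorize`, and `thresh` are hypothetical.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def union_box(a: Box, b: Box) -> Box:
    """Smallest box enclosing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes (0.0 if they do not overlap)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def factorize(boxes: List[Box], pairs: List[Tuple[int, int]],
              thresh: float = 0.5) -> List[Dict]:
    """Greedy bottom-up grouping of object pairs into subgraphs.

    ASSUMPTION: a pair joins the first existing subgraph whose region
    overlaps the pair's union box with IoU >= thresh; otherwise it
    starts a new subgraph. The criterion is illustrative only.
    """
    subgraphs: List[Dict] = []
    for s, o in pairs:
        region = union_box(boxes[s], boxes[o])
        for sg in subgraphs:
            if iou(sg["region"], region) >= thresh:
                sg["objects"].update((s, o))
                sg["pairs"].append((s, o))
                break
        else:
            subgraphs.append({"region": region, "objects": {s, o},
                              "pairs": [(s, o)]})
    return subgraphs


# Toy input: three heavily overlapping objects and one distant object.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (0, 0, 10, 11), (50, 50, 60, 60)]
pairs = [(0, 1), (0, 2), (1, 2), (2, 3)]
subgraphs = factorize(boxes, pairs)
for sg in subgraphs:
    print(sorted(sg["objects"]), sg["pairs"])
```

In this toy input, four candidate relationships are covered by two subgraph regions, which is the kind of reduction in intermediate representations the abstract describes.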


Visual Relationship Detection · Scene graph generation · Scene understanding · Object interactions · Language and vision



This work is supported by the Hong Kong Ph.D. Fellowship Scheme, SenseTime Group Limited, Samsung Telecommunication Research Institute, the General Research Fund sponsored by the Research Grants Council of Hong Kong (Project Nos. CUHK14213616, CUHK14206114, CUHK14205615, CUHK419412, CUHK14203015, CUHK14207814, CUHK14208417, CUHK14202217, and CUHK14239816), and the Hong Kong Innovation and Technology Support Programme (No. ITS/121/15FX).

Supplementary material

Supplementary material 1: 474172_1_En_21_MOESM1_ESM.pdf (PDF, 1341 KB)



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. The Chinese University of Hong Kong, Hong Kong, Hong Kong SAR, China
  2. SenseTime Computer Vision Research Group, The University of Sydney, Sydney, Australia
  3. MIT CSAIL, Cambridge, USA
  4. Sensetime Ltd., Beijing, China
  5. Samsung Telecommunication Research Institute, Beijing, China
