Graph Edit Distance Reward: Learning to Edit Scene Graph

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12364)


Scene graphs, as a vital tool for bridging the gap between the language and image domains, have been widely adopted in cross-modality tasks such as VQA. In this paper, we propose a new method for editing a scene graph according to user instructions, a task that has not been explored before. Specifically, to learn to edit scene graphs according to the semantics given by text, we propose a Graph Edit Distance Reward, based on policy gradient and graph matching algorithms, to optimize a neural symbolic model. We validate the effectiveness of our method on the CSS and CRIR datasets in the context of text-editing image retrieval. CRIR is a new synthetic dataset generated by us, which we will publish for future use.


Scene graph editing · Policy gradient · Graph matching
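The core idea of the abstract can be illustrated with a minimal sketch (not the authors' code): the negative graph edit distance between a predicted and a target scene graph is used as a REINFORCE reward that scales the log-probability gradient. The `edit_distance` below is a crude multiset-difference stand-in for a full graph-matching solver, and all names and example graphs are hypothetical.

```python
# Hedged sketch: GED-style reward for policy-gradient training.
# Graphs are (nodes, edges) with hashable labels; this edit distance
# only counts node/edge insertions and deletions (a lower bound on
# true GED), standing in for a bipartite graph-matching heuristic.
from collections import Counter

def edit_distance(pred_nodes, pred_edges, tgt_nodes, tgt_edges):
    """Count node/edge insertions + deletions needed to match the target."""
    cost = 0
    for a, b in ((Counter(pred_nodes), Counter(tgt_nodes)),
                 (Counter(pred_edges), Counter(tgt_edges))):
        diff = (a - b) + (b - a)  # symmetric multiset difference
        cost += sum(diff.values())
    return cost

def reinforce_grad(log_prob_grad, reward, baseline=0.0):
    """REINFORCE: scale the log-probability gradient by (reward - baseline)."""
    return [(reward - baseline) * g for g in log_prob_grad]

# Example: the predicted graph is missing one relation from the target.
pred = (["cube", "sphere"], [("cube", "left_of", "sphere")])
tgt = (["cube", "sphere"], [("cube", "left_of", "sphere"),
                            ("sphere", "same_color", "cube")])
reward = -edit_distance(pred[0], pred[1], tgt[0], tgt[1])  # -1: one edit needed
grads = reinforce_grad([0.5, -0.2], reward, baseline=-2.0)
```

A learned or moving-average baseline would normally replace the constant used here; the key point is only that a graph-matching cost can serve as a dense, differentiable-free reward signal.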



This research was supported by the National Research Foundation Singapore under its AI Singapore Programme (Award Number: AISG-RP-2018-003) and the MOE Tier-1 research grants: RG28/18 (S) and RG22/19 (S). Q. Wu’s participation was supported by NSFC 61876208, Key-Area Research and Development Program of Guangdong 2018B010108002.



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Nanyang Technological University, Singapore
  2. Zhejiang University, Hangzhou, China
  3. Huazhong University of Science and Technology, Wuhan, China
  4. School of Software Engineering, South China University of Technology, Guangzhou, China
  5. Key Laboratory of Big Data and Intelligent Robot, Ministry of Education, Beijing, China
