Learning Relationship-Aware Visual Features

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11132)


Relational reasoning in Computer Vision has recently shown impressive results on visual question answering tasks. On the challenging CLEVR dataset, the recently proposed Relation Network (RN), a simple plug-and-play module and one of the state-of-the-art approaches, has reached 95.5% accuracy in answering relational questions. In this paper, we define a sub-field of Content-Based Image Retrieval (CBIR) called Relational-CBIR (R-CBIR), in which the goal is to retrieve images that exhibit given relationships among objects. To this end, we employ the RN architecture to extract relation-aware features from CLEVR images. To assess the effectiveness of these features, we extended both the CLEVR and Sort-of-CLEVR datasets, generating a ground truth for R-CBIR by exploiting the relational data embedded in their scene graphs. Furthermore, we propose a modification of the RN module, a two-stage Relation Network (2S-RN), which lets us extract relation-aware features through a preprocessing stage that focuses on the image content and sets the question aside. Experiments show that our RN features, especially the 2S-RN ones, outperform the state-of-the-art RMAC features on this new and challenging task.
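As a rough illustration of the idea of a question-free, relation-aware feature, the sketch below applies a small pairwise function g to every pair of object descriptors, sums the outputs, and ranks a toy database by cosine similarity to a query. It is a minimal sketch under stated assumptions, not the architecture evaluated in the paper: the sum over pairs follows the general RN formulation of Santoro et al., while the single-layer `g`, the random weights, the 4-dimensional object descriptors, the cosine ranking, and all function names are hypothetical choices made only for this example.

```python
import itertools
import math
import random


def dense_relu(x, weights):
    """Apply one fully connected layer followed by ReLU (rows = output units)."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]


def relation_feature(objects, g_weights):
    """Sum g(o_i, o_j) over all object pairs; no question is involved."""
    total = [0.0] * len(g_weights)
    for o_i, o_j in itertools.product(objects, repeat=2):
        pair_out = dense_relu(o_i + o_j, g_weights)  # concatenate the pair
        total = [t + p for t, p in zip(total, pair_out)]
    return total


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def rank_by_relation(query_objects, database, g_weights):
    """Return database indices sorted from most to least similar to the query."""
    q = relation_feature(query_objects, g_weights)
    sims = [cosine(q, relation_feature(objs, g_weights)) for objs in database]
    return sorted(range(len(database)), key=lambda i: -sims[i])


# Toy data (assumption): every "object" is a hypothetical 4-dimensional descriptor.
rng = random.Random(0)
OBJ_DIM, FEAT_DIM = 4, 8
g_weights = [[rng.uniform(-1, 1) for _ in range(2 * OBJ_DIM)]
             for _ in range(FEAT_DIM)]
database = [[[rng.uniform(0, 1) for _ in range(OBJ_DIM)] for _ in range(3)]
            for _ in range(5)]
query = list(reversed(database[2]))  # same scene, objects listed in another order
ranking = rank_by_relation(query, database, g_weights)
```

Because the feature is a sum over all pairs, it does not depend on the order in which the objects are listed, so the reordered copy of scene 2 receives the same feature as the original up to floating-point rounding. In a trained network, `g_weights` would be learned rather than sampled at random.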


Keywords: CLEVR · Content-based image retrieval · Deep learning · Relational reasoning · Relation networks · Deep features



This work was partially supported by Smart News, Social sensing for breaking news, co-funded by the Tuscany region under the FAR-FAS 2014 program, CUP CIPE D58C15000270008, and Automatic Data and documents Analysis to enhance human-based processes (ADA), CUP CIPE D55F17000290009.

We are very grateful to the DeepMind team (Santoro et al.), who kindly assisted us during the replication of their work on Relation Networks.

We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research.


  1. Krawczyk, D.C., McClelland, M.M., Donovan, C.M.: A hierarchy for relational reasoning in the prefrontal cortex. Cortex 47, 588–597 (2011)
  2. Antol, S., et al.: VQA: visual question answering. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2425–2433 (2015)
  3. Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Zitnick, C.L., Girshick, R.: CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning (2017)
  4. Krishna, R., et al.: Visual genome: connecting language and vision using crowdsourced dense image annotations (2016)
  5. Lu, C., Krishna, R., Bernstein, M., Fei-Fei, L.: Visual relationship detection with language priors. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016, Part I. LNCS, vol. 9905, pp. 852–869. Springer, Cham (2016)
  6. Peyre, J., Laptev, I., Schmid, C., Sivic, J.: Weakly-supervised learning of visual relations. In: ICCV 2017 - International Conference on Computer Vision 2017, Venice, Italy, October 2017
  7. Johnson, J., et al.: Image retrieval using scene graphs. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3668–3678 (2015)
  8. Hu, R., Rohrbach, M., Darrell, T.: Segmentation from natural language expressions. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016, Part I. LNCS, vol. 9905, pp. 108–124. Springer, Cham (2016)
  9. Dai, B., Zhang, Y., Lin, D.: Detecting visual relationships with deep relational networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3298–3308. IEEE (2017)
  10. Zhang, J., Kalantidis, Y., Rohrbach, M., Paluri, M., Elgammal, A.M., Elhoseiny, M.: Large-scale visual relationship understanding. CoRR abs/1804.10660 (2018)
  11. Zhou, B., Tian, Y., Sukhbaatar, S., Szlam, A., Fergus, R.: Simple baseline for visual question answering. CoRR abs/1512.02167 (2015)
  12. Yang, Z., He, X., Gao, J., Deng, L., Smola, A.J.: Stacked attention networks for image question answering. CoRR abs/1511.02274 (2015)
  13. Santoro, A., et al.: A simple neural network module for relational reasoning. CoRR abs/1706.01427 (2017)
  14. Raposo, D., Santoro, A., Barrett, D.G.T., Pascanu, R., Lillicrap, T.P., Battaglia, P.W.: Discovering objects and their relations from entangled scene representations. CoRR abs/1702.05068 (2017)
  15. Hu, R., Andreas, J., Rohrbach, M., Darrell, T., Saenko, K.: Learning to reason: end-to-end module networks for visual question answering. CoRR abs/1704.05526 (2017)
  16. Johnson, J., et al.: Inferring and executing programs for visual reasoning. CoRR abs/1705.03633 (2017)
  17. Perez, E., de Vries, H., Strub, F., Dumoulin, V., Courville, A.C.: Learning visual reasoning without strong priors. CoRR abs/1707.03017 (2017)
  18. Perez, E., Strub, F., de Vries, H., Dumoulin, V., Courville, A.C.: FiLM: visual reasoning with a general conditioning layer. CoRR abs/1709.07871 (2017)
  19. Belilovsky, E., Blaschko, M.B., Kiros, J.R., Urtasun, R., Zemel, R.: Joint embeddings of scene graphs and images. In: ICLR (2017)
  20. Cai, H., Zheng, V.W., Chang, K.C.: A comprehensive survey of graph embedding: problems, techniques and applications. CoRR abs/1709.07604 (2017)
  21. Abu-Aisheh, Z., Raveaux, R., Ramel, J.Y., Martineau, P.: An exact graph edit distance algorithm for solving pattern recognition problems 1 (2015)
  22. Riesen, K., Bunke, H.: Approximate graph edit distance computation by means of bipartite graph matching. Image Vis. Comput. 27(7), 950–959 (2009). 7th IAPR-TC15 Workshop on Graph-based Representations (GbR 2007)
  23. Melucci, M.: On rank correlation in information retrieval evaluation. SIGIR Forum 41(1), 18–33 (2007)
  24. Tolias, G., Sicre, R., Jégou, H.: Particular object retrieval with integral max-pooling of CNN activations. arXiv preprint arXiv:1511.05879 (2015)
  25. Gordo, A., Almazan, J., Revaud, J., Larlus, D.: End-to-end learning of deep visual representations for image retrieval. arXiv preprint arXiv:1610.07940 (2016)

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. ISTI-CNR, Pisa, Italy
