Learning visual features for relational CBIR

  • Nicola Messina (corresponding author)
  • Giuseppe Amato
  • Fabio Carrara
  • Fabrizio Falchi
  • Claudio Gennaro
Regular Paper


Recent work in deep learning has highlighted the remarkable relational reasoning capabilities of some carefully designed architectures. In this work, we employ a relationship-aware deep learning model to extract compact visual features used as relational image descriptors. In particular, we are interested in relational content-based image retrieval (R-CBIR), the task of finding images that contain similar inter-object relationships. Inspired by the relation networks (RN) employed in relational visual question answering (R-VQA), we present novel architectures that explicitly capture relational information from images in the form of network activations, which can subsequently be extracted and used as visual features. We describe a two-stage relation network module (2S-RN), trained on the R-VQA task, that is able to collect non-aggregated visual features. We then propose the aggregated visual features relation network (AVF-RN) module, which produces better relationship-aware features by learning the aggregation directly inside the network. We employ an R-CBIR ground truth built by exploiting the scene-graph similarities available in the CLEVR dataset in order to rank images in a relational fashion. Experiments show that features extracted from our 2S-RN model provide improved retrieval performance with respect to standard non-relational methods. Moreover, we demonstrate that the features extracted from the novel AVF-RN further improve the performance measured on the R-CBIR task, reaching the state of the art on the proposed dataset.
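As a rough illustration of the idea of using pairwise network activations as image descriptors, the sketch below applies a shared function to every ordered pair of object descriptors, optionally sums the results into one vector per image, and ranks images by cosine similarity. It is not the 2S-RN or AVF-RN architecture: the function names, the single ReLU layer, the random weights, the fixed sum aggregation, and the toy data are all assumptions made for illustration, and question conditioning is omitted entirely.

```python
import numpy as np

def relational_features(objects, w_g, aggregate=True):
    """Apply a shared one-layer ReLU map g to every ordered object pair.

    objects: (n, d) array of per-object descriptors (hypothetical input).
    w_g: (2 * d, k) weights of the illustrative pairwise function g.
    Returns a (k,) vector if aggregate is True, else an (n * n, k) matrix.
    """
    n, _ = objects.shape
    # Build all n * n ordered pairs (o_i, o_j) by concatenation.
    left = np.repeat(objects, n, axis=0)    # o_i, repeated once per j
    right = np.tile(objects, (n, 1))        # o_j, cycling for each i
    pairs = np.concatenate([left, right], axis=1)
    pair_feats = np.maximum(pairs @ w_g, 0.0)
    # Summing over pairs gives one order-invariant descriptor per image.
    return pair_feats.sum(axis=0) if aggregate else pair_feats

def rank_by_cosine(query, database):
    """Return database indices sorted from most to least similar to query."""
    q = query / (np.linalg.norm(query) + 1e-12)
    db = database / (np.linalg.norm(database, axis=1, keepdims=True) + 1e-12)
    return np.argsort(-(db @ q), kind="stable")

# Toy data (assumption): 5 "images", each with 3 objects of dimension 4.
rng = np.random.default_rng(0)
w_g = rng.normal(size=(2 * 4, 8))
images = [rng.normal(size=(3, 4)) for _ in range(5)]
features = np.stack([relational_features(im, w_g) for im in images])
ranking = rank_by_cosine(features[0], features)
```

Here `aggregate=False` exposes the per-pair activations, while `aggregate=True` collapses them with a fixed sum; a learned aggregation, as the abstract describes for AVF-RN, would replace that fixed sum with trainable layers.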


CLEVR; Content-based image retrieval; Deep learning; Relational reasoning; Relation networks; Deep features



This work was partially supported by Automatic Data and documents Analysis to enhance human-based processes (ADA), CUP CIPE D55F17000290009, and by the AI4EU project, funded by the EC (H2020—Contract no. 825619). We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research.



Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2019

Authors and Affiliations

  • Nicola Messina
    • 1
  • Giuseppe Amato
    • 1
  • Fabio Carrara
    • 1
  • Fabrizio Falchi
    • 1
  • Claudio Gennaro
    • 1
  1. Pisa, Italy
