
A Triplet-loss Dilated Residual Network for High-Resolution Representation Learning in Image Retrieval

Published in the Journal of Signal Processing Systems

Abstract

Content-based image retrieval is the process of retrieving a subset of images from an extensive image gallery based on visual content, such as color, shape, spatial relations, and texture. In some applications, such as localization, image retrieval is employed as the initial step; in such cases, the accuracy of the top-retrieved images significantly affects overall system accuracy. This paper introduces a simple yet efficient image retrieval system with fewer trainable parameters that offers acceptable accuracy in the top-retrieved images. The proposed method benefits from a dilated residual convolutional neural network trained with a triplet loss. Experimental evaluations show that this model can extract richer information (i.e., high-resolution representations) by enlarging the receptive field, thus improving image retrieval accuracy without increasing the depth or complexity of the model. To enhance the robustness of the extracted representations, we obtain candidate regions of interest from each feature map and apply Generalized-Mean (GeM) pooling to these regions. Because the choice of triplets in a triplet-based network affects model training, we employ an online triplet mining method. We test the performance of the proposed method under various configurations on two challenging image-retrieval datasets, namely Revisited Paris6k (RPar) and UKBench. The experimental results show a mean precision at rank 10 of 94.54 and 80.23 in the RPar medium and hard modes, respectively, and a recall at rank 4 of 3.86 on the UKBench dataset.
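The full text is paywalled, so the exact architecture is not reproduced here, but the core idea the abstract describes, enlarging the receptive field with dilated convolutions instead of extra depth or downsampling, is straightforward to sketch. Below is a minimal, hypothetical PyTorch illustration of a dilated residual block; the channel count, dilation rate, and layer arrangement are assumptions, not the authors' exact design.

```python
import torch.nn as nn

class DilatedResidualBlock(nn.Module):
    """Residual block whose 3x3 convolutions are dilated, enlarging the
    receptive field without extra depth or downsampling. A hypothetical
    sketch of the idea, not the paper's exact architecture."""

    def __init__(self, channels: int, dilation: int = 2):
        super().__init__()
        # padding = dilation keeps the spatial size of a 3x3 kernel unchanged,
        # which is what preserves the high-resolution feature maps.
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=dilation,
                               dilation=dilation, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=dilation,
                               dilation=dilation, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # identity shortcut, resolution preserved
```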
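Generalized-Mean (GeM) pooling, which the method applies to candidate regions of each feature map, interpolates between average pooling (p = 1) and max pooling (p → ∞) through a learnable exponent. A minimal PyTorch sketch of global GeM pooling follows; the initialization p = 3 and the clamp epsilon are common defaults rather than values confirmed by the paper, and the per-region variant would pool each region of interest separately instead of the whole map.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeM(nn.Module):
    """Generalized-Mean pooling with a learnable exponent p.

    p = 1 reduces to average pooling; p -> infinity approaches max pooling.
    p = 3 and eps are common defaults, not necessarily the paper's settings.
    """

    def __init__(self, p: float = 3.0, eps: float = 1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.ones(1) * p)  # learned with the network
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, H, W) feature maps from the backbone
        x = x.clamp(min=self.eps).pow(self.p)
        x = F.avg_pool2d(x, kernel_size=(x.size(-2), x.size(-1)))  # global mean
        return x.pow(1.0 / self.p).flatten(1)  # (batch, channels) descriptor
```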
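Online triplet mining builds triplets inside each mini-batch rather than fixing them offline. The abstract confirms online mining but not the exact strategy, so the sketch below uses the common batch-hard variant (hardest positive and hardest negative per anchor); the margin value is an assumption.

```python
import torch

def batch_hard_triplet_loss(embeddings: torch.Tensor,
                            labels: torch.Tensor,
                            margin: float = 0.1) -> torch.Tensor:
    """Batch-hard online triplet mining (one common variant, assumed here).

    For each anchor, take the farthest same-label sample (hardest positive)
    and the closest different-label sample (hardest negative) within the
    batch, then apply the triplet hinge. Assumes every label occurs at
    least twice in the batch.
    """
    # Pairwise Euclidean distances between all embeddings in the batch.
    dists = torch.cdist(embeddings, embeddings, p=2)  # (B, B)

    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # (B, B) bool
    pos_mask = same & ~torch.eye(len(labels), dtype=torch.bool,
                                 device=labels.device)
    neg_mask = ~same

    # Hardest positive: maximum distance among same-label pairs.
    hardest_pos = (dists * pos_mask).max(dim=1).values
    # Hardest negative: minimum distance among different-label pairs
    # (positives and the diagonal are masked out with a large constant).
    hardest_neg = (dists + 1e9 * (~neg_mask)).min(dim=1).values

    return torch.relu(hardest_pos - hardest_neg + margin).mean()
```

A sampler that guarantees several images per landmark in each batch (e.g., P classes with K images each) is the usual companion to a loss like this, since every anchor needs at least one in-batch positive.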




Notes

  1. Scale-Invariant Feature Transform.

  2. Bag of Words.

  3. Vector of Locally Aggregated Descriptors.

  4. https://www.flickr.com/


Author information


Corresponding author

Correspondence to Hamidreza Pourreza.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Yousefzadeh, S., Pourreza, H. & Mahyar, H. A Triplet-loss Dilated Residual Network for High-Resolution Representation Learning in Image Retrieval. J Sign Process Syst 95, 529–541 (2023). https://doi.org/10.1007/s11265-023-01865-9


