Abstract
Content-based image retrieval is the process of retrieving a subset of images from a large gallery based on visual content, such as color, shape, spatial relations, and texture. In some applications, such as localization, image retrieval serves as the initial step; in such cases, the accuracy of the top-retrieved images strongly affects overall system accuracy. This paper introduces a simple yet efficient image retrieval system with fewer trainable parameters that offers acceptable accuracy on top-retrieved images. The proposed method uses a dilated residual convolutional neural network trained with a triplet loss. Experimental evaluations show that this model extracts richer information (i.e., high-resolution representations) by enlarging the receptive field, thereby improving retrieval accuracy without increasing the depth or complexity of the model. To make the extracted representations more robust, we obtain candidate regions of interest from each feature map and apply Generalized-Mean (GeM) pooling to those regions. Because the choice of triplets in a triplet-based network affects model training, we employ an online triplet mining method. We evaluate the proposed method under various configurations on two challenging image-retrieval datasets, Revisited Paris6k (RPar) and UKBench. The experimental results show accuracies of 94.54 and 80.23 (mean precision at rank 10) in the RPar medium and hard modes, respectively, and 3.86 (recall at rank 4) on UKBench.
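The two building blocks named in the abstract, GeM pooling and the triplet loss, can be illustrated with a minimal, dependency-free sketch. This is not the paper's implementation; the function names, the pooling exponent `p = 3`, and the margin of 0.3 are illustrative assumptions.

```python
import math


def gem_pool(activations, p=3.0, eps=1e-6):
    """Generalized-Mean (GeM) pooling over one channel's activations.

    p = 1 reduces to average pooling; as p grows, the result approaches
    max pooling. Values are clamped to eps so x ** p is well defined.
    """
    clamped = [max(x, eps) for x in activations]
    mean_pow = sum(x ** p for x in clamped) / len(clamped)
    return mean_pow ** (1.0 / p)


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge-style triplet loss: push the anchor-positive distance below
    the anchor-negative distance by at least `margin`; zero otherwise."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)


# GeM with p = 1 is the plain mean; p = 3 sits between mean and max.
feats = [0.1, 0.5, 0.9, 0.3]
avg = gem_pool(feats, p=1.0)   # 0.45, the arithmetic mean
gem = gem_pool(feats, p=3.0)   # between avg and max(feats)

# A triplet whose negative is already far enough incurs zero loss.
zero = triplet_loss([0.0, 0.0], [0.0, 0.0], [1.0, 0.0])
```

Online triplet mining, as in the paper, would then select which (anchor, positive, negative) triples to feed this loss within each batch rather than fixing them in advance.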
Cite this article
Yousefzadeh, S., Pourreza, H. & Mahyar, H. A Triplet-loss Dilated Residual Network for High-Resolution Representation Learning in Image Retrieval. J Sign Process Syst 95, 529–541 (2023). https://doi.org/10.1007/s11265-023-01865-9