Fusion of Global and Local Deep Representation for Effective Object Retrieval

Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 768)


Recently, the Regional Maximum Activation of Convolutions (R-MAC) feature, built upon Convolutional Neural Networks (CNNs), has shown great representation power in object retrieval. Combined with a re-ranking step based on object localization and a post-processing step based on query expansion, R-MAC achieves good retrieval accuracy. However, we have found that performing retrieval with the R-MAC feature extracted only from the local query object results in a loss of global information. In this paper, we propose to fuse global and local R-MAC features to improve retrieval accuracy. Specifically, we use the global R-MAC feature derived from the entire image to issue the initial query, which ranks more positive results near the top. The global and local R-MAC features are then concatenated to represent the image for re-ranking and query expansion, yielding a more comprehensive descriptor for object retrieval that also guards against possible failures in the object localization step. In addition, the concatenation is performed on the fly and requires no extra storage. Experimental results on the public Oxford5k, Paris6k, Oxford105k, and Paris106k datasets demonstrate that the proposed approach improves retrieval accuracy at negligible computation and memory cost.
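
The three stages described above (initial query with the global descriptor, re-ranking with the concatenated descriptor, then query expansion) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `fused_retrieval`, the parameters `top_k` and `qe_k`, cosine similarity as the ranking score, and average query expansion over the re-ranked shortlist are all hypothetical choices. The descriptors are assumed to be precomputed; in particular, `db_local` stands in for the region descriptors that an object localization step would produce for each database image.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit L2 norm along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def fused_retrieval(q_global, q_local, db_global, db_local, top_k=100, qe_k=5):
    """Return database indices ranked for one query (hypothetical sketch).

    q_global, q_local   : 1-D query descriptors (whole image / query object).
    db_global, db_local : 2-D arrays, one row per database image.
    """
    q_global = l2_normalize(np.asarray(q_global, dtype=float))
    q_local = l2_normalize(np.asarray(q_local, dtype=float))
    db_global = l2_normalize(np.asarray(db_global, dtype=float))
    db_local = l2_normalize(np.asarray(db_local, dtype=float))

    # Stage 1: initial ranking with the global descriptor only.
    init_rank = np.argsort(-(db_global @ q_global), kind="stable")
    shortlist, rest = init_rank[:top_k], init_rank[top_k:]

    # Stage 2: concatenate global and local descriptors on the fly
    # (only for the shortlist, so nothing extra is stored) and re-rank.
    q_fused = l2_normalize(np.concatenate([q_global, q_local]))
    db_fused = l2_normalize(
        np.concatenate([db_global[shortlist], db_local[shortlist]], axis=1)
    )
    order = np.argsort(-(db_fused @ q_fused), kind="stable")

    # Stage 3: average query expansion (an assumed variant): add the top
    # qe_k fused descriptors to the query and re-score the shortlist.
    q_expanded = l2_normalize(q_fused + db_fused[order[:qe_k]].sum(axis=0))
    order = np.argsort(-(db_fused @ q_expanded), kind="stable")

    # Shortlist in its final order, followed by the untouched remainder.
    return np.concatenate([shortlist[order], rest])
```

Because the fused descriptor is built only for the `top_k` shortlist at query time, the database needs to hold nothing beyond the descriptors it already stores, which is the property the abstract refers to as concatenation "on the fly".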


Keywords: Object retrieval, Feature fusion, Deep feature, CNN



The authors gratefully acknowledge the financial support of the National Natural Science Foundation of China (Project Nos. 61672528, 61403405, 61232016, 61170287).



Copyright information

© Springer Nature Singapore Pte Ltd. 2017

Authors and Affiliations

  • Mao Wang (1)
  • Yuewei Ming (1)
  • Qiang Liu (1)
  • Jianping Yin (2)
  1. College of Computer, National University of Defense Technology, Changsha, China
  2. State Key Laboratory of High Performance Computing, National University of Defense Technology, Changsha, China
