International Journal of Computer Vision, Volume 120, Issue 1, pp 1–13

Accurate Image Search with Multi-Scale Contextual Evidences


Abstract

This paper considers the task of image search using the Bag-of-Words (BoW) model, in which the precision of visual matching plays a critical role. Conventionally, only the local cues of a keypoint, e.g., its SIFT descriptor, are employed. However, this strategy ignores the contextual evidence around a keypoint, which leads to a prevalence of false matches. To address this problem and enable accurate visual matching, this paper proposes to integrate discriminative cues from multiple contextual levels, i.e., local, regional, and global, via probabilistic analysis. A "true match" is defined as a pair of keypoints corresponding to the same scene location on all three levels (Fig. 1). Specifically, a Convolutional Neural Network (CNN) is employed to extract features from regional and global patches. We show that the CNN feature is complementary to SIFT due to its semantic awareness, and that it compares favorably with several other descriptors such as GIST and HSV. To reduce memory usage, we propose to index CNN features outside the inverted file and to reference them with memory-efficient pointers. Experiments on three benchmark datasets demonstrate that integrating the CNN feature substantially improves search accuracy. Our method is efficient in terms of time cost compared with the BoW baseline, and yields accuracy competitive with the state of the art.
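To make the two ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation. All names here (the `is_true_match` helper, the cosine thresholds, and the `ExternalCNNIndex` class) are hypothetical illustrations: the paper fuses the three levels via probabilistic analysis rather than hard thresholds, and real systems would binarize or quantize the descriptors rather than store raw vectors.

```python
import numpy as np

# Hypothetical thresholds for illustration only; the paper fuses the
# three levels probabilistically rather than with hard cut-offs.
TAU_LOCAL, TAU_REGIONAL, TAU_GLOBAL = 0.8, 0.6, 0.5

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def is_true_match(q, d):
    """A keypoint pair counts as a true match only if it agrees on all
    three contextual levels: local (SIFT), regional and global (CNN).
    q and d are dicts holding the three descriptors of one keypoint."""
    return (cosine(q["sift"], d["sift"]) > TAU_LOCAL
            and cosine(q["cnn_region"], d["cnn_region"]) > TAU_REGIONAL
            and cosine(q["cnn_global"], d["cnn_global"]) > TAU_GLOBAL)

class ExternalCNNIndex:
    """Inverted file whose postings carry only an image id and a compact
    integer pointer; the (large) CNN descriptors live in a separate table
    outside the inverted file, so each feature is stored exactly once."""

    def __init__(self):
        self.inverted_file = {}  # visual word id -> [(image_id, ptr), ...]
        self.cnn_table = []      # external store of CNN features

    def add(self, word_id, image_id, cnn_feature):
        ptr = len(self.cnn_table)           # memory-efficient pointer
        self.cnn_table.append(cnn_feature)  # kept outside the postings
        self.inverted_file.setdefault(word_id, []).append((image_id, ptr))

    def lookup(self, word_id):
        """Yield (image_id, cnn_feature) for each posting of a visual word."""
        for image_id, ptr in self.inverted_file.get(word_id, []):
            yield image_id, self.cnn_table[ptr]
```

At query time, candidates retrieved from the inverted file would be verified with `is_true_match` before casting a vote, so that only keypoint pairs consistent at all three scales contribute to the ranking.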

Keywords

Image search · BoW model · Convolutional neural network · Contextual evidences


Acknowledgments

This work was supported by the National High Technology Research and Development Program of China (863 Program) under Grant No. 2012AA011004 and by the National Science and Technology Support Program under Grant No. 2013BAK02B04. It was supported in part by ARO Grant W911NF-15-1-0290 and Faculty Research Gift Awards from NEC Laboratories of America and Blippar to Dr. Qi Tian, and in part by National Science Foundation of China (NSFC) Grant 61429201.


Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. Department of Electronic Engineering, Tsinghua University, Beijing, China
  2. Media Computing Group, Microsoft Research Asia, Beijing, China
  3. University of Texas, San Antonio, USA
