International Journal of Computer Vision, Volume 121, Issue 1, pp 149–168

Convolutional Patch Representations for Image Retrieval: An Unsupervised Approach

  • Mattis Paulin
  • Julien Mairal
  • Matthijs Douze
  • Zaid Harchaoui
  • Florent Perronnin
  • Cordelia Schmid

Abstract

Convolutional neural networks (CNNs) are able to model local stationary structures in natural images in a multi-scale fashion when all model parameters are learned with supervision. While they have achieved excellent performance for image classification when large amounts of labeled visual data are available, their success for unsupervised tasks such as image retrieval has been moderate so far. Our paper focuses on this latter setting and explores several methods for learning patch descriptors without supervision, with application to matching and instance-level retrieval. To that effect, we propose a new family of patch representations based on the recently introduced convolutional kernel networks. We show that our descriptor, named Patch-CKN, performs better than SIFT as well as other convolutional networks learned by artificially introducing supervision, and is significantly faster to train. To demonstrate its effectiveness, we perform an extensive evaluation on standard benchmarks for patch and image retrieval, where we obtain state-of-the-art results. We also introduce a new dataset called RomePatches, which makes it possible to study descriptor performance for patch and image retrieval simultaneously.

Keywords

Low-level image description · Instance-level retrieval · Convolutional neural networks

Notes

Acknowledgments

This work was partially supported by projects “Allegro” (ERC), “Titan” (CNRS-Mastodons), “Macaron” (ANR-14-CE23-0003-01), CIFAR, and a Xerox Research Center Europe collaboration contract. We wish to thank Fischer et al. (2014), Babenko et al. (2014), Gong et al. (2014), Hervé Jégou, and Ben Recht for their helpful discussions and comments. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPUs used for this research.

References

  1. Agrawal, P., Carreira, J., Malik, J. (2015). Learning to see by moving. In IEEE conference on computer vision and pattern recognition.
  2. Arandjelovic, R., Zisserman, A. (2013). All about VLAD. In IEEE conference on computer vision and pattern recognition.
  3. Babenko, A., Lempitsky, V. (2015). Aggregating deep convolutional features for image retrieval. In International conference on computer vision.
  4. Babenko, A., Slesarev, A., Chigorin, A., Lempitsky, V. (2014). Neural codes for image retrieval. In European conference on computer vision.
  5. Bach, F. R., & Jordan, M. I. (2002). Kernel independent component analysis. Journal of Machine Learning Research, 3, 1–48.
  6. Bay, H., Tuytelaars, T., Van Gool, L. (2006). SURF: Speeded up robust features. In European conference on computer vision.
  7. Bo, L., Ren, X., Fox, D. (2010). Kernel descriptors for visual recognition. Advances in Neural Information Processing Systems.
  8. Bottou, L. (2012). Stochastic gradient descent tricks. In Neural networks: Tricks of the trade. Berlin: Springer.
  9. Brown, M., Hua, G., & Winder, S. (2011). Discriminative learning of local image descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33, 43–57.
  10. Calonder, M., Lepetit, V., Strecha, C., Fua, P. (2010). BRIEF: Binary robust independent elementary features. In European conference on computer vision.
  11. Chopra, S., Hadsell, R., LeCun, Y. (2005). Learning a similarity metric discriminatively, with application to face verification. In IEEE conference on computer vision and pattern recognition.
  12. Coates, A., & Ng, A. Y. (2012). Learning feature representations with k-means. In Neural networks: Tricks of the trade. Heidelberg: Springer.
  13. Cucker, F., & Zhou, D. X. (2007). Learning theory: An approximation theory viewpoint. Cambridge Monographs on Applied and Computational Mathematics. Cambridge: Cambridge University Press.
  14. Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In IEEE conference on computer vision and pattern recognition.
  15. Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T. (2014). DeCAF: A deep convolutional activation feature for generic visual recognition. In International conference on machine learning.
  16. Dong, J., Soatto, S. (2015). Domain-size pooling in local descriptors: DSP-SIFT. In IEEE conference on computer vision and pattern recognition.
  17. Dosovitskiy, A., Springenberg, J. T., Riedmiller, M., Brox, T. (2014). Discriminative unsupervised feature learning with convolutional neural networks. Advances in Neural Information Processing Systems.
  18. Erhan, D., Manzagol, P. A., Bengio, Y., Bengio, S., Vincent, P. (2009). The difficulty of training deep architectures and the effect of unsupervised pre-training. In Twelfth international conference on artificial intelligence and statistics.
  19. Erhan, D., Bengio, Y., Courville, A., Manzagol, P. A., Vincent, P., & Bengio, S. (2010). Why does unsupervised pre-training help deep learning? The Journal of Machine Learning Research, 11, 625–660.
  20. Fischer, P., Dosovitskiy, A., Brox, T. (2014). Descriptor matching with convolutional neural networks: A comparison to SIFT. arXiv preprint.
  21. Gong, Y., Wang, L., Guo, R., Lazebnik, S. (2014). Multi-scale orderless pooling of deep convolutional activation features. In European conference on computer vision.
  22. Goroshin, R., Bruna, J., Tompson, J., Eigen, D., LeCun, Y. (2014). Unsupervised feature learning from temporal data. In Advances in Neural Information Processing Systems.
  23. Goroshin, R., Mathieu, M., LeCun, Y. (2015). Learning to linearize under uncertainty. In Advances in Neural Information Processing Systems.
  24. Jayaraman, D., Grauman, K. (2015). Learning image representations equivariant to ego-motion. In IEEE conference on computer vision and pattern recognition.
  25. Jégou, H., Chum, O. (2012). Negative evidences and co-occurrences in image retrieval: The benefit of PCA and whitening. In European conference on computer vision.
  26. Jégou, H., Douze, M., Schmid, C. (2008). Hamming embedding and weak geometric consistency for large scale image search. In European conference on computer vision.
  27. Jégou, H., Douze, M., Schmid, C., Pérez, P. (2010). Aggregating local descriptors into a compact image representation. In IEEE conference on computer vision and pattern recognition.
  28. Jégou, H., Douze, M., & Schmid, C. (2011). Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1), 117–128.
  29. Jégou, H., Perronnin, F., Douze, M., Sánchez, J., Pérez, P., & Schmid, C. (2012). Aggregating local image descriptors into compact codes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(9), 1704–1716.
  30. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. In ACM multimedia conference.
  31. Jiang, W., Song, Y., Leung, T., Rosenberg, C., Wang, J., Philbin, J., Chen, B., Wu, Y. (2014). Learning fine-grained image similarity with deep ranking. In IEEE conference on computer vision and pattern recognition.
  32. Krizhevsky, A., Sutskever, I., Hinton, G. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems.
  33. LeCun, Y., Boser, B., Denker, J., Henderson, D., Howard, R., Hubbard, W., Jackel, L. (1989). Handwritten digit recognition with a back-propagation network. Advances in Neural Information Processing Systems.
  34. LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
  35. Li, Y., Snavely, N., Huttenlocher, D. P. (2010). Location recognition using prioritized feature matching. In European conference on computer vision.
  36. Long, J., Zhang, N., Darrell, T. (2014). Do convnets learn correspondence? Advances in Neural Information Processing Systems.
  37. Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.
  38. Mairal, J., Bach, F., Ponce, J. (2014a). Sparse modeling for image and vision processing. Foundations and Trends in Computer Graphics and Vision.
  39. Mairal, J., Koniusz, P., Harchaoui, Z., Schmid, C. (2014b). Convolutional kernel networks. Advances in Neural Information Processing Systems.
  40. Mikolajczyk, K., & Schmid, C. (2004). Scale & affine invariant interest point detectors. International Journal of Computer Vision, 60(1), 63–86.
  41. Mikolajczyk, K., & Schmid, C. (2005). A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(10), 1615–1630.
  42. Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., Schaffalitzky, F., et al. (2005). A comparison of affine region detectors. International Journal of Computer Vision, 65, 43–72.
  43. Ng, J. Y. H., Yang, F., Davis, L. S. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision workshop.
  44. Nister, D., Stewenius, H. (2006). Scalable recognition with a vocabulary tree. In IEEE conference on computer vision and pattern recognition.
  45. Paulin, M., Douze, M., Harchaoui, Z., Mairal, J., Perronnin, F., Schmid, C. (2015). Local convolutional features with unsupervised training for image retrieval. In International conference on computer vision.
  46. Perd’och, M., Chum, O., Matas, J. (2009). Efficient representation of local geometry for large scale object retrieval. In IEEE conference on computer vision and pattern recognition.
  47. Perronnin, F., Dance, C. (2007). Fisher kernels on visual vocabularies for image categorization. In IEEE conference on computer vision and pattern recognition.
  48. Perronnin, F., Sánchez, J., Liu, Y. (2010). Large-scale image categorization with explicit data embedding. In IEEE conference on computer vision and pattern recognition.
  49. Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A. (2007). Object retrieval with large vocabularies and fast spatial matching. In IEEE conference on computer vision and pattern recognition.
  50. Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A. (2008). Lost in quantization: Improving particular object retrieval in large scale image databases. In IEEE conference on computer vision and pattern recognition.
  51. Philbin, J., Isard, M., Sivic, J., Zisserman, A. (2010). Descriptor learning for efficient retrieval. In European conference on computer vision.
  52. Rahimi, A., Recht, B. (2008). Random features for large-scale kernel machines. Advances in Neural Information Processing Systems.
  53. Razavian, A. S., Azizpour, H., Sullivan, J., Carlsson, S. (2014). CNN features off-the-shelf: An astounding baseline for recognition. arXiv preprint arXiv:1403.6382.
  54. Schölkopf, B., & Smola, A. J. (2002). Learning with kernels: Support vector machines, regularization, optimization, and beyond. Cambridge: MIT Press.
  55. Simo-Serra, E., Trulls, E., Ferraz, L., Kokkinos, I., Moreno-Noguer, F. (2015). Discriminative learning of deep convolutional feature point descriptors. In International conference on computer vision.
  56. Simonyan, K., Vedaldi, A., & Zisserman, A. (2014). Learning local feature descriptors using convex optimisation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36, 1573–1585.
  57. Tola, E., Lepetit, V., & Fua, P. (2010). Daisy: An efficient dense descriptor applied to wide-baseline stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(5), 815–830.
  58. Tolias, G., Sicre, R., Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations. arXiv preprint arXiv:1511.05879.
  59. Tuytelaars, T., & Mikolajczyk, K. (2008). Local invariant feature detectors: A survey. Foundations and Trends in Computer Graphics and Vision, 3(3), 177–280.
  60. Vedaldi, A., & Zisserman, A. (2012). Efficient additive kernels via explicit feature maps. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(3), 480–492.
  61. Wang, Z., Fan, B., Wu, F. (2011). Local intensity order pattern for feature description. In International conference on computer vision.
  62. Williams, C., Seeger, M. (2001). Using the Nyström method to speed up kernel machines. Advances in Neural Information Processing Systems.
  63. Winder, S., Hua, G., Brown, M. (2009). Picking the best Daisy. In IEEE conference on computer vision and pattern recognition.
  64. Yosinski, J., Clune, J., Bengio, Y., Lipson, H. (2014). How transferable are features in deep neural networks? Advances in Neural Information Processing Systems.
  65. Zagoruyko, S., Komodakis, N. (2015). Learning to compare image patches via convolutional neural networks. In IEEE conference on computer vision and pattern recognition.
  66. Zbontar, J., LeCun, Y. (2015). Computing the stereo matching cost with a convolutional neural network. In IEEE conference on computer vision and pattern recognition.

Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. Thoth team, Laboratoire Jean Kuntzmann, Inria Grenoble Rhône-Alpes, Montbonnot-Saint-Martin, France
  2. Facebook AI Research, Menlo Park, USA
  3. New York University, New York, USA
