
Autonomous Robots, Volume 42, Issue 6, pp 1169–1185

BoCNF: efficient image matching with Bag of ConvNet features for scalable and robust visual place recognition


Abstract

Recent advances in visual place recognition (VPR) have exploited ConvNet features to improve recognition accuracy under significant environmental and viewpoint changes. However, efficient image matching with high-dimensional ConvNet features remains an open problem. In this paper, we tackle the problem of matching efficiency with ConvNet features for VPR, where the task is to recognize a given place accurately and quickly in large-scale, challenging environments. The paper makes two contributions. First, we propose an efficient solution to VPR, based on the well-known bag-of-words (BoW) framework, that speeds up image matching with ConvNet features. Second, to alleviate the problem of perceptual aliasing in BoW, we adopt a coarse-to-fine approach: in the coarse stage we search for the top-K candidate images via BoW, and in the fine stage we identify the best match among the candidates using a hash-based voting scheme. We conduct extensive experiments on six popular VPR datasets to validate the effectiveness of our method. Experimental results show that, in terms of recognition accuracy, our method is comparable to linear search and outperforms other methods such as FAB-MAP and SeqSLAM by a significant margin. In terms of efficiency, our method achieves a significant speed-up over linear search, with an average matching time as low as 23.5 ms per query on a dataset with 21K images.
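To make the coarse-to-fine idea concrete, the following is a minimal, self-contained sketch of such a pipeline, not the paper's implementation: local ConvNet descriptors are quantized against an assumed pre-trained visual vocabulary, an inverted-file BoW search returns the top-K candidate images, and an LSH-style hashing-and-voting step (standing in for the paper's hash-based voting scheme) selects the final match among those candidates. The descriptor dimensionality, vocabulary, hash length, and the randomly generated database are placeholder assumptions for illustration only.

```python
# Sketch of a coarse-to-fine BoW + hash-voting matcher (placeholder data throughout).
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
DIM, VOCAB_SIZE, HASH_BITS, TOP_K = 128, 500, 32, 10

# Assumed to exist offline: a k-means vocabulary of ConvNet descriptors and
# random projections for LSH-style hashing (both random stand-ins here).
vocabulary = rng.standard_normal((VOCAB_SIZE, DIM))
projection = rng.standard_normal((DIM, HASH_BITS))

def quantize(descriptors):
    """Assign each local ConvNet descriptor to its nearest visual word."""
    dists = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

def hash_codes(descriptors):
    """Binarize descriptors with random projections; return the set of hash codes."""
    return {tuple(row) for row in (descriptors @ projection > 0)}

# ---- Offline indexing of the database images (random placeholder features) ----
database = [rng.standard_normal((30, DIM)) for _ in range(100)]
inverted_index = defaultdict(set)            # visual word -> ids of images containing it
image_hashes = [hash_codes(f) for f in database]
for img_id, feats in enumerate(database):
    for word in quantize(feats):
        inverted_index[word].add(img_id)

# ---- Online query: coarse BoW retrieval, then fine hash-based voting ----
def match(query_feats):
    # Coarse stage: vote through the inverted index and keep the top-K images.
    bow_scores = defaultdict(int)
    for word in quantize(query_feats):
        for img_id in inverted_index[word]:
            bow_scores[img_id] += 1
    candidates = sorted(bow_scores, key=bow_scores.get, reverse=True)[:TOP_K]

    # Fine stage: each query descriptor whose hash code also appears in a
    # candidate image casts a vote for that candidate; the top voter wins.
    q_hashes = hash_codes(query_feats)
    votes = {c: len(q_hashes & image_hashes[c]) for c in candidates}
    return max(votes, key=votes.get) if votes else None

# Example: a slightly perturbed copy of image 42 should be matched back to it.
query = database[42] + 0.05 * rng.standard_normal((30, DIM))
print("best match:", match(query))
```

In this sketch the coarse stage touches only the inverted-index lists for the query's visual words rather than every database image, and the fine stage compares compact binary codes instead of full high-dimensional descriptors, which is the general mechanism by which such a pipeline avoids the cost of linear search.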

Keywords

Visual place recognition · Bag of words · ConvNet feature · Image matching

Notes

Acknowledgements

We appreciate the helpful comments from reviewers. We also gratefully acknowledge the support from the Hunan Provincial Innovation Foundation for Postgraduate (CX2014B021), the Hunan Provincial Natural Science Foundation of China (2015JJ3018) and the China Scholarship Council. This research is also supported in part by the Program of Foshan Innovation Team (Grant No. 2015IT100072) and by NSFC (Grant No. 61673125).

References

  1. Arandjelovic, R., & Zisserman, A. (2013). All about VLAD. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1578–1585).
  2. Babenko, A., & Lempitsky, V. (2015). Aggregating deep convolutional features for image retrieval. In IEEE international conference on computer vision (ICCV).
  3. Badino, H., Huber, D., & Kanade, T. (2011). The CMU visual localization data set. http://3dvis.ri.cmu.edu/data-sets/localization.
  4. Bay, H., Tuytelaars, T., & Van Gool, L. (2006). SURF: Speeded up robust features. In European conference on computer vision (ECCV) (Vol. 3951, pp. 404–417).
  5. Chen, Z., Lam, O., Jacobson, A., & Milford, M. (2014). Convolutional neural network-based place recognition. In Australasian conference on robotics and automation (ACRA) (pp. 2–4).
  6. Cheng, M.-M., Zhang, Z., Lin, W.-Y., & Torr, P. (2014). BING: Binarized normed gradients for objectness estimation at 300fps. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3286–3293).
  7. Cummins, M., & Newman, P. (2011). Appearance-only SLAM at large scale with FAB-MAP 2.0. The International Journal of Robotics Research, 30(9), 1100–1123.
  8. Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 886–893).
  9. Gionis, A., Indyk, P., & Motwani, R. (1999). Similarity search in high dimensions via hashing. In International conference on very large data bases, San Francisco, CA (pp. 518–529).
  10. Glover, A., Maddern, W., Milford, M., & Wyeth, G. (2010). FAB-MAP + RatSLAM: Appearance-based SLAM for multiple times of day. In IEEE international conference on robotics and automation (ICRA) (pp. 3507–3512).
  11. Glover, A., Maddern, W., Warren, M., Reid, S., Milford, M., & Wyeth, G. (2012). OpenFABMAP: An open source toolbox for appearance-based loop closure detection. In IEEE international conference on robotics and automation (ICRA) (pp. 4730–4735).
  12. Hosang, J., Benenson, R., Dollár, P., & Schiele, B. (2016). What makes for effective detection proposals? IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(4), 814–830.
  13. Hou, Y., Zhang, H., & Zhou, S. (2015). Convolutional neural network-based image representation for visual loop closure detection. In IEEE international conference on information and automation (ICIA) (pp. 2238–2245).
  14. Hou, Y., Zhang, H., Zhou, S., & Zou, H. (2017). Efficient ConvNet feature extraction with multiple RoI pooling for landmark-based visual localization of autonomous vehicles. Mobile Information Systems (Vol. 2017) (in press).
  15. Jégou, H., Douze, M., & Schmid, C. (2008). Hamming embedding and weak geometric consistency for large scale image search. In European conference on computer vision (ECCV) (pp. 304–317).
  16. Jégou, H., Douze, M., Schmid, C., & Pérez, P. (2010). Aggregating local descriptors into a compact image representation. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3304–3311).
  17. Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional weighting for aggregated deep convolutional features. In European conference on computer vision (ECCV) (pp. 685–701).
  18. Kosecka, J., & Li, F. (2004). Vision based topological Markov localization. In IEEE international conference on robotics and automation (ICRA) (Vol. 2, pp. 1481–1486).
  19. Krizhevsky, A., Sutskever, I., & Hinton, G. (2012). ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems (NIPS) (pp. 1097–1105).
  20. Li, F., & Kosecka, J. (2006). Probabilistic location recognition using reduced feature set. In IEEE international conference on robotics and automation (ICRA) (pp. 3405–3410).
  21. Liu, Y., & Zhang, H. (2012). Visual loop closure detection with a compact image descriptor. In IEEE/RSJ international conference on intelligent robots and systems (IROS) (pp. 1051–1056).
  22. Liu, Y., & Zhang, H. (2013). Towards improving the efficiency of sequence-based SLAM. In IEEE international conference on mechatronics and automation (ICMA) (pp. 1261–1266).
  23. Liu, Y., Feng, R., & Zhang, H. (2015). Keypoint matching by outlier pruning with consensus constraint. In IEEE international conference on robotics and automation (ICRA) (pp. 5481–5486).
  24. Lowe, D. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60, 91–110.
  25. Lowry, S., Sünderhauf, N., Newman, P., Leonard, J., Cox, D., Corke, P., et al. (2016). Visual place recognition: A survey. IEEE Transactions on Robotics, 32(1), 1–19.
  26. Milford, M. (2013). Vision-based place recognition: How low can you go? The International Journal of Robotics Research, 32(7), 766–789.
  27. Milford, M., & Wyeth, G. (2012). SeqSLAM: Visual route-based navigation for sunny summer days and stormy winter nights. In IEEE international conference on robotics and automation (ICRA) (pp. 1643–1649).
  28. Naseer, T., Spinello, L., Burgard, W., & Stachniss, C. (2014). Robust visual robot localization across seasons using network flows. In AAAI conference on artificial intelligence.
  29. Neubert, P., & Protzel, P. (2015). Local region detector + CNN based landmarks for practical place recognition in changing environments. In European conference on mobile robots (ECMR) (pp. 1–6).
  30. Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3), 145–175.
  31. Pepperell, E., Corke, P., & Milford, M. (2014). All-environment visual place recognition with SMART. In IEEE international conference on robotics and automation (ICRA) (pp. 1612–1618).
  32. Perronnin, F., & Dance, C. (2007). Fisher kernels on visual vocabularies for image categorization. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1–8).
  33. Perronnin, F., Sánchez, J., & Mensink, T. (2010). Improving the Fisher kernel for large-scale image classification. In European conference on computer vision (ECCV) (pp. 143–156).
  34. Philbin, J., Chum, O., Isard, M., Sivic, J., & Zisserman, A. (2007). Object retrieval with large vocabularies and fast spatial matching. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1–8).
  35. Singh, G., & Kosecka, J. (2010). Visual loop closing using gist descriptors in Manhattan world. In IEEE international conference on robotics and automation (ICRA) omnidirectional robot vision workshop.
  36. Sivic, J., & Zisserman, A. (2003). Video Google: A text retrieval approach to object matching in videos. In IEEE international conference on computer vision (ICCV) (pp. 1470–1477).
  37. Sünderhauf, N., & Protzel, P. (2011). BRIEF-Gist—Closing the loop by simple means. In IEEE/RSJ international conference on intelligent robots and systems (IROS) (pp. 1234–1241).
  38. Sünderhauf, N., Dayoub, F., Shirazi, S., Upcroft, B., & Milford, M. (2015a). On the performance of ConvNet features for place recognition. In IEEE/RSJ international conference on intelligent robots and systems (IROS).
  39. Sünderhauf, N., Neubert, P., & Protzel, P. (2013). Are we there yet? Challenging SeqSLAM on a 3000 km journey across all four seasons. In IEEE international conference on robotics and automation (ICRA) workshop on long-term autonomy.
  40. Sünderhauf, N., Shirazi, S., Jacobson, A., Dayoub, F., Pepperell, E., Upcroft, B., & Milford, M. (2015b). Place recognition with ConvNet landmarks: Viewpoint-robust, condition-robust, training-free. In Robotics: Science and systems (RSS), Rome.
  41. Zhang, H. (2011). BoRF: Loop-closure detection with scale invariant visual features. In IEEE international conference on robotics and automation (ICRA) (pp. 3125–3130).
  42. Zhang, H., Han, F., & Wang, H. (2016). Robust multimodal sequence-based loop closure detection via structured sparsity. In Robotics: Science and systems (RSS).
  43. Zheng, L., Yang, Y., & Tian, Q. (2016). SIFT meets CNN: A decade survey of instance retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence (vol. PP, no. 99, pp. 1–1).
  44. Zitnick, C. L., & Dollár, P. (2014). Edge boxes: Locating object proposals from edges. In European conference on computer vision (ECCV) (pp. 391–405).

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2017

Authors and Affiliations

  1. College of Electronic Science and Engineering, National University of Defense Technology, Changsha, People’s Republic of China
  2. Department of Computing Science, University of Alberta, Edmonton, Canada
