Improving Image Description with Auxiliary Modality for Visual Localization in Challenging Conditions

Published in: International Journal of Computer Vision

Abstract

Image indexing for lifelong localization is a key component of a wide range of applications, including robot navigation, autonomous driving and cultural heritage valorization. The principal difficulty in long-term localization arises from the dynamic changes that affect outdoor environments over time. In this work, we propose a new approach for outdoor large-scale image-based localization that can deal with challenging scenarios such as cross-season, cross-weather and day/night localization. The key component of our method is a new learned global image descriptor that effectively benefits from scene geometry information during training. At test time, our system infers the depth map associated with the query image and uses it to increase localization accuracy. We show through extensive evaluation that our method improves localization performance, especially in challenging scenarios where the visual appearance of the scene has changed. Our method leverages both visual and geometric cues from monocular images to create discriminative descriptors for cross-season localization and for effective matching of images acquired at different time periods. It can also use weakly annotated data to localize night images against a reference dataset of daytime images. Finally, we extend our method to the reflectance modality and compare multi-modal descriptors based on geometry, material reflectance, and a combination of both.
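
To make this concrete, the sketch below illustrates the general training scheme the abstract describes: a shared encoder produces a global descriptor for retrieval, while an auxiliary head regresses a depth map from the same features, so geometric supervision shapes the descriptor. This is a minimal illustration under our own assumptions, not the authors' implementation: the toy encoder, the loss weighting and the names (GeometryAwareDescriptor, training_step) are hypothetical, and the paper's actual backbone, pooling and losses differ.

```python
# Hedged PyTorch sketch: descriptor learning with an auxiliary depth
# modality. Depth supervision is used only at training time; at test
# time the network works from the monocular RGB query alone.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometryAwareDescriptor(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Toy convolutional encoder standing in for a real backbone.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)  # global pooling -> one vector
        self.fc = nn.Linear(128, dim)
        # Auxiliary head: predicts a coarse depth map from shared features.
        self.depth_head = nn.Sequential(
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 1),
        )

    def forward(self, x):
        feats = self.encoder(x)
        desc = F.normalize(self.fc(self.pool(feats).flatten(1)), dim=1)
        depth = self.depth_head(feats)  # geometry inferred from RGB only
        return desc, depth

def training_step(model, anchor, positive, negative, gt_depth, margin=0.1):
    """Retrieval (triplet) loss plus auxiliary depth-regression loss."""
    d_a, depth_a = model(anchor)
    d_p, _ = model(positive)
    d_n, _ = model(negative)
    triplet = F.triplet_margin_loss(d_a, d_p, d_n, margin=margin)
    # Depth ground truth (e.g. from LiDAR) is needed only during training.
    target = F.interpolate(gt_depth, size=depth_a.shape[-2:])
    return triplet + F.l1_loss(depth_a, target)

# Smoke test with random tensors standing in for an image triplet.
model = GeometryAwareDescriptor()
imgs = [torch.randn(2, 3, 64, 64) for _ in range(3)]
loss = training_step(model, *imgs, gt_depth=torch.rand(2, 1, 64, 64))
loss.backward()
print(loss.item())
```

At test time, only the descriptor branch is queried for nearest-neighbour retrieval; the depth head realizes the inferred depth map mentioned above without requiring any depth sensor on the query side.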

References

  • Anoosheh, A., Agustsson, E., Timofte, R., & Van Gool, L. (2018). ComboGAN: Unrestrained scalability for image domain translation. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops (pp. 783–790).

  • Anoosheh, A., Sattler, T., Timofte, R., Pollefeys, M., & Van Gool, L. (2019). Night-to-day image translation for retrieval-based localization. In International conference on robotics and automation (ICRA) (pp. 5958–5964). IEEE.

  • Arandjelović, R., Gronat, P., Torii, A., Pajdla, T., & Sivic, J. (2017). NetVLAD: CNN architecture for weakly supervised place recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 5297–5307.

  • Arandjelović, R., & Zisserman, A. (2014). DisLocation: Scalable descriptor distinctiveness for location recognition. In Asian conference on computer vision (ACCV).

  • Ardeshir, S., Zamir, A. R., Torroella, A., & Shah, M. (2014). GIS-assisted object detection and geospatial localization. In European conference on computer vision (ECCV), LNCS (Vol. 8694, pp. 602–617).

  • Aubry, M., Russell, B. C., & Sivic, J. (2014). Painting-to-3D model alignment via discriminative visual elements. ACM Transactions on Graphics (ToG), 33(2), 1–14.

  • Azzi, C., Asmar, D., Fakih, A., & Zelek, J. (2016). Filtering 3D keypoints using GIST for accurate image-based localization. In British machine vision conference (BMVC) (Vol. 2, pp. 1–12).

  • Balntas, V., Riba, E., Ponsa, D., & Mikolajczyk, K. (2016). Learning local feature descriptors with triplets and shallow convolutional neural networks. In BMVC (Vol. 1, p. 3).

  • Bansal, A., Badino, H., & Huber, D. (2014). Understanding how camera configuration and environmental conditions affect appearance-based localization. In IEEE intelligent vehicles symposium (IV) (pp. 800–807).

  • Bevilacqua, M., Aujol, J. F., Biasutti, P., Brédif, M., & Bugeau, A. (2017). Joint inpainting of depth and reflectance with visibility estimation. ISPRS Journal of Photogrammetry and Remote Sensing, 125, 16–32.

  • Bhowmik, N., Weng, L., Gouet-Brunet, V., & Soheilian, B. (2017). Cross-domain image localization by adaptive feature fusion. In Joint urban remote sensing event (JURSE).

  • Brachmann, E. & Rother, C. (2018). Learning less is more—6D camera localization via 3D surface regression. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Cao, Y., Long, M., Wang, J., Zhu, H., & Wen, Q. (2016). Deep quantization network for efficient image retrieval. In AAAI conference on artificial intelligence.

  • Cao, Z., Long, M., Wang, J., & Yu, P. S. (2017). HashNet: Deep learning to hash by continuation. In Proceedings of the IEEE international conference on computer vision (pp. 5608–5617).

  • Chevalier, M., Thome, N., Hénaff, G., & Cord, M. (2018). Classifying low-resolution images by integrating privileged information in deep CNNs. Pattern Recognition Letters, 116, 29–35.

  • Christie, G., Warnell, G., & Kochersberger, K. (2016). Semantics for UGV registration in GPS-denied environments. arXiv:1609.04794.

  • Chum, O., Mikulík, A., Perdoch, M., & Matas, J. (2011). Total recall II: Query expansion revisited. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Chum, O., Philbin, J., Sivic, J., Isard, M., & Zisserman, A. (2007). Total recall: Automatic query expansion with a generative feature model for object retrieval. In IEEE international conference on computer vision (ICCV).

  • Deng, C., Chen, Z., Liu, X., Gao, X., & Tao, D. (2018). Triplet-based deep hashing network for cross-modal retrieval. IEEE Transactions on Image Processing, 27(8), 3893–3903.

  • Eigen, D., Puhrsch, C., & Fergus, R. (2014). Depth map prediction from a single image using a multi-scale deep network. In Annual conference on neural information processing systems (NIPS) (pp. 1–9).

  • Eitel, A., Springenberg, J. T., Spinello, L., Riedmiller, M., & Burgard, W. (2015). Multimodal deep learning for robust RGB-D object recognition. In IEEE international conference on intelligent robots and systems (IROS) (Vol. 2015, pp. 681–687).

  • Garg, S., Suenderhauf, N., & Milford, M. (2018a). Don’t look back: Robustifying place categorization for viewpoint- and condition-invariant place recognition. In IEEE international conference on robotics and automation (ICRA).

  • Garg, S., Suenderhauf, N., & Milford, M. (2018b). LoST? Appearance-invariant place recognition for opposite viewpoints using visual semantics. In Robotics science and systems (RSS).

  • Germain, H., Bourmaud, G., & Lepetit, V. (2018). Improving nighttime retrieval-based localization. arXiv:1812.03707.

  • Germain, H., Bourmaud, G., & Lepetit, V. (2019). Sparse-to-dense hypercolumn matching for long-term visual localization. In International conference on 3D vision (3DV) (pp. 513–523). IEEE.

  • Godard, C., Mac Aodha, O., & Brostow, G. J. (2017). Unsupervised monocular depth estimation with left-right consistency. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Gordo, A., Almazán, J., Revaud, J., & Larlus, D. (2016). Deep image retrieval: Learning global representations for image search. In European conference on computer vision (ECCV) (Vol. 9905, pp. 241–257).

  • Gordo, A., Almazán, J., Revaud, J., & Larlus, D. (2017). End-to-end learning of deep visual representations for image retrieval. International Journal of Computer Vision (IJCV), 124(2), 237–254.

  • Gupta, S., Girshick, R., Arbeláez, P., & Malik, J. (2014). Learning rich features from RGB-D images for object detection and segmentation. In European conference on computer vision (ECCV), LNCS (Vol. 8695, pp. 345–360).

  • Hays, J., & Efros, A. A. (2008). IM2GPS: Estimating geographic information from a single image. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

  • Hoffman, J., Gupta, S., & Darrell, T. (2016). Learning with side information through modality hallucination. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 826–834).

  • Iscen, A., Tolias, G., Avrithis, Y., & Chum, O. (2018). Mining on manifolds: Metric learning without labels. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Isola, P., Zhu, J.-Y. Y., Zhou, T., & Efros, A. A. (2017). Image-to-image translation with conditional adversarial networks. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1125–1134).

  • Jégou, H., Douze, M., & Schmid, C. (2009). On the burstiness of visual elements. In IEEE conference on computer vision and pattern recognition workshops (CVPRW) (pp. 1169–1176).

  • Jiang, Q.-Y., & Li, W.-J. (2017). Deep cross-modal hashing. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3232–3240).

  • Johnson, J., Alahi, A., & Fei-Fei, L. (2016). Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision (pp. 694–711). Springer.

  • Johnson, J., Douze, M., & Jégou, H. (2017). Billion-scale similarity search with GPUs. https://ieeexplore.ieee.org/abstract/document/8733051.

  • Kim, H. J., Dunn, E., & Frahm, J.-M. (2017). Learned contextual feature reweighting for image geo-localization. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Lai, H., Pan, Y., Liu, Y., & Yan, S. (2015). Simultaneous feature learning and hash coding with deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3270–3278).

  • Li, W., Chen, L., Xu, D., & Van Gool, L. (2018). Visual recognition in RGB images and videos by learning from RGB-D data. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 40(8), 2030–2036.

  • Liu, L., Li, H., & Dai, Y. (2019). Stochastic attraction-repulsion embedding for large scale image localization. In Proceedings of the IEEE international conference on computer vision (pp. 2570–2579).

  • Long, M., Cao, Y., Cao, Z., Wang, J., & Jordan, M. I. (2018). Transferable representation learning with deep adaptation networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(12), 3071–3085.

  • Loo, S. Y., Amiri, A. J., Mashohor, S., Tang, S. H., & Zhang, H. (2019). CNN-SVO: Improving the mapping in semi-direct visual odometry using single-image depth prediction. In IEEE international conference on robotics and automation (ICRA) (Vol. 1).

  • Lowry, S., Sünderhauf, N., Newman, P., Leonard, J. J., Cox, D., Corke, P., et al. (2016). Visual place recognition: A survey. IEEE Transactions on Robotics (TRO), 32(1), 1–19.

  • Maddern, W., Pascoe, G., Linegar, C., & Newman, P. (2016). 1 year, 1000 km: The Oxford RobotCar dataset. The International Journal of Robotics Research (IJRR), 36, 3–15.

  • Mahjourian, R., Wicke, M., & Angelova, A. (2018). Unsupervised learning of depth and ego-motion from monocular video using 3D geometric constraints. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Milford, M. J., & Wyeth, G. F. (2012). SeqSLAM: Visual route-based navigation for sunny summer days and stormy winter nights. In IEEE international conference on robotics and automation (ICRA) (pp. 1643–1649).

  • Morago, B., Bui, G., & Duan, Y. (2016). 2D matching using repetitive and salient features in architectural images. IEEE Transactions on Image Processing (ToIP), 1–12.

  • Mordan, T., Thome, N., Hénaff, G., & Cord, M. (2018). Revisiting multi-task learning with ROCK: A deep residual auxiliary block for visual detection. In Advances in neural information processing systems (pp. 1310–1322).

  • Muja, M., & Lowe, D. G. (2009). Fast approximate nearest neighbors with automatic algorithm configuration. In International conference on computer vision theory and applications (VISAPP) (pp. 1–10).

  • Naseer, T., Burgard, W., & Stachniss, C. (2018). Robust visual localization across seasons. IEEE Transactions on Robotics (TRO), 34(2), 289–302.

  • Naseer, T., Oliveira, G. L., Brox, T., & Burgard, W. (2017). Semantics-aware visual localization under challenging perceptual conditions. In IEEE international conference on robotics and automation (ICRA) (pp. 2614–2620).

  • Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision (IJCV), 42(3), 145–175.

  • Paulin, M., Mairal, J., Douze, M., Harchaoui, Z., Perronnin, F., & Schmid, C. (2017). Convolutional patch representations for image retrieval: An unsupervised approach. International Journal of Computer Vision (IJCV), 121(1), 149–168.

  • Philbin, J., Chum, O., Isard, M., Sivic, J., & Zisserman, A. (2007). Object retrieval with large vocabularies and fast spatial matching. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Piasco, N., Sidibé, D., Demonceaux, C., & Gouet-Brunet, V. (2018). A survey on visual-based localization: On the benefit of heterogeneous data. Pattern Recognition, 74, 90–109.

  • Piasco, N., Sidibé, D., Demonceaux, C., & Gouet-Brunet, V. (2019a). Geometric camera pose refinement with learned depth maps. In IEEE international conference on image processing (ICIP).

  • Piasco, N., Sidibé, D., Demonceaux, C., & Gouet-Brunet, V. (2019b). Perspective-n-learned-point: Pose estimation from relative depth. In British machine vision conference (BMVC).

  • Piasco, N., Sidibé, D., Gouet-Brunet, V., & Demonceaux, C. (2019c). Learning scene geometry for visual localization in challenging conditions. In IEEE international conference on robotics and automation (ICRA).

  • Porav, H., Bruls, T., & Newman, P. (2019). I can see clearly now: Image restoration via de-raining. In IEEE international conference on robotics and automation (ICRA).

  • Porav, H., Maddern, W., & Newman, P. (2018). Adversarial training for adverse conditions: Robust metric localisation using appearance transfer. In IEEE international conference on robotics and automation (ICRA).

  • Qi, C. R., Su, H., Mo, K., & Guibas, L. J. (2016). PointNet: Deep learning on point sets for 3D classification and segmentation. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Radenović, F., Tolias, G., & Chum, O. (2016). CNN image retrieval learns from BoW: Unsupervised fine-tuning with hard examples. In European conference on computer vision (ECCV) (Vol. 9905, pp. 3–20).

  • Radenović, F., Tolias, G., & Chum, O. (2017). Fine-tuning CNN image retrieval with no human annotation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 41, 1655–1668.

  • Russell, B. C., Sivic, J., Ponce, J., & Dessales, H. (2011). Automatic alignment of paintings and photographs depicting a 3D scene. In IEEE international conference on computer vision workshops (ICCVW).

  • Sarlin, P.-E., Cadena, C., Siegwart, R., & Dymczyk, M. (2019). From coarse to fine: Robust hierarchical localization at large scale. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Sarlin, P.-E., Debraine, F., Dymczyk, M., Siegwart, R., & Cadena, C. (2018). Leveraging deep visual descriptors for hierarchical efficient localization. In Conference on robot learning (CoRL) (pp. 1–10).

  • Sattler, T., Havlena, M., Schindler, K., & Pollefeys, M. (2016). Large-scale location recognition and the geometric burstiness problem. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Sattler, T., Maddern, W., Toft, C., Torii, A., Hammarstrand, L., Stenborg, E., et al. (2018a). Benchmarking 6DOF outdoor visual localization in changing conditions. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 8601–8610).

  • Sattler, T., Maddern, W., Torii, A., Sivic, J., Pajdla, T., Pollefeys, M., & Okutomi, M. (2018b). Benchmarking 6DOF urban visual localization in changing conditions. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Schönberger, J. L., Pollefeys, M., Geiger, A., & Sattler, T. (2018). Semantic visual localization. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).

  • Seymour, Z., Sikka, K., Chiu, H.-P., Samarasekera, S., & Kumar, R. (2019). Semantically-aware attentive neural embeddings for long-term 2D visual localization. In British Machine Vision Conference (BMVC).

  • Sharmanska, V., Quadrianto, N., & Lampert, C. H. (2013). Learning to rank using privileged information. In Proceedings of the IEEE international conference on computer vision (pp. 825–832).

  • Shotton, J., Glocker, B., Zach, C., Izadi, S., Criminisi, A., & Fitzgibbon, A. (2013). Scene coordinate regression forests for camera relocalization in RGB-D images. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2930–2937).

  • Sizikova, E., Singh, V. K., Georgescu, B., Halber, M., Ma, K., & Chen, T. (2016). Enhancing place recognition using joint intensity—depth analysis and synthetic data. In European conference on computer vision workshops (ECCVW) (pp. 1–8).

  • Stenborg, E., Toft, C., & Hammarstrand, L. (2018). Long-term visual localization using semantically segmented images. In IEEE international conference on robotics and automation (ICRA) (pp. 6484–6490). IEEE.

  • Sünderhauf, N., Shirazi, S., Jacobson, A., Dayoub, F., Pepperell, E., Upcroft, B., et al. (2015). Place recognition with ConvNet landmarks: Viewpoint-robust, condition-robust, training-free. In Robotics science and systems (RSS).

  • Tateno, K., Tombari, F., Laina, I., & Navab, N. (2017). CNN-SLAM: Real-time dense monocular SLAM with learned depth prediction. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Toft, C., Stenborg, E., Hammarstrand, L., Brynte, L., Pollefeys, M., Sattler, T., & Kahl, F. (2018). Semantic match consistency for long-term visual localization. In European conference on computer vision (ECCV).

  • Torii, A., Arandjelović, R., Sivic, J., Okutomi, M., & Pajdla, T. (2015). 24/7 place recognition by view synthesis. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Torii, A., Sivic, J., Okutomi, M., & Pajdla, T. (2013). Visual place recognition with repetitive structures. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Tzeng, E., Hoffman, J., Saenko, K., & Darrell, T. (2017). Adversarial discriminative domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7167–7176).

  • Uy, M. A. & Lee, G. H. (2018). PointNetVLAD: Deep point cloud based retrieval for large-scale place recognition. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Vapnik, V., & Vashist, A. (2009). A new learning paradigm: Learning using privileged information. Neural Networks, 22(5–6), 544–557.

  • Xu, D., Ouyang, W., Ricci, E., Wang, X., & Sebe, N. (2017). Learning cross-modal deep representations for robust pedestrian detection. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 5363–5371).

  • Zamir, A. R., & Shah, M. (2010). Accurate image localization based on Google Maps Street View. In European conference on computer vision (ECCV), LNCS (Vol. 6314, pp. 255–268).

  • Zamir, A. R., & Shah, M. (2014). Image geo-localization based on multiple nearest neighbor feature matching using generalized graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 36(8), 1546–1558.

  • Zwald, L., & Lambert-Lacroix, S. (2012). The BerHu penalty and the grouped effect. arXiv preprint arXiv:1207.6868.

Acknowledgements

We would like to acknowledge the French ANR project pLaTINUM (ANR-15-CE23-0010) for its financial support and Marco Bevilacqua for kindly sharing the code of his inpainting algorithm used in this research. We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.

Author information

Correspondence to Nathan Piasco.

Additional information

Communicated by Jifeng Dai.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Piasco, N., Sidibé, D., Gouet-Brunet, V. et al. Improving Image Description with Auxiliary Modality for Visual Localization in Challenging Conditions. Int J Comput Vis 129, 185–202 (2021). https://doi.org/10.1007/s11263-020-01363-6

Keywords

Navigation