Abstract
Visual-inertial odometry (VIO) is the pose estimation backbone for most AR/VR and autonomous robotic systems today, in both academia and industry. However, these systems are highly sensitive to the initialization of key parameters such as sensor biases, gravity direction, and metric scale. In practical scenarios where high-parallax or variable-acceleration assumptions are rarely met (e.g., a hovering aerial robot, or a smartphone AR user who is not gesturing with the phone), classical visual-inertial initialization formulations often become ill-conditioned and/or fail to meaningfully converge. In this paper we target visual-inertial initialization specifically for these low-excitation scenarios critical to in-the-wild usage. We propose to circumvent the limitations of classical visual-inertial structure-from-motion (SfM) initialization by incorporating a new learning-based measurement as a higher-level input: we leverage learned monocular depth images (mono-depth) to constrain the relative depth of features, and upgrade the mono-depths to metric scale by jointly optimizing for their scales and shifts. Our experiments show a significant improvement in problem conditioning compared to a classical formulation for visual-inertial initialization, and demonstrate significant accuracy and robustness improvements relative to the state of the art on public benchmarks, particularly under low-excitation scenarios. We further integrate our initialization method into an existing odometry system to illustrate its impact on the resulting tracking trajectories.
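To make the scale-and-shift upgrade concrete, the sketch below fits a per-image affine correction z ≈ a·d + b between a mono-depth image d and sparse metric depth estimates (e.g., from visual-inertial triangulation) via closed-form least squares. This is an illustrative simplification under stated assumptions, not the paper's formulation: the paper optimizes the scales and shifts jointly with the other initialization states inside a nonlinear solver, and the function and variable names here are hypothetical.

```python
import numpy as np

def fit_scale_shift(mono_depth, metric_depth):
    """Least-squares affine upgrade of mono-depth to metric scale.

    mono_depth:   1-D array of learned (up-to-scale-and-shift) depths
                  sampled at tracked feature locations in one image.
    metric_depth: 1-D array of sparse metric depth estimates for the
                  same features (e.g., from VI triangulation).
    Returns (a, b) such that a * mono_depth + b best matches
    metric_depth in the least-squares sense.
    """
    # Build the linear system [d, 1] @ [a, b]^T = z and solve it.
    A = np.stack([mono_depth, np.ones_like(mono_depth)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, metric_depth, rcond=None)
    return a, b

# Hypothetical usage on synthetic data: true scale 2.5, shift 0.3.
rng = np.random.default_rng(0)
d = rng.uniform(0.5, 5.0, size=50)            # mono-depth samples
z = 2.5 * d + 0.3 + rng.normal(0, 0.02, 50)   # noisy metric depths
a, b = fit_scale_shift(d, z)
print(f"scale={a:.3f}, shift={b:.3f}")        # approx. 2.5 and 0.3
```

In practice a robust loss and joint optimization with the inertial residuals, as described in the paper, would replace this standalone closed-form fit.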
Acknowledgements
We thank Josh Hernandez and Maksym Dzitsiuk for their support in developing our real-time system implementation.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zhou, Y. et al. (2022). Learned Monocular Depth Priors in Visual-Inertial Initialization. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13682. Springer, Cham. https://doi.org/10.1007/978-3-031-20047-2_32
DOI: https://doi.org/10.1007/978-3-031-20047-2_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20046-5
Online ISBN: 978-3-031-20047-2
eBook Packages: Computer Science, Computer Science (R0)