Abstract
Vision-based deep learning perception plays a paramount role in robotics, enabling solutions to many challenging scenarios, such as acrobatic maneuvers of autonomous unmanned aerial vehicles (UAVs) and robot-assisted high-precision surgery. Control-oriented end-to-end perception approaches, which directly output control variables for the robot, commonly exploit the robot's state estimate as an auxiliary input. In mediated approaches, i.e., when intermediate outputs are estimated and fed to a lower-level controller, the robot's state is instead commonly used as an input only for egocentric tasks, which estimate physical properties of the robot itself. In this work, we propose, to the best of our knowledge for the first time, to apply a similar approach to non-egocentric mediated tasks, where the estimated outputs refer to an external subject. We show that our general methodology improves the regression performance of deep convolutional neural networks (CNNs) on a broad class of non-egocentric 3D pose estimation problems, with minimal computational cost. Across three highly different use cases, spanning from grasping with a robotic arm to following a human subject with a pocket-sized UAV, our stateful models consistently improve the R² regression metric, by up to +0.51, compared to their stateless baselines. Finally, we validate the in-field performance of a closed-loop autonomous cm-scale UAV on the human pose estimation task. Our results show a significant reduction, 24% on average, in the mean absolute error of our stateful CNN compared to a state-of-the-art stateless counterpart.
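The paper's exact network architectures are not reproduced in this excerpt; as a rough illustration of the vision-state fusion idea the abstract describes, the minimal PyTorch sketch below concatenates the robot's state estimate with CNN image features before a pose-regression head. All names and dimensions here (StatefulPoseNet, a 6-D state vector, a 4-D relative-pose output) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class StatefulPoseNet(nn.Module):
    """Illustrative vision-state fusion: the robot's state estimate is
    concatenated with CNN image features before the regression head."""

    def __init__(self, state_dim=6, feat_dim=128, out_dim=4):
        super().__init__()
        # Small convolutional backbone (a placeholder, not the paper's CNN).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim), nn.ReLU(),
        )
        # Regression head consumes image features plus the state vector.
        self.head = nn.Sequential(
            nn.Linear(feat_dim + state_dim, 64), nn.ReLU(),
            nn.Linear(64, out_dim),  # e.g., relative x, y, z, yaw
        )

    def forward(self, image, state):
        feats = self.backbone(image)
        fused = torch.cat([feats, state], dim=1)  # vision-state fusion
        return self.head(fused)

# Example: a batch of camera frames plus 6-D state estimates
# (e.g., attitude and velocity from the flight controller).
net = StatefulPoseNet()
pose = net(torch.randn(8, 3, 96, 160), torch.randn(8, 6))
print(pose.shape)  # torch.Size([8, 4])
```

Because the state enters only at the fusion layer, the added computational cost is a handful of extra weights in the first fully connected layer, consistent with the abstract's claim of minimal overhead.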
Funding
Open access funding provided by Università della Svizzera italiana. This work was partially supported by the Secure Systems Research Center (SSRC) of the UAE Technology Innovation Institute (TII) and the Swiss National Science Foundation (SNSF) through the NCCR Robotics.
Author information
Contributions
All authors contributed to the study conception and design. E.C. wrote the main manuscript text and contributed the implementation and experiments for the drone-to-human use case. S.B. contributed the drone-to-drone use case. M.N. contributed the robot arm-to-object use case. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval
This is an observational study. No ethical approval is required for this article.
Consent to participate
Informed consent was obtained from all individual participants included in the study.
Consent to publish
The authors affirm that human research participants provided informed consent for publication of the image in Fig. 1.
Competing interests
The authors have no relevant financial or non-financial interests to disclose.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Below is the link to the electronic supplementary material.
Supplementary file 1 (MP4, 51,222 KB)
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Cereda, E., Bonato, S., Nava, M. et al. Vision-state Fusion: Improving Deep Neural Networks for Autonomous Robotics. J Intell Robot Syst 110, 58 (2024). https://doi.org/10.1007/s10846-024-02091-6