Abstract
This tutorial chapter addresses the problem of master failure in ROS 1.0 and its impact on robotic deployments in the real world. We outline, design, and demonstrate a fault-tolerance mechanism for recovering from a ROS master failure. Previous solutions use primary-backup replication and external checkpointing libraries, which are resource demanding; our mechanism instead adds lightweight functionality to the ROS master itself so that it can recover from failure. We present a modified version of the ROS master equipped with a logging mechanism that records the meta information and network state of ROS nodes, and a recovery mechanism that restores the previous state without aborting or restarting all the nodes. We also implement a separate master monitor node that detects master failure by polling the master for its availability. The code is implemented in Python, and preliminary tests were conducted successfully on a variety of land, aerial, and underwater robots and a teleoperated computer running ROS Kinetic on Ubuntu 16.04. The code is publicly available under a Creative Commons license on GitHub at https://github.com/PushyamiKaveti/fault-tolerant-ros-master.
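The polling idea behind the master monitor can be illustrated with a minimal sketch. This is not the code from the repository above; it is an assumption-laden illustration of availability polling over the master's XML-RPC interface. It uses Python 3's standard library (ROS Kinetic itself targets Python 2.7), calls the Master API method `getUri`, and treats a status code of 1 as success. The names `master_is_alive`, `wait_for_failure`, `TimeoutTransport`, the caller id `/master_monitor`, and the three-miss threshold are all hypothetical.

```python
import time
import xmlrpc.client


class TimeoutTransport(xmlrpc.client.Transport):
    """XML-RPC transport whose HTTP connection uses a socket timeout."""

    def __init__(self, timeout):
        super().__init__()
        self._timeout = timeout

    def make_connection(self, host):
        conn = super().make_connection(host)
        conn.timeout = self._timeout  # applied when the socket connects
        return conn


def master_is_alive(master_uri, timeout=1.0):
    """Return True if the master at master_uri answers a getUri call."""
    proxy = xmlrpc.client.ServerProxy(
        master_uri, transport=TimeoutTransport(timeout))
    try:
        # ROS XML-RPC calls return a (code, status_message, value) triple.
        code, _msg, _value = proxy.getUri("/master_monitor")
    except (OSError, xmlrpc.client.Error, ValueError, TypeError):
        # Connection refused, timeout, or a malformed reply: treat as down.
        return False
    return code == 1


def wait_for_failure(check, max_misses=3, interval=1.0, sleep=time.sleep):
    """Poll check() until it fails max_misses times in a row.

    Returns the number of polls made. Requiring consecutive misses is an
    assumption made here to avoid reacting to a single dropped request.
    """
    misses = 0
    polls = 0
    while misses < max_misses:
        polls += 1
        misses = 0 if check() else misses + 1
        if misses < max_misses:
            sleep(interval)
    return polls
```

A monitor built this way would call `wait_for_failure(lambda: master_is_alive("http://localhost:11311"))` and trigger recovery when it returns; 11311 is the default ROS master port, and the recovery step itself is outside the scope of this sketch.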
Copyright information
© 2021 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Kaveti, P., Singh, H. (2021). ROS Rescue: Fault Tolerance System for Robot Operating System. In: Koubaa, A. (eds) Robot Operating System (ROS). Studies in Computational Intelligence, vol 895. Springer, Cham. https://doi.org/10.1007/978-3-030-45956-7_12
DOI: https://doi.org/10.1007/978-3-030-45956-7_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-45955-0
Online ISBN: 978-3-030-45956-7
eBook Packages: Intelligent Technologies and Robotics (R0)