Implementation of Q learning and deep Q network for controlling a self balancing robot model
Abstract
This paper discusses the implementation of two reinforcement learning algorithms, Q learning and deep Q network (DQN), on the Gazebo model of a self-balancing robot. The goal of the experiments is to make the robot model learn the best actions for staying balanced in an environment. The longer the robot remains within a specified limit, the more reward it accumulates and hence the more balanced it is. We ran tests with a range of hyperparameters and present the resulting performance curves.
Introduction
Control systems are one of the most critical aspects of robotics research. Gazebo is one of the most robust multi-robot simulators at present, and the ability to use the Robot Operating System (ROS) with Gazebo makes it more powerful. However, there is very little documentation on how to use ROS and Gazebo for controller development. In our previous paper [1], we attempted to demonstrate and document the use of PID, Fuzzy logic and LQR controllers using ROS and Gazebo on a self-balancing robot model. We have since worked on reinforcement learning, and this paper presents the implementation of Q Learning and Deep Q Network on the same model. The paper is structured as follows. The “Related works” section reviews related work on the subject. The “Robot model” section discusses the robot model. The “Reinforcement learning methods as controllers” section shows the implementation of Q Learning and DQN as controllers. Finally, the “Conclusion and future work” section concludes the paper.
Related works
Lei Tai and Ming Liu [2] worked on mobile robot exploration using CNN-based reinforcement learning. They trained and simulated a TurtleBot in Gazebo to develop an exploration strategy based on raw sensor values from the RGB-D sensor. The company Erle Robotics has extended the OpenAI Gym environment to Gazebo [3] and deployed Q-learning and Sarsa algorithms for various exploratory environments. Loc Tran et al. [4] developed a training model for an unmanned aerial vehicle to explore among static obstacles in both Gazebo and the real world, but the reinforcement learning method they propose is unclear from the paper. Volodymyr Sereda [5] used Q-learning for an exploration strategy on a custom Gazebo model using ROS. Rowan Border [6] used Q-learning with a neural network representation for robot search and rescue using ROS and TurtleBot.
Robot model
Controller
Reinforcement learning methods as controllers
Previously, we worked on traditional controllers such as PID, Fuzzy PD, PD+I and LQR [1]. The biggest problem with those methods is that they need to be tuned manually, so reaching optimal controller values depends on much trial and error, and often the optimum values are not achieved at all. The biggest benefit of reinforcement learning algorithms as controllers is that the model tunes itself to reach the optimum values. The following two sections discuss Q Learning and Deep Q Network (Additional file 1).
Q learning
Algorithm
The objective in our project is to keep the robot model's pitch angle within limits, i.e., ± 5°. At first, the robot model, the Q matrix and the policy \(\pi\) are initialized. A few points are worth noting. The states are not finite: within the limit range, hundreds or thousands of pitch angles are possible, and a Q matrix with thousands of columns is not practical. So, we discretized the state values into 20 state angles from − 10° to 10°. For the action values, we chose ten different velocities, [− 200, − 100, − 50, − 25, − 10, 10, 25, 50, 100, 200] ms^{−1}. The Q matrix had 20 columns, each representing a state, and ten rows, each representing an action. Initially, the Q-values were set to 0, and a random action was specified for every state in the policy \(\pi\). We trained for 1500 episodes, each episode having 2000 iterations. At the beginning of each episode, the simulation was reset. Whenever the robot’s state exceeded the limit, it was penalized with a reward of \(-100\). The Q table is updated at each step according to Eq. 1. Algorithm 1 shows the full algorithm (Additional file 3).
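The discretization and table update described above can be sketched as follows. This is a minimal illustration and not the authors' code: it uses the standard Q-learning update [8], \(Q(s,a) \leftarrow Q(s,a) + \alpha\,[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]\), which Eq. 1 presumably corresponds to. The uniform binning of the pitch angle, the values of the learning rate `ALPHA` and discount factor `GAMMA`, and the function names `discretize` and `q_update` are all assumptions made for the example. The penalty is taken as −100, which assumes the sign of the reward.

```python
import numpy as np

# Values taken from the text: 20 pitch-angle states over [-10, 10] degrees
# and ten discrete velocity actions.
N_STATES = 20
ANGLE_MIN, ANGLE_MAX = -10.0, 10.0
ACTIONS = [-200, -100, -50, -25, -10, 10, 25, 50, 100, 200]

# Assumed hyperparameters (not given in this section).
ALPHA = 0.1   # learning rate
GAMMA = 0.9   # discount factor

# Q matrix laid out as in the text: one row per action, one column per state.
Q = np.zeros((len(ACTIONS), N_STATES))


def discretize(pitch_deg):
    """Map a pitch angle in degrees to a state index in 0..N_STATES-1.

    Uniform bins are an assumption; angles outside the range are clipped.
    """
    clipped = min(max(pitch_deg, ANGLE_MIN), ANGLE_MAX)
    width = (ANGLE_MAX - ANGLE_MIN) / N_STATES
    return min(int((clipped - ANGLE_MIN) / width), N_STATES - 1)


def q_update(q, state, action, reward, next_state, alpha=ALPHA, gamma=GAMMA):
    """Apply one standard Q-learning update in place and return the new value."""
    td_target = reward + gamma * q[:, next_state].max()
    q[action, state] += alpha * (td_target - q[action, state])
    return q[action, state]
```

For example, `q_update(Q, discretize(4.8), 3, -100, discretize(5.6))` lowers the value of action index 3 in the state containing 4.8°, which makes that action less likely to be chosen greedily in that state later on.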
Result and discussion
Deep Q network (DQN)

Experience Replay

Derivation of Q Values in one forward pass (Additional file 5).
Experience replay

It allows greater data efficiency, as each step of experience can be used in many weight updates.

Randomizing batches breaks correlations between samples.

Behaviour distribution is averaged over many of its previous states.
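A replay memory with these properties can be sketched in a few lines. This is an illustrative sketch only: the class name `ReplayBuffer`, the transition layout `(state, action, reward, next_state, done)` and the uniform sampling scheme are assumptions made for the example, not details taken from the implementation discussed in this paper.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity):
        # A bounded deque drops the oldest transition once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Return a uniformly random minibatch, drawn without replacement."""
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

Because every stored transition can be drawn again in later minibatches, each step of experience contributes to many weight updates, and the random draw mixes transitions from different points in time, which is what breaks the correlation between consecutive samples.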
Derivation of Q values in one forward pass
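As a rough illustration of this idea: rather than feeding a state-action pair to the network once per action, a DQN-style network [9] takes only the state as input and outputs one Q value per action, so a single forward pass is enough to pick the greedy action. The sketch below uses a small NumPy network with random weights. The layer sizes, the choice of the pitch angle as the only input, the reuse of ten actions from the Q learning section, and the names `q_values` and `greedy_action` are assumptions for the example and do not describe the architecture used in this paper.

```python
import numpy as np

N_ACTIONS = 10   # assumed: the ten discrete velocities from the Q learning section
STATE_DIM = 1    # assumed: the pitch angle only
HIDDEN = 16      # assumed hidden-layer width

rng = np.random.default_rng(0)
# Randomly initialised weights of a hypothetical two-layer network.
W1 = rng.normal(scale=0.1, size=(STATE_DIM, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(scale=0.1, size=(HIDDEN, N_ACTIONS))
b2 = np.zeros(N_ACTIONS)


def q_values(state):
    """Return Q(state, a) for every action a from a single forward pass."""
    x = np.atleast_2d(np.asarray(state, dtype=float))  # shape (batch, STATE_DIM)
    hidden = np.maximum(0.0, x @ W1 + b1)              # ReLU activation
    return hidden @ W2 + b2                            # shape (batch, N_ACTIONS)


def greedy_action(state):
    """Index of the action with the largest predicted Q value."""
    return int(np.argmax(q_values(state)[0]))
```

With this layout, the cost of choosing an action does not grow with the number of forward passes per action, since all ten Q values come out of the same pass.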
Implementation on the robot model
Architecture
Training
Comparison to traditional methods
In our previous paper [1], we evaluated the performance of PID, Fuzzy Logic and LQR on a self-balancing robot model and compared those controllers. Figure 9 shows the performance curves for PID, Fuzzy P, LQR and DQN. The LQR and Fuzzy controllers were not as stable as PID, although we had to tune all of them manually. The DQN performance curves are more stable than those of Fuzzy P and LQR, but less stable than that of PID. There are two possible reasons for this. The first is that the PID algorithm produces continuous action values, while our architecture is designed for discrete values. The second is that the reward function for this architecture only limits the pitch angle to between − 5° and 5°; narrowing that range would help the architecture perform better (Additional file 8).
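The effect of a discrete action set can be illustrated with a small example. A continuous controller can output any velocity command, whereas a controller restricted to a fixed set can only approximate it. The sketch below reuses the ten velocities listed in the Q learning section; whether the DQN used exactly this set is an assumption, and the helper names are hypothetical.

```python
# Ten discrete velocities from the Q learning section (assumed to apply here too).
ACTIONS = [-200, -100, -50, -25, -10, 10, 25, 50, 100, 200]


def nearest_discrete_action(command):
    """Return the discrete velocity closest to a continuous command."""
    return min(ACTIONS, key=lambda a: abs(a - command))


def quantization_error(command):
    """Absolute gap between a continuous command and its discrete stand-in."""
    return abs(nearest_discrete_action(command) - command)
```

Under these assumptions, a command of 60 would be replaced by 50, and a command of 0 cannot be represented at all: the closest available velocities are ± 10. A discrete controller therefore always applies a nonzero correction near the upright position, which is one plausible source of the residual oscillation compared with PID.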
Conclusion and future work
This paper showed the implementation of Q Learning and Deep Q Network as controllers for the Gazebo robot model and presented the details of the algorithms. However, some further improvements can be made. It was assumed that the robot works in a Markovian state space, which is generally not the case. In general, inverted pendulum models are non-Markovian, so there must be some dependencies among the states. Recurrent neural networks are therefore a promising direction for future work. Moreover, ten predefined velocity values were used as actions. In real-world applications, action values have a continuous range, so this method may not work for more complex models. In that case, deep reinforcement learning algorithms with continuous action spaces, such as the Actor-Critic reinforcement learning algorithm [10], can be used. Finally, this work should be extended toward real-world scenarios.
Notes
Authors' contributions
The original project comprises this paper and [1]. MDMR carried out the simulations and wrote this paper. SMHR and MMH reviewed both papers. All authors read and approved the final manuscript.
Competing Interests
The authors declare that they have no competing interests.
Funding
The paper has no external source of funding.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary material
References
1. Rahman MDM, Rashid SMH, Hassan KMR, Hossain MM. Comparison of different control theories on a two wheeled self balancing robot. In: AIP Conference Proceedings. 2018;1980(1):060005. https://aip.scitation.org/doi/abs/10.1063/1.5044373.
2. Tai L, Liu M. Mobile robots exploration through CNN-based reinforcement learning. Robot. Biomim. 2016;3(1):24. https://doi.org/10.1186/s406380160055x.
3. Zamora I, Lopez NG, Vilches VM, Cordero AH. Extending the OpenAI Gym for robotics: a toolkit for reinforcement learning using ROS and Gazebo. CoRR. 2016;abs/1608.05742. http://arxiv.org/abs/1608.05742.
4. Tran LD, Cross CD, Motter MA, Neilan JH, Qualls G, Rothhaar PM, Trujillo A, Allen BD. Reinforcement learning with autonomous small unmanned aerial vehicles in cluttered environments. In: 15th AIAA Aviation Technology, Integration, and Operations Conference; Jun 2015. https://doi.org/10.2514/6.20152899.
5. Sereda V. Machine learning for robots with ROS. Master’s thesis. Maynooth University, Maynooth, Co. Kildare; 2017.
6. Border R. Learning to save lives: using reinforcement learning with environment features for efficient robot search. White paper. University of Oxford; 2015.
7. Watkins CJ. Learning from delayed rewards. Ph.D. dissertation. King’s College, London; May 1989.
8. Watkins CJCH, Dayan P. Q-learning. Mach Learn. 1992;8(3):279–292. https://doi.org/10.1007/BF00992698.
9. Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, Riedmiller MA. Playing Atari with deep reinforcement learning. CoRR. 2013;abs/1312.5602. http://arxiv.org/abs/1312.5602.
10. Mnih V, Badia AP, Mirza M, Graves A, Lillicrap T, Harley T, Silver D, Kavukcuoglu K. Asynchronous methods for deep reinforcement learning. In: Balcan MF, Weinberger KQ, editors. Proceedings of the 33rd International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 48. New York, New York, USA: PMLR; 20–22 Jun 2016. p. 1928–1937. http://proceedings.mlr.press/v48/mniha16.html.
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.