DM-DQN: Dueling Munchausen deep Q network for robot path planning

In order to achieve collision-free path planning in complex environments, the Munchausen deep Q-learning network (M-DQN) is applied to a mobile robot to learn the best decision. On the basis of Soft-DQN, M-DQN adds the scaled log-policy to the immediate reward, which allows the agent to explore more. However, the M-DQN algorithm suffers from slow convergence. An improved M-DQN algorithm (DM-DQN) is proposed in this paper to address this problem. First, its network structure is improved on the basis of M-DQN by decomposing the network into a value function and an advantage function, thus decoupling action selection and action evaluation, speeding up convergence, giving it better generalization performance and enabling it to learn the best decision faster. Second, to address the problem of the robot's trajectory passing too close to the edges of obstacles, a reward function based on an artificial potential field is proposed to drive the robot's trajectory away from the vicinity of obstacles. Simulation results show that the method learns more efficiently and converges faster than DQN, Dueling DQN and M-DQN in both static and dynamic environments, and is able to plan collision-free paths away from obstacles.


Introduction
With the development of artificial intelligence, the robot industry is also developing towards the intelligent direction of self-learning and self-exploration [1]. Traditional path planning algorithms, such as the artificial potential field algorithm [2,3], the ant colony algorithm [4], the genetic algorithm [5] and the particle swarm algorithm [6], need information about the whole environment and therefore can no longer meet people's needs.
For this problem, deep reinforcement learning (DRL) has been proposed [7,8]. DRL combines deep learning (DL) [9] with reinforcement learning (RL) [10]. Deep learning focuses on extracting features from the unknown environmental states given as input, using a neural network to fit the mapping between environmental states and the action value function. Reinforcement learning then makes decisions based on the output of the deep neural network and the exploration strategy, thus enabling the mapping from states to actions. The combination of deep learning and reinforcement learning solves the dimensional catastrophe problem posed by state-to-action mapping [11] and better meets the needs of robot movement in complex environments.

By setting a reward function based on an artificial potential field, the robot's planned path is kept away from the vicinity of obstacles. Moving obstacles in unknown environments can be a huge challenge for mobile robots, as they can negatively affect the range of the sensors. Therefore, not only the path planning problem in a static obstacle environment is studied, but also the path planning problem in a mixed dynamic and static obstacle environment is considered. Finally, the DM-DQN algorithm is applied to mobile robot path planning and compared with the DQN, Dueling DQN and M-DQN algorithms.

A summary of the key contributions of the paper is as follows:

• A virtual simulation environment has been constructed using the Gazebo physical simulation platform, replacing the traditional raster map. The physical simulation platform is a simplified model of the real world that is closer to the real environment than a raster map, reducing the gap between the virtual and real environments and reflecting whether the strategies learned by the agent will ultimately be of value to the real robot problem.

• The network structure of the M-DQN is decomposed into a value function and an advantage function, thus decoupling action selection and action evaluation, so that the state no longer depends entirely on the value of the action to make a judgment, allowing for separate value prediction. By removing the influence of the state on decision making, the nuances between actions are brought out more clearly, allowing for faster convergence and better generalization of the model.

• The negative impact of obstacles is considered, and an artificial potential field is used to set up a reward function that balances obstacle avoidance against approaching the target, allowing the robot to plan a path away from the vicinity of obstacles.

The structure of the paper is as follows: "Theoretical background" introduces the mobile robot model; "Proposed algorithm" introduces the proposed DM-DQN algorithm in detail; "Materials and methods" describes the simulation environment and performs an experimental comparison; and "Experiments and results" concludes the paper.

In the classical Q-learning algorithm, the iteration of the q-function can be expressed by the following formula:

q_{t+1}(s_t, a_t) = q_t(s_t, a_t) + \eta \left[ r_t + \gamma \max_{a} q_t(s_{t+1}, a) - q_t(s_t, a_t) \right]   (1)

where s_t denotes the state at time t, a_t denotes the action at time t, \eta denotes the ratio coefficient (learning rate), r_t denotes the reward at time t, s_{t+1} denotes the state at time t + 1, a denotes the next action and \gamma denotes the discount factor. However, q^* is unknown in practice, so the value function q_t of the current strategy can only be used to replace the value function q^* of the optimal strategy, a process often referred to as bootstrapping. In short, q_t leads its own update to q_{t+1} through its current value.
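As a minimal illustration of this bootstrapped update (a sketch only, not the implementation used in the paper; the state and action counts and the hyperparameter values are assumptions):

```python
import numpy as np

# Minimal sketch of the tabular Q-learning update in Eq. (1).
# n_states, n_actions, eta and gamma are illustrative values, not the paper's.
n_states, n_actions = 5, 3
eta, gamma = 0.1, 0.99                 # learning rate and discount factor
q = np.zeros((n_states, n_actions))

def q_update(s_t, a_t, r_t, s_next):
    """One bootstrapped update: the current estimate q builds its own target."""
    td_target = r_t + gamma * np.max(q[s_next])      # r_t + gamma * max_a q_t(s_{t+1}, a)
    q[s_t, a_t] += eta * (td_target - q[s_t, a_t])   # move q_t(s_t, a_t) towards the target

q_update(s_t=0, a_t=1, r_t=1.0, s_next=2)
```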

In M-DQN, a guiding signal called the "log-policy", which is different from q_t, is proposed; that is, the logarithm of the policy's probability is used. Since there is an argmax operation in Q-learning, the optimal strategy is deterministic, so the probability is 1 for each optimal action and 0 for all other non-optimal actions. After taking the logarithm, the log-probability of the optimal action becomes zero, while that of the remaining non-optimal actions becomes negative infinity. This is certainly a stronger guiding signal.

However, the value of \ln\pi(a_t|s_t) is not computable in Q-learning, so the same maximum entropy strategy as in the Soft-AC algorithm is introduced into DQN, which becomes Soft-DQN. In Soft-DQN, not only the return value from the environment is maximized, but also the entropy of the strategy, and the regression objective of Soft-DQN is expressed as

\hat{q}_{soft-dqn}(s_t, a_t) = r_t + \gamma \sum_{a' \in A} \pi_\theta(a'|s_{t+1}) \left( q_\theta(s_{t+1}, a') - \tau \ln \pi_\theta(a'|s_{t+1}) \right)   (2)

where s denotes the state, a denotes the action, r denotes the reward value, and \gamma denotes the discount factor. \pi_\theta satisfies \pi_\theta = sm(q_\theta/\tau), where sm(\cdot) denotes the softmax function and \tau is the temperature parameter used to control the weight of the entropy, a' denotes the action at moment t + 1, and A is the set of available actions.

Since the policy of Soft-DQN is a softmax policy, which is different from the deterministic argmax policy of Q-learning, the policy of Soft-DQN is stochastic and the "log-policy" guiding signal of M-DQN can be calculated. Therefore, M-DQN makes a simple modification to Soft-DQN, replacing r_t in Eq. (2) with r_t + \alpha\tau \ln\pi_\theta(a_t|s_t), i.e.,

\hat{q}_{m-dqn}(s_t, a_t) = r_t + \alpha\tau \ln\pi_\theta(a_t|s_t) + \gamma \sum_{a' \in A} \pi_\theta(a'|s_{t+1}) \left( q_\theta(s_{t+1}, a') - \tau \ln \pi_\theta(a'|s_{t+1}) \right)   (3)

where \pi_\theta = sm(q_\theta/\tau), and Soft-DQN is recovered by setting \alpha = 0. M-DQN not only maximizes the environmental reward while selecting a strategy each time, but also minimizes the Kullback-Leibler divergence [27] between the old and new strategies, which is consistent with the ideas of TRPO [28] and MPO [29]. Minimizing the Kullback-Leibler divergence between the old and new policies leads to an improvement in the performance of M-DQN.

In M-DQN, when the Q value of an action needs to be updated, the Q network is updated directly so that the Q value of that action is raised. The Q network of M-DQN can be understood as fitting a curve to the Q values of a Q-table, and a cross-section of this curve represents the Q value of each action in the current state. For example, as shown in Fig. 4a, when M-DQN updates the value of action 2 in a state, it only updates that action. In DM-DQN, the network gives priority to updating the V value because of the restriction that the A values must sum to zero. The V value is the average of the Q values, and adjusting this average is equivalent to updating all the Q values in that state at once. Therefore, when the network is updated, it not only updates the Q value of a particular action but adjusts the Q values of all actions in that state at once. In Fig. 4b, when action 2 in a state is updated, the V value is updated first, and because this average changes, the Q values of the remaining actions in that state follow. As a result, more values are updated with fewer updates, resulting in faster convergence and the ability to learn the best decisions faster.
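The value–advantage decomposition used by DM-DQN can be sketched as follows. This is a minimal PyTorch illustration under assumed layer sizes, not the network architecture used in the paper:

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Minimal dueling head: Q(s,a) = V(s) + A(s,a) - mean_a A(s,a).
    Layer sizes are illustrative, not the paper's architecture."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())  # common parameters
        self.value = nn.Linear(hidden, 1)               # state value stream V(s)
        self.advantage = nn.Linear(hidden, n_actions)   # advantage stream A(s, a)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        h = self.shared(state)
        v = self.value(h)                               # shape (batch, 1)
        a = self.advantage(h)                           # shape (batch, n_actions)
        # Subtracting the mean forces the advantages to average to zero, so an
        # update to V shifts the Q values of all actions in the state at once.
        return v + a - a.mean(dim=1, keepdim=True)

q_net = DuelingQNetwork(state_dim=28, n_actions=5)      # assumed state/action sizes
q_values = q_net(torch.randn(2, 28))                    # Q values for a batch of two states
```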

The DM-DQN is applied to robot path planning: the value function learns the situation in which the robot does not detect an obstacle, while the advantage function learns the situation in which an obstacle is detected. The two streams are combined as

Q(s, a; \omega, \alpha, \beta) = V(s; \omega, \beta) + \left( A(s, a; \omega, \alpha) - \frac{1}{|A|} \sum_{a' \in A} A(s, a'; \omega, \alpha) \right)

where A is the set of optional actions, \omega is the common parameter of the shared network layers, and \alpha and \beta are the parameters of the advantage stream and the value stream, respectively.

The artificial potential field method is a virtual force method that treats the motion of a robot in its environment as motion under a virtual artificial force field [31]. As shown in Fig. 6, the target point exerts an attractive (gravitational) force on the robot, while obstacles exert repulsive forces on it. The resultant of these forces is the controlling force for the robot's motion, and with this controlling force a collision-free path to the target point can be planned. The attractive force on the robot grows with its distance from the target point, and the repulsive force increases as it approaches an obstacle.

In the artificial potential field, the potential function U is used to create the artificial potential field, where the gravitational (attractive) potential function is expressed as follows:

U_{att}(q) = \frac{1}{2} \zeta d^2(q, q_{goal})   (6)

In Eq. (6), \zeta denotes the gravitational potential field constant and d(q, q_{goal}) denotes the distance between the current point q and the target point q_{goal}.

The expression for the repulsive potential function is as follows:

U_{rep}(q) =
\begin{cases}
\frac{1}{2} \eta \left( \frac{1}{d(q, q_{obs})} - \frac{1}{d_0} \right)^2, & d(q, q_{obs}) \le d_0 \\
0, & d(q, q_{obs}) > d_0
\end{cases}   (7)

In Eq. (7), \eta denotes the repulsive potential field constant, d(q, q_{obs}) denotes the distance between the current point q and the obstacle q_{obs}, and d_0 denotes the influence threshold of the obstacle: only when the distance between the robot and the obstacle is less than this threshold does the obstacle generate a repulsive force.

The expression for the combined potential acting on a mobile robot in an artificial potential field is as follows:

U(q) = U_{att}(q) + U_{rep}(q)

The potential function U(q) of the mobile robot at point q represents the magnitude of the energy at that point, and the force vector at that point is obtained from the gradient \nabla U(q) and is defined as

F(q) = -\nabla U(q)

The gravitational (attractive) force at point q can be obtained by taking the negative derivative of Eq. (6), and is expressed as

F_{att}(q) = -\nabla U_{att}(q) = \zeta \left( q_{goal} - q \right)

The repulsive force at point q can be obtained by taking the negative derivative of Eq. (7), and is expressed as

F_{rep}(q) = -\nabla U_{rep}(q) =
\begin{cases}
\eta \left( \frac{1}{d(q, q_{obs})} - \frac{1}{d_0} \right) \frac{1}{d^2(q, q_{obs})} \nabla d(q, q_{obs}), & d(q, q_{obs}) \le d_0 \\
0, & d(q, q_{obs}) > d_0
\end{cases}

The combined force on a mobile robot in an artificial potential field can then be expressed as

F(q) = F_{att}(q) + F_{rep}(q)

Combining Eqs. (14) and (16), the total reward function can be expressed as

reward = reward_{att} + reward_{rep} + reward_{yaw}

Therefore, the reward function for the mobile robot is expressed as a whole as

reward =
\begin{cases}
reward_{goal}, & d(q, q_{goal}) < r_{goal} \\
reward_{collision}, & d(q, q_{obs}) < r_{obs} \\
reward_{att} + reward_{rep} + reward_{yaw}, & \text{otherwise}
\end{cases}

where r_{goal} denotes the radius of the target area centered on the target point and r_{obs} denotes the radius of the collision area centered on the obstacle.
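To make the composition of this reward concrete, the following sketch combines attractive, repulsive and yaw terms in the spirit of the artificial potential field; the coefficients and the exact forms of reward_att, reward_rep and reward_yaw are illustrative assumptions, not the paper's exact reward equations:

```python
import math

# Illustrative APF-style reward; all constants and term definitions are
# assumptions for this sketch, not the paper's exact reward equations.
ZETA, ETA = 1.0, 0.5          # attractive / repulsive gains
D0 = 1.0                      # obstacle influence threshold d_0
R_GOAL, R_OBS = 0.2, 0.15     # target-area and collision-area radii

def apf_reward(d_goal, d_obs, yaw_error):
    """Compose the reward from attractive, repulsive and yaw terms."""
    if d_goal < R_GOAL:
        return 100.0          # target area reached (illustrative terminal reward)
    if d_obs < R_OBS:
        return -100.0         # collision area entered (illustrative penalty)
    reward_att = -ZETA * d_goal                                    # lower potential near the goal
    reward_rep = -0.5 * ETA * (1.0 / d_obs - 1.0 / D0) ** 2 if d_obs <= D0 else 0.0
    reward_yaw = -abs(yaw_error)                                   # penalize heading deviation
    return reward_att + reward_rep + reward_yaw

r = apf_reward(d_goal=2.0, d_obs=0.8, yaw_error=math.radians(15))
```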

In order to verify the effectiveness of the reward function setting based on the artificial potential field, a comparison reward function that considers only the distance information between the mobile robot and the target point is also used.

The process of the path planning algorithm based on DM-DQN

The algorithm proposed in this paper first estimates the Q value through an online dueling Q network with weights \theta, and the weights \theta are copied to a target network with weights \bar{\theta} every C steps. Second, by interacting with the environment using an \varepsilon-greedy strategy, the robot obtains the reward and the next state according to the reward function based on the artificial potential field. Finally, the transitions (s_t, a_t, r_t, s_{t+1}) are stored in a fixed-size FIFO replay buffer D, and every F steps DM-DQN randomly draws a batch D_t from the replay buffer and minimizes the following loss according to the regression objective of Eq. (8):

L(\theta) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim D_t} \left[ \left( \hat{q}_{dm-dqn}(s_t, a_t) - q_\theta(s_t, a_t) \right)^2 \right]

The complete algorithm process is shown in Algorithm 1. The parameters shown in Table 1 were used throughout the experiment.
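Returning to the update just described, a minimal PyTorch sketch of one DM-DQN training step is given below; the networks are dueling Q networks as sketched earlier, and the hyperparameter values, variable names and the omission of the usual clipping of the log-policy term are assumptions made for brevity, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

# Minimal sketch of one DM-DQN update step; q_net and target_net are dueling
# Q networks, and all hyperparameter values are illustrative.
GAMMA, TAU, ALPHA = 0.99, 0.03, 0.9   # discount, temperature, Munchausen scaling
C_STEPS = 1000                        # target-network synchronization period

def dm_dqn_update(q_net, target_net, optimizer, batch, step):
    s, a, r, s_next = batch           # tensors: states, actions, rewards, next states
    with torch.no_grad():
        # Munchausen term: alpha * tau * ln pi(a_t|s_t), with pi = softmax(q / tau)
        log_pi = F.log_softmax(target_net(s) / TAU, dim=1)
        munchausen = ALPHA * TAU * log_pi.gather(1, a.unsqueeze(1)).squeeze(1)
        # Soft (entropy-regularized) expectation over next actions
        q_next = target_net(s_next)
        pi_next = F.softmax(q_next / TAU, dim=1)
        log_pi_next = F.log_softmax(q_next / TAU, dim=1)
        soft_v = (pi_next * (q_next - TAU * log_pi_next)).sum(dim=1)
        target = r + munchausen + GAMMA * soft_v
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_sa, target)   # regression towards the Munchausen target
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % C_STEPS == 0:           # copy online weights to the target network every C steps
        target_net.load_state_dict(q_net.state_dict())
    return loss.item()
```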

Experimental environments
Robot operating system

This paper uses the ROS (robot operating system) software platform, a robot software development platform that includes system software to manage computer hardware and software resources and to provide services. ROS uses a cross-platform modular communication mechanism organized as a distributed framework of nodes, which greatly improves code reuse. ROS is also highly compatible and open, providing a number of function packages, debugging tools and visualization tools.
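As an illustration of how a learning node can exchange information with the Gazebo simulation over ROS topics, the following minimal rospy sketch subscribes to odometry and laser data and publishes velocity commands; the topic names (/odom, /scan, /cmd_vel) are common defaults and are assumptions here, not necessarily those of the experimental setup:

```python
#!/usr/bin/env python
import rospy
from nav_msgs.msg import Odometry
from sensor_msgs.msg import LaserScan
from geometry_msgs.msg import Twist

# Minimal rospy node sketch; topic names are common defaults and may differ
# from the actual experimental configuration.
class AgentNode(object):
    def __init__(self):
        rospy.init_node('dqn_agent')
        self.cmd_pub = rospy.Publisher('/cmd_vel', Twist, queue_size=1)
        rospy.Subscriber('/odom', Odometry, self.odom_cb)
        rospy.Subscriber('/scan', LaserScan, self.scan_cb)
        self.odom, self.scan = None, None

    def odom_cb(self, msg):
        self.odom = msg               # robot pose and velocity published by Gazebo

    def scan_cb(self, msg):
        self.scan = msg               # LIDAR ranges used to build the state

    def act(self, linear, angular):
        cmd = Twist()                 # action selected by the learned policy
        cmd.linear.x, cmd.angular.z = linear, angular
        self.cmd_pub.publish(cmd)

if __name__ == '__main__':
    node = AgentNode()
    rospy.spin()
```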

The ROS node graph for the experiments is shown in Fig. 7. Gazebo publishes information on a range of topics such as odometry and LIDAR. The DQN algorithm communicates with these topics, so the feedback from the environment can be obtained and the strategy can be learned; the actions to be performed are then output and passed to the Gazebo environment. This allows the algorithm to interact with the simulation environment.

Comparing the four algorithms, as shown in Fig. 10, we find that the remaining three algorithms all rise faster than DQN after 100 rounds. This is because the network structure adopted by Dueling DQN can update multiple Q values at once, while M-DQN benefits from the introduction of maximum entropy, which makes the strategy more random, adds more exploration and thus speeds up subsequent learning. Compared with M-DQN, DM-DQN adopts a dueling network structure that decouples action selection and action evaluation, giving it a faster learning rate, so it can make fuller use of the experience gained from exploring the environment in the early stage and thus obtain a greater reward. As can be seen in Fig. 10, the reward obtained by DQN converges to 2000, the rewards of Dueling DQN and M-DQN converge to 4000, while the reward value of DM-DQN converges to 7000. Therefore, the DM-DQN proposed in this paper is able to obtain a larger reward value compared to the other three algorithms, which means that more target points can be reached.

Seven points were designated for navigation in the test environment, and the robot was expected to explore this environment.

Fig. 9 The robot's reward for each episode based on the four algorithms

In Table 2, the convergence rates of the algorithms are also compared, and it can be seen that DQN took 294 min to obtain a reward of 8000, Dueling DQN took 148 min, M-DQN took 127 min and DM-DQN took 112 min. DM-DQN converged faster than the other algorithms and took less time to reach the target point.

Figure 11 shows the effect of the two different reward functions on path planning: the reward function in (a) considers only the distance between the robot and the target point, while (b) uses the reward function proposed in this paper. From the figure, we can see that the paths in (b) are smoother and farther away from obstacles, which greatly reduces the probability of collision for the robot in a real environment.

Dynamic and static environment

In the dynamic and static environment, we again randomly generated seven target points to test the performance of DQN, M-DQN, Dueling DQN and DM-DQN. Compared to the static environment, two moving obstacles were added, as shown in Fig. 12, where the black object is the moving robot, the brown objects are the static obstacles and the white cylinders are the moving obstacles, which move in random directions.

The DQN, Dueling DQN, M-DQN and DM-DQN algorithms were also used for the path planning task in the dynamic and static environment, and their convergence rates were compared. The cumulative rewards for each round and the average rewards of the agent are recorded in Fig. 13, and Fig. 14 compares the four algorithms. Unlike in the static environment, the reward values of the DQN, Dueling DQN and M-DQN algorithms did not rise significantly after 100 rounds; the upward trend occurs at around round 150, which is caused by the inclusion of dynamic obstacles. The DM-DQN proposed in this paper still starts to converge at around round 120, indicating its good generalization ability compared to the other algorithms.

Table 3 also compares the average number of moves to reach a target point, the number of successful arrivals and the success rate of reaching the target point over 320 rounds; the performance of all four algorithms decreases with the inclusion of dynamic obstacles. The table shows that DM-DQN still has the lowest average number of moves. The method proposed in this paper can effectively solve the problem of dynamic obstacles, because its reward function takes into account the distance from the obstacles, which enables the robot to keep its planned path away from them.