Introduction

The relevance and utility of robot navigation in diverse applications have increased in recent years [1, 2]. Robots currently navigate autonomously in spaces such as hospitals, airports, malls, industrial facilities, and hotels [3,4,5,6]. The main purpose of deploying autonomous robots is their ability to operate in unknown environments and navigate with little or no human intervention [7, 8]. Traditional navigation techniques handle static environments well, but dynamic obstacles still pose challenges [9]. Intelligent mobile robots must nevertheless be capable of navigating complex environments [10, 11]. In such scenarios, combining human motion prediction with robot motion planning is difficult because of the unpredictable and dynamic nature of humans [12]. The robot must be able to evaluate, comprehend, and predict future situations to avoid colliding with objects in a dynamic environment. Effective autonomous navigation also involves mapping and localization, human detection, social rules, and decision-making, which makes it a challenging task [13].

Prior studies have combined robot path planning with the prediction of future human trajectories to avoid overlap between human and robot paths [14,15,16]. These methods, however, suffer from major problems, including the freezing-robot problem, in which the robot freezes or oscillates in dense crowds after deciding that all paths are unsafe [17, 18].

Recent research has demonstrated that reinforcement learning combined with deep learning [19] is an effective way to create socially compliant robot navigation systems [20,21,22]. These learning-based methods formulate human-aware navigation as a Markov decision process (MDP) and use deep V-learning, where the agent chooses an action based on the state value approximated by neural networks [23, 24]. By maximizing the total reward of its actions, the deep reinforcement learning (DRL) technique selects the optimal policy for guiding the robot through environmental interaction [25]. The robot learns the collective importance of neighboring humans to their future states. However, in real-world scenarios where the goal is located far away, these methods are ineffective because the robot makes detours while searching for the optimal path [26, 27]. A map-based mechanism has also been introduced to enable the robot to take safe actions in indoor environments [28].

Most of the existing DRL-based social navigation methods assume homogeneity of agents, i.e., that humans behave similarly [21, 26, 29]. They also assume that human state transitions are well known and well defined. These transitions are, in fact, difficult to model due to the unpredictable nature of human behavior: humans move at different speeds depending on their physical condition, activity, and even size. Therefore, developing techniques that better account for the heterogeneity of human behavior remains a challenge in the design of socially compliant robot navigation systems.

To address these issues, this work designs a reward function that accounts for the safety of the surrounding environment while remaining simple, leading the robot to take collision-free actions. In the proposed method, dynamic warning zones are formulated so that the robot can anticipate the diversity of human behavior in the current state; the robot then takes actions that avoid these zones for safe navigation. To minimize arrival time, the reward function also evaluates the action taken at the current time-step and encourages the robot to choose a route that brings it closer to the desired goal.

Accordingly, the contributions of this study are as follows:

  • Robots are trained with dynamic warning zones around humans, based on their size, speed, and step length, to avoid collisions. The warning zone is a circular sector that the robot avoids crossing in order to keep a comfortable distance from humans.

  • A reward function is proposed that evaluates the distance between the robot’s position and the goal through a short-distance goal. The robot receives a positive reward when it moves closer to the goal and a negative reward when it moves away from it.

  • The proposed navigation system model is integrated into the robot operating system (ROS) for extensive evaluation in three different simulated and real scenarios: empty space, static obstacles, and dynamic obstacles.

The dynamic warning zones and short-distance goals are formulated and then trained and tested with DRL-based navigation methods, leading the robot to take collision-free actions while accounting for unpredictable human behavior in densely crowded environments. The paper is structured as follows: “Related work” reviews prior approaches, followed by the methodology in “Methodology”. The experimental setup, results, and evaluations are presented in “Results and discussion”. Finally, the conclusions are provided in “Conclusions”.

Related work

Crowd navigation is complex because the robot cannot directly acquire each human’s implicit goal, as the environment is only partially observable [10, 30]. Recent research on this problem can be roughly divided into the following two categories:

Reaction-based models

Using attractive and repulsive forces, the social force model [31] is one of the earliest models developed to improve navigation in crowded environments. Other reaction-based models, such as reciprocal velocity obstacles (RVO) [32] and optimal reciprocal collision avoidance (ORCA) [33], compute optimal collision-free velocities for obstacle avoidance in multi-agent navigation. These models have the drawback of being computationally intensive and of requiring well-engineered parameters.

Learning-based methods

DRL algorithms are currently used in robotics to learn optimal control strategies directly from visual inputs for various real-world problems [34]. The popularity of DRL as an end-to-end approach has led to its use for learning collaborative behavior in crowd-aware scenarios [35], as well as for socially compliant, obstacle-aware avoidance [29, 36, 37]. Additionally, imitation learning can be used to discover policies from demonstrations of desired behaviors [38, 39]. Human interactions are addressed by collision avoidance with DRL [23]. The socially aware DRL approach has been extended with a long short-term memory (LSTM) architecture [24], which converts each human state into a fixed-length vector in reverse order of distance from the robot. All these methods use deep V-learning, where the planner calculates the value of all possible next states and, based on the agents’ state transitions, chooses the action that leads to the highest-valued state. Supervised and reinforcement learning have been combined by considering human behaviors in human–robot interaction [26, 40, 41]. To fully capture the spatial–temporal interactions between pedestrians and robots for autonomous navigation in a dynamic environment, a spatial–temporal state transformer has been created [42]. Nevertheless, the approaches mentioned above do not perform well in complex environments that involve real-time human behavior. Using a decentralized collision avoidance method, a reinforcement learning framework has been proposed in which each agent makes its own decisions independently, without communicating with any other agent [43].

Although DRL-based methods have demonstrated improved performance in crowded environments, they still suffer from a number of drawbacks, such as not adapting to human behavior by considering size and speed, or failing to find a path in some situations. These drawbacks substantially limit the application of these methods to real-world situations.

This manuscript proposes a method by which a robot can learn to adapt its behavior by considering comfort and safety. To accomplish this, dynamic warning zones are created as circular sectors around humans. To find an efficient and smooth path, a short-distance goal is integrated with DRL to intelligently select waypoints, thus aiding the robot in reaching its goal by maintaining an optimized course.

Methodology

Problem formulation

In this work, the crowd-aware robot navigation problem is formulated as a sequential decision-making problem in a reinforcement learning framework. The robot interacts in an environment comprising n humans, where all agents (humans and robot) move in a two-dimensional plane. The state of an agent at any time t consists of its position \(\bar{p}=[x,y]^T\), velocity \(\bar{v}=[v_x,v_y]^T\), and radius r. These constitute the observable state, which can be observed by all agents. The unobservable state comprises the goal position \(\bar{g}=[g_x,g_y]^T\), the preferred velocity \(\bar{v}_{p}\) of each agent with magnitude \(v_p\), and the orientation \(\theta _\rho \). Here, the subscript p denotes the preferred parameters, while the subscript \(\rho \) denotes the robot parameters.

The robot state \(O_{\rho }\) at any time t is given by:

$$\begin{aligned} O_{\rho }(t)=\left[ \bar{p}_{\rho },\bar{v}_{\rho },r_{\rho },\bar{g}_{\rho },v_{p\rho },\theta _{\rho }\right] , \end{aligned}$$
(1)

and each human’s state \(O_{h}\) is given by:

$$\begin{aligned} O_{h}(t)=[\bar{p}_h,\bar{v}_h,r_{h},v_{ph}], \end{aligned}$$
(2)

where h is the index of each human, with \(1\le h\le n\). From a geometric perspective and for better parametrization, a robot-centric coordinate system is used, with the origin at the robot's current position and the x-axis pointing towards the goal [23]. In this system, the robot state becomes:

$$\begin{aligned} O_{\rho }(t)=\left[ \bar{d}_{\rho g},\bar{v}_{\rho },r_{\rho },v_{p\rho }\right] , \end{aligned}$$
(3)

where \(\bar{d}_{\rho g}\) represents the distance of the robot from the goal and is given as \(\bar{d}_{\rho g}=\parallel {\bar{p}_{\rho }} - {\bar{g}}\parallel _2\), and:

$$\begin{aligned} O_{h}(t)=[\bar{d}_{\rho h},\bar{p}_h,\bar{v}_h,r_{h}], \end{aligned}$$
(4)

where \(\bar{d}_{\rho h}=\parallel {\bar{p}_{\rho }} - {\bar{p}_h}\parallel _2\) is the robot’s distance to neighbor h.

The robot state and the observable human states are combined to form the joint state J of the system at time t, which is given by:

$$\begin{aligned} J(t)=[O_{\rho },O_1,...,O_{h},...,O_{n}]. \end{aligned}$$
(5)
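For concreteness, the states in (1)–(5) can be represented as follows. This is a minimal Python sketch; the class and attribute names are illustrative and not taken from the authors’ implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ObservableState:
    """Components observable by all agents: position p, velocity v, radius r."""
    px: float
    py: float
    vx: float
    vy: float
    radius: float

@dataclass
class RobotState(ObservableState):
    """Robot full state: adds the unobservable goal, preferred speed, and heading."""
    gx: float
    gy: float
    v_pref: float
    theta: float

@dataclass
class JointState:
    """Joint state J(t) = [O_rho, O_1, ..., O_n] fed to the value network."""
    robot: RobotState
    humans: List[ObservableState]
```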

At each time-step t, the robot takes an action \(a(t)\in A\), where A represents the action space, according to its policy \(\pi (J)\). As a result, the robot receives a reward \(R(J,a)\) for selecting that action. The optimal value function \(V^*(J)\) accumulates the expected discounted rewards obtained under the optimal action at each time-step t and is computed as:

$$\begin{aligned} V^*(J)=\sum \limits _{k=0}^T \gamma ^{k\Delta t v_{p}} R(J,a^*), \end{aligned}$$
(6)

where \(\gamma \) is a discount factor with \(0\le \gamma \le 1\), and \(a^*\) is the optimal action. As mentioned before, \(v_{p}\) is the preferred velocity, used here as a normalization factor for numerical reasons [26]. \(R(J,a^*)\) is the reward obtained at each time-step t under the optimal action.

Fig. 1 The overview of the deep reinforcement learning framework

With the optimal value function \(V^*(J)\) from (6), the optimal policy is obtained by selecting the action with the maximum cumulative reward, as follows:

$$\begin{aligned} \begin{aligned} \pi ^{*}(J)&= \underset{a}{\arg \max }\ R(J,a) \\&\quad + \gamma ^{k\Delta t v_{p}}\int _{J^{+}} P(J,a,J^{+})\,V^*(J^{+})\,\textrm{d} J^{+}, \end{aligned} \end{aligned}$$
(7)

where \(P(J,a,J^{+})\) is the probability of the unknown state transition from t to \(t^+\). The integral in (7) is approximated by assuming that the human travels with a constant velocity for a short time interval t to \(t^+\), where \(t^+ = t+\Delta t\). Therefore, the next joint state can be estimated, and the optimal policy is simplified as:

$$\begin{aligned} \pi ^{*}(J) = \underset{a}{\arg \max }\ R(J,a) + \gamma ^{k\Delta tv_{p}} V^*(J^{+}). \end{aligned}$$
(8)
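The policy in (8) amounts to a greedy one-step lookahead over a discretized action space. The sketch below illustrates this; `propagate`, `value_net`, and `reward_fn` are placeholder callables for illustration, not the authors’ implementation.

```python
def select_action(joint_state, actions, value_net, reward_fn, propagate,
                  gamma=0.9, dt=0.25, v_pref=1.0):
    """Greedy one-step lookahead of Eq. (8).

    `propagate` rolls the joint state forward by dt assuming humans keep a
    constant velocity; `value_net` approximates V*(J+); `reward_fn` is R(J, a).
    """
    best_action, best_value = None, float("-inf")
    for action in actions:
        next_state = propagate(joint_state, action, dt)   # constant-velocity rollout
        lookahead = (reward_fn(joint_state, action)
                     + gamma ** (dt * v_pref) * value_net(next_state))
        if lookahead > best_value:
            best_action, best_value = action, lookahead
    return best_action
```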
Fig. 2 Dynamic warning zones based on zero to the maximum velocity of the humans with different radii

Figure 1 shows the DRL architecture. Observations such as position and velocity are provided to the deep neural network framework. With a pre-trained value network N, a path is planned by selecting actions that maximize the cumulative reward. The proposed reward function produces an optimal path by incorporating dynamic warning zones and short-distance goals to ensure comfort and safety. Additionally, three different groups of humans are simulated in the training environment: adults over 18 years of age, adolescents between 12 and 17 years of age, and children between 5 and 12 years of age [44]. The sector radii and speeds are assigned to each group during training according to their step length and walking speed.

Algorithm design

A detailed description of the proposed algorithm is provided in this section.

Dynamic warning zones

To emphasize navigational safety, a realistic dynamic environment is modeled by incorporating human behavior patterns. Human behavior and speed differ according to the characteristics of each individual. Therefore, to simplify human behavior, three groups are selected: adults, adolescents, and children. During training, the diameter of each circle represents the size of the person: circles with diameters of 0.41–0.6 m represent adults, 0.31–0.4 m adolescents, and 0.2–0.3 m children. Depending on the activity, such as walking slowly, walking rapidly, or running, the velocity varies from 0.5 to 0.6 m/s, 0.61 to 0.8 m/s, and 0.81 to 1.8 m/s, respectively [44, 45]. To accommodate the robot’s hardware limitations, the human velocity is limited to 1.8 m/s in this study.
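The group and activity parameters above can be tabulated and sampled per training episode as in the following sketch; the dictionary and function names are illustrative, not part of the authors’ code.

```python
import random

# Body-circle diameter ranges per group (m) and speed ranges per activity (m/s),
# following the values stated above.
GROUP_DIAMETER = {"adult": (0.41, 0.60), "adolescent": (0.31, 0.40), "child": (0.20, 0.30)}
ACTIVITY_SPEED = {"slow_walk": (0.50, 0.60), "fast_walk": (0.61, 0.80), "run": (0.81, 1.80)}

def sample_human(group: str, activity: str):
    """Sample a human radius (m) and speed (m/s) for one training episode."""
    d_low, d_high = GROUP_DIAMETER[group]
    v_low, v_high = ACTIVITY_SPEED[activity]
    radius = random.uniform(d_low, d_high) / 2.0     # diameter -> radius
    speed = min(random.uniform(v_low, v_high), 1.8)  # capped at 1.8 m/s (hardware limit)
    return radius, speed
```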

The dynamic warning zones are circular sectors with radius \(r_\textrm{wz}\) based on human velocity \(v_\textrm{h}\) and the step length \(h_\textrm{gait}\), given by:

$$\begin{aligned} r_\textrm{wz} = 0.8 v_\textrm{h} + h_\textrm{gait}. \end{aligned}$$
(9)
Fig. 3 A local goal \(l_\textrm{g}\) is selected on the global path at time t, within the range of look-ahead distance \(d_\textrm{la}\) centered at the robot’s position

A gain of 0.8, found through experimentation, is applied to limit the elongation of the tangent lines at the maximum velocity. The values assigned to \(h_\textrm{gait}\) are obtained from measurement trials: the maximum length each individual covers in one step corresponds to 0.8 m for an adult, 0.65 m for an adolescent, and 0.45 m for a child. Figure 2 shows the geometry of the dynamic warning zone.

The warning zone angle \(\phi _\textrm{wz}\) is formulated based on [46]; it decreases exponentially with the magnitude of the human velocity \(v_\textrm{h}\) and is expressed as:

$$\begin{aligned} \phi _\textrm{wz} = 2.2\pi e^{-1.7v_\textrm{h}} + 0.2\pi . \end{aligned}$$
(10)

This formulation differs from [46] in the factors 2.2, 1.7, and \(0.2\pi \). These values are based on experimental trials and enlarge the angle of the zone; the associated reward then penalizes entering it as a safety precaution. This allows the robot to react earlier and anticipate the movement of humans from a greater distance, thus avoiding collisions and improving safety.
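Combining (9) and (10), the zone geometry for one human could be computed as in the following minimal sketch; the gait-length table mirrors the values stated above, and the function name is ours.

```python
import math

# Maximum step length h_gait (m) per group, from the measurement trials above.
H_GAIT = {"adult": 0.80, "adolescent": 0.65, "child": 0.45}

def warning_zone(v_h: float, group: str):
    """Radius (Eq. 9) and opening angle (Eq. 10) of the dynamic warning zone."""
    r_wz = 0.8 * v_h + H_GAIT[group]                               # Eq. (9)
    phi_wz = 2.2 * math.pi * math.exp(-1.7 * v_h) + 0.2 * math.pi  # Eq. (10)
    return r_wz, phi_wz
```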

Short-distance goal

The primary objective of robot navigation is to reach the goal safely and efficiently. Based on the Dijkstra algorithm [47], the robot follows a trajectory from the initial point to the goal. To reach the goal, the algorithm generates waypoints from the occupancy grid map. A local short-distance goal, \(l_\textrm{g}\), is introduced to ensure the robot keeps following the waypoints even if it deviates while avoiding a dynamic or static obstacle. Figure 3 illustrates the short-distance goal.

The robot’s local goal is set to the closest waypoint within the look-ahead distance \(d_\textrm{la}\), measured from the robot’s current position \(\bar{p}_\rho \). This procedure updates the joint state J and continues until the local goal coincides with the final goal. The reward function also evaluates whether the robot is approaching the goal by comparing \(d_{\rho g}(t)\) and \(d_{\rho g}(t^-)\), the distances from the robot’s position to the goal at the current time-step t and the previous time-step \(t^-\). It encourages the robot to move towards the goal and penalizes it if it chooses a course that leads away from the goal.
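The local-goal selection could be implemented as in the sketch below. It reflects one interpretation of the scheme above (advance along the Dijkstra waypoints and keep the one nearest the look-ahead boundary, i.e., the farthest waypoint still in range); the function and argument names are illustrative.

```python
import math

def select_local_goal(waypoints, robot_pos, final_goal, d_la=2.5):
    """Select the local goal l_g on the global (Dijkstra) path.

    Among the waypoints inside the look-ahead circle of radius d_la, keep the
    one closest to its boundary. Once the final goal is within range, it
    becomes the local goal.
    """
    rx, ry = robot_pos
    if math.hypot(final_goal[0] - rx, final_goal[1] - ry) <= d_la:
        return final_goal
    in_range = [(wx, wy) for wx, wy in waypoints
                if math.hypot(wx - rx, wy - ry) <= d_la]
    if not in_range:
        return waypoints[0]  # fall back to the first waypoint on the path
    return max(in_range, key=lambda w: math.hypot(w[0] - rx, w[1] - ry))
```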

Reward function

The reward formulation in existing DRL models for social navigation penalizes only collisions or uncomfortable distances [26]. The proposed reward function takes into account the safety between humans and robots and, at the same time, encourages the robot to keep approaching the goal, thereby minimizing navigation time. According to the proposed formulation, once the robot reaches the goal position it receives a positive reward of 10, whereas actions that lead to a collision are penalized with \(-0.1\). If the robot comes closer to a human than the tolerable distance, it is penalized further.

As mentioned before, the robot is penalized if it violates the distance acceptable for human comfort while navigating, as:

$$\begin{aligned} 0.25(d_{\rho h} - d_\textrm{min}),~~\text {if}~~ 0<d_{\rho h}<d_\textrm{min}, \end{aligned}$$

where \(d_{\rho h}\) represents the robot’s distance to neighbor h, and \(d_\textrm{min}\) is the discomfort distance, which is set to 0.2 m [35].

Another reward term is formulated based on the dynamic warning zone, where \(r_\textrm{wz}\) is the radius of the dynamic warning zone and \(r_\textrm{h}\) is the human radius according to the human’s group. The robot is penalized when \(d_{\rho h} \le r_\textrm{wz}-r_\textrm{h} \), as:

$$\begin{aligned} 0.2 \left( e^{(d_{\rho h} -r_\textrm{wz} -r_{h})} - 0.3\right) . \end{aligned}$$

To encourage the robot to keep moving towards the goal, a further reward term is established based on the short-distance goal by calculating the difference between the goal distances at the previous time-step \(t^-\) and the current time-step t. This difference is multiplied by a constant gain of 0.01, chosen through experimental trials, giving:

$$\begin{aligned} 0.01\left( {d}_{\rho g}(t^-) - {d}_{\rho g}(t)\right) , \end{aligned}$$

where \({d}_{\rho g}(t^-)\) and \({d}_{\rho g}(t)\) denote the robot’s distance to the goal \(\bar{d}_{\rho g}=\parallel {\bar{p}_{\rho }} - {\bar{g}_\rho }\parallel _2\) at times \(t^-\) and t, respectively.

During each time-step, the robot’s progress is evaluated. The robot is rewarded if it advances towards the goal position and is penalized if it diverges from its goal.

The final reward \(R_\textrm{wz}\) is calculated by combining all the aforementioned reward functions as follows:

$$\begin{aligned} R_\textrm{wz}=\left\{ \begin{array}{@{}ll@{}} 10, &{} \text {if}\ \text {reach the goal} \\ -0.1, &{} \text {if}\ d_{\rho h}<0 \\ 0.25(d_{\rho h} - d_\textrm{min}), &{} \text {if}\ 0<d_{\rho h}<d_\textrm{min} \\ 0.2 \left( e^{(d_{\rho h} -r_\textrm{wz}-r_{h})} - 0.3\right) , &{} \text {if}\ {d_{\rho h} \le r_\textrm{wz}-r_{h}} \\ 0.01\left( {d}_{\rho g}(t^-) - {d}_{\rho g}(t)\right) , &{} \text {otherwise.} \\ \end{array}\right. \nonumber \\ \end{aligned}$$
(11)
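The full reward can be transcribed directly into a per-step callback, as in the sketch below; the discomfort term follows the penalty form \(0.25(d_{\rho h}-d_\textrm{min})\) stated above, and the function signature and variable names are illustrative.

```python
import math

def reward_wz(d_rho_h, d_rho_g_prev, d_rho_g, r_wz, r_h,
              reached_goal, d_min=0.2):
    """Reward R_wz of Eq. (11) for a single time-step.

    d_rho_h is the separation between the robot and the nearest human (negative
    if the bodies overlap), d_rho_g_prev / d_rho_g are the goal distances at the
    previous and current time-steps, and r_wz, r_h are the warning-zone and
    human radii.
    """
    if reached_goal:
        return 10.0
    if d_rho_h < 0:                                  # collision
        return -0.1
    if 0 < d_rho_h < d_min:                          # inside the discomfort distance
        return 0.25 * (d_rho_h - d_min)              # negative penalty
    if d_rho_h <= r_wz - r_h:                        # inside the dynamic warning zone
        return 0.2 * (math.exp(d_rho_h - r_wz - r_h) - 0.3)
    return 0.01 * (d_rho_g_prev - d_rho_g)           # progress toward the goal
```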

Results and discussion

To improve clarity and organization, this section is divided into several sub-sections. A comprehensive description of the training setup is provided first, followed by a detailed analysis of the quantitative and qualitative evaluations and a statistical analysis of the results. The section closes with a detailed description of the implementation and experimental setup.

Training setup

To evaluate the proposed model, the robot is trained in an environment with five and ten simulated humans belonging to the different categories described in the previous section. During training and testing, the humans navigate from random positions in circular and square crossing scenarios, where each human’s goal is diametrically opposite their initial position. Additionally, the robot navigates in two modes: invisible and visible. In the invisible mode, humans react only to the movements of other humans, without taking the robot’s movements into account. In the visible mode, humans also react to the robot’s movement.

The model is trained for 10,000 episodes, with a learning rate of 0.001 and a discount factor \(\gamma \) set to 0.9. In the first 4000 episodes, the exploration rate \(\epsilon \) of the greedy policy decreases linearly from 0.5 to 0.1 and then remains unchanged for the remaining episodes. The value network is reinforced using mini-batches of size 100 after every 1000 episodes. The robot is tested in 500 test scenarios with the same configuration. A comparison of the reinforcement learning parameters used in training and testing is presented in Table 1.
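The exploration schedule described above can be written compactly, as in the minimal sketch below under the stated hyperparameters (the learning rate of 0.001 and \(\gamma = 0.9\) are set elsewhere in the trainer); the function name is ours.

```python
def epsilon(episode, eps_start=0.5, eps_end=0.1, decay_episodes=4000):
    """Epsilon-greedy exploration rate: linear decay from 0.5 to 0.1 over the
    first 4000 episodes, then constant for the remaining episodes."""
    if episode >= decay_episodes:
        return eps_end
    return eps_start + (eps_end - eps_start) * episode / decay_episodes
```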

Table 1 Reinforcement learning parameters

The humans are represented by circles with random diameters from 0.2 to 0.6 m, corresponding to the sizes of children, adolescents, and adults. Human speed is also random and varies between 0.5 and 1.8 m/s according to activities such as walking slowly, walking rapidly, and running. The time-step \(\Delta t\) is 0.25 s and the maximum navigation time is 25 s. The look-ahead distance \(d_\textrm{la}\) of the local goal is 2.5 m, and the total distance from the start point to the endpoint is 8 m.

The proposed method is compared with existing reward functions on three state-of-the-art DRL methods: collision avoidance with deep reinforcement learning (CADRL), long short-term memory (LSTM-RL), and social attention with reinforcement learning (SARL). The network architectures of all methods are kept unchanged. The proposed reward function \(R_\textrm{wz}\) is also compared with three previously proposed reward functions \(R_{1}\) [26], \(R_{2}\) [24], and \(R_{3}\) [46], given in (12), (13), and (14), respectively.

$$\begin{aligned}{} & {} R_{1}=\left\{ \begin{array}{@{}ll@{}} 1, &{} \text {if}\ \text {reach the goal} \\ -0.25, &{} \text {else if}\ \text {collision} \\ -0.1 + d_{\rho h}/2, &{} \text {else if}\ \text {uncomfortable} \\ 0, &{} \text {otherwise.} \\ \end{array}\right. \end{aligned}$$
(12)
$$\begin{aligned}{} & {} R_{2}=\left\{ \begin{array}{@{}ll@{}} 1, &{} \text {if}\ \text {reach the goal} \\ -0.25, &{} \text {else if}\ \text {collision} \\ -0.1 + 0.05 d_{\rho h}, &{} \text {else if}\ \text {uncomfortable} \\ 0, &{} \text {otherwise.} \\ \end{array}\right. \end{aligned}$$
(13)
$$\begin{aligned}{} & {} R_{3}=\left\{ \begin{array}{@{}ll@{}} 1, &{} \text {if}\ \text {reach the goal} \\ -0.1, &{} \text {else if}\ \text {collision} \\ -d_{\rho h}\left( \frac{-0.1}{r_{dz}-0.3}\right) +0.1, &{} \text {else if}\ \text {danger zone} \\ 0, &{} \text {otherwise.} \\ \end{array}\right. \nonumber \\ \end{aligned}$$
(14)
Table 2 Quantitative evaluation results in the invisible and visible circular crossing scenarios with 5 humans
Table 3 Quantitative evaluation results in the invisible and visible square crossing scenarios with 5 humans

While previous reward formulations encourage the robot to avoid collisions during navigation, they do not explicitly account for safety along the rest of its trajectory. The reward function \(R_3\) [46] imposes a penalty for entering a danger zone; however, it assumes that the agent is placed in an environment with dynamic obstacles of the same size, which idealizes the agent’s state and its surrounding crowd. Moreover, it does not evaluate the distance between the robot and the goal, resulting in long arrival times and non-smooth routes.

Quantitative evaluations

Several metrics are used to evaluate robot navigation performance: success rate (SR), collision rate (CR), total time (T), path length (PL), and discomfort rate (DR). SR is the rate of trials in which the robot reaches the goal without collision, whereas CR is the rate of trials in which the robot collides with a human. T is the average time taken by the robot to reach its goal. PL is the average length of the path traveled by the robot in successful trials, and DR represents how often the robot crosses the warning zone in \(R_\textrm{wz}\) and \(R_{3}\), or the discomfort distance in \(R_{1}\) and \(R_{2}\). Navigation performance with the proposed reward function is first evaluated in terms of the success, collision, and discomfort rates.
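The metrics above could be aggregated from per-episode logs as in the sketch below; the dictionary keys are assumptions of ours, and DR is computed here as the fraction of time-steps spent inside the warning (or discomfort) zone, which is one possible interpretation of the discomfort rate.

```python
def navigation_metrics(episodes):
    """Aggregate SR, CR, T, PL, and DR from per-episode logs.

    Each episode is assumed to be a dict with keys 'success', 'collision',
    'time', 'path_length', 'discomfort_steps', and 'steps'.
    """
    n = len(episodes)
    successes = [e for e in episodes if e["success"]]
    sr = len(successes) / n
    cr = sum(e["collision"] for e in episodes) / n
    t = sum(e["time"] for e in successes) / max(len(successes), 1)
    pl = sum(e["path_length"] for e in successes) / max(len(successes), 1)
    dr = (sum(e["discomfort_steps"] for e in episodes)
          / max(sum(e["steps"] for e in episodes), 1))
    return {"SR": sr, "CR": cr, "T": t, "PL": pl, "DR": dr}
```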

Table 4 Quantitative evaluation results in visible circular and square crossing scenarios with 10 humans

Navigation performance

Navigation performance is evaluated in two modes: invisible and visible modes with circular and square crossing scenarios. Tables 2 and 3 list the quantitative evaluations of experiments with five humans in both modes and scenarios, respectively. SARL, CADRL, and LSTM are evaluated with \(R_1\), \(R_2\), \(R_{3}\) and the proposed \(R_\textrm{wz}\).

In the invisible mode, the performances of CADRL\((R_1)\) and CADRL\((R_2)\) are the lowest, as shown in Tables 2 and 3. With the proposed method, CADRL\((R_\textrm{wz})\) and CADRL\((R_3)\) improve the SR, reaching 0.77. The lowest CR and DR are obtained by CADRL\((R_\textrm{wz})\).

On the other hand, the performance of LSTM-RL\((R_\textrm{wz})\) increases, achieving an SR of 0.96 with a low CR of 0.03 and a DR of 0.44, compared to the previous LSTM-RL\((R_1)\) and LSTM-RL\((R_2)\) models. Unlike CADRL\((R_3)\) and LSTM-RL\((R_3)\), CADRL\((R_\textrm{wz})\) and LSTM-RL\((R_\textrm{wz})\) present the lowest DR, demonstrating greater safety and comfort during robot navigation.

As shown in Tables 2 and 3, the SR for the SARL baseline is the highest in both the circular and square scenarios. Furthermore, SARL\((R_1)\) and SARL\((R_2)\) are superior in terms of CR, with the lowest average of 0.02. However, compared with SARL\((R_3)\), the proposed SARL\((R_\textrm{wz})\) has a higher SR of 0.98 and a lower DR, demonstrating that the robot enters the discomfort distance of humans less frequently. Because humans ignore the robot in the invisible mode, collisions and close crossovers are more likely.

Fig. 4 Results from circular crossing scenario with all methods plotted in spider graph. a CADRL model results in invisible mode. b CADRL model results in visible mode. c LSTM-RL model results in invisible mode. d LSTM-RL model results in visible mode. e SARL model results in invisible mode. f SARL model results in visible mode

Fig. 5 Results from square crossing scenario with all methods plotted in spider graph. a CADRL model results in invisible mode. b CADRL model results in visible mode. c LSTM-RL model results in invisible mode. d LSTM-RL model results in visible mode. e SARL model results in invisible mode. f SARL model results in visible mode

In the visible mode, the humans interact with the robot, and the SR and CR improve for all methods. CADRL\((R_\textrm{wz})\) performs better, achieving a high SR of 0.94 and a low CR of 0.06. Nevertheless, its DR is still high because CADRL does not consider complex interactions between agents, so the robot is more likely to take paths that lead it into a group of humans. LSTM-RL with rewards \(R_1\) and \(R_2\) shows similar SR and CR. Although LSTM-RL\((R_3)\) achieves comparable success and collision rates, LSTM-RL\((R_\textrm{wz})\) performs better, obtaining higher safety with a low DR.

Table 4 shows the quantitative evaluation results in the visible circular and square crossing scenarios with 10 humans. The higher crowd density reduces the performance of CADRL and LSTM-RL with rewards \(R_1\) and \(R_2\). CADRL\((R_\textrm{wz})\) and LSTM-RL\((R_\textrm{wz})\), however, enhance the SR while reducing CR and DR. LSTM-RL\((R_\textrm{wz})\) nevertheless shows a higher DR than at the lower crowd density, as shown in the table.

In terms of SR and CR, the SARL performs better in all scenarios. Nonetheless, SARL\((R_\textrm{wz})\) is able to achieve higher SRs of 0.99 and 0.98, and better CRs close to 0 in both circular and square scenarios.

SARL\((R_\textrm{wz})\) also has the lowest DR, demonstrating that the robot prioritizes safety. By adapting its navigation behavior for environments with humans and their variety of speeds, SARL\((R_\textrm{wz})\) shows the best performance in the different scenarios and crowd densities.

Fig. 6 Local trajectory comparison with CADRL method in a random test episode with five humans in a circle crossing scenario. The experimental conditions have the same starting point, goal position, and time-step

Fig. 7 Local trajectory comparison with LSTM-RL method in a random test episode with five humans in a square crossing scenario. The experimental conditions have the same starting point, goal position, and time-step

Fig. 8 Local trajectory comparison with SARL method in a random test episode with ten humans in a circle crossing scenario. The experimental conditions have the same starting point, goal position, and time-step

Efficiency

Efficiency is assessed by the average time (T) taken by the robot to reach the goal and the average length of the path traveled in successful runs (PL), supported by SR, CR, and DR. The results are presented in Tables 2, 3, and 4.

Since the proposed reward function also evaluates the trajectory of the robot, the PL is lower than that of the previous methods with rewards \(R_1\), \(R_2\), and \(R_3\) in all experimented scenarios, showing that the proposed method significantly decreases the successful path length. Furthermore, the proposed method consumes less time to reach the goal with fewer chances of collision, as demonstrated by the SR, CR, and DR in the tables. Consequently, the proposed method requires less time and travels a shorter distance. SARL\((R_\textrm{wz})\) is the most efficient overall, since the path it maintains minimizes arrival time while achieving fewer collisions.

A complete overview of the performance of CADRL, LSTM-RL, and SARL in all metrics is given in Figs. 4 and 5. All metrics results are plotted in spider graphs, and the time (T) and path length (PL) are normalized for better visualization.

Qualitative evaluation

For qualitative evaluation, all the methods are tested under the same experimental conditions, i.e., the same starting point, goal position, and time-step. Figure 6 shows the CADRL performance in a random test episode with five humans in a circle crossing scenario. CADRL\((R_\textrm{wz})\) stops and moves backward to avoid the humans surrounding it. The robot exits without collision and avoids the warning zones around the humans, taking 8.25 s to reach the goal. CADRL\((R_\textrm{wz})\) tends to follow a smoother path than CADRL with the previous formulations.

With LSTM-RL\((R_\textrm{wz})\), on the other hand, the robot moves backward and turns right before resuming its heading towards the goal. Figure 7 illustrates the trajectories taken by the robot with all the formulations, demonstrating that with the proposed reward formulation the robot reaches the goal in less time and along a smoother path. At 8.75 s, the robot reaches the goal with LSTM-RL\((R_\textrm{wz})\), while with the other methods it is still on its way towards the goal.

Figure 8a shows how SARL\((R_1)\) makes larger deviations at first to avoid collisions with humans and therefore takes longer to find and reach the goal. During crossovers, the SARL\((R_2)\) model attempts to find a path through the middle, bringing the robot uncomfortably close and at a higher risk of collision, as shown in Fig. 8b. SARL\((R_3)\) initially stops and backtracks, losing sight of the goal; it finally chooses a path that intersects the humans, which forces more maneuvers and takes more time to reach the goal, as shown in Fig. 8c. The robot trajectory produced by SARL\((R_\textrm{wz})\) is shown in Fig. 8d. SARL\((R_\textrm{wz})\) avoids all humans by a larger margin, taking a wider detour to the left, anticipating the warning zones and managing to avoid head-on human contact.

As seen from the experiments, SARL shows the best performance when combined with the proposed reward formulation. The robot moves away from the humans and takes open routes to guarantee safe navigation.

Statistical analysis

Statistical analysis is performed to compare the proposed method against the baseline methods. To ensure reliable data comparison, the Kolmogorov–Smirnov test [48] is used to check for normality. The test revealed that several metrics have a non-Gaussian distribution; therefore, the non-parametric Wilcoxon signed-rank test is used to identify differences and significance levels. The Wilcoxon signed-rank test produces a test statistic (z-score), which is converted into a p value; a smaller p value denotes a larger difference between the populations. The threshold for the p value is 0.05: if the p value is less than 0.05, pair 1 is less than pair 2, rejecting the null hypothesis and accepting the alternative hypothesis. The one-sided alternative hypothesis with the 'less' option is used. The results in Table 5 demonstrate that for the success rate the null hypothesis is rejected; thus, the proposed method is significantly better than the other methods. In contrast, the metrics of collision rate, time, path length, and discomfort rate do not reject the null hypothesis, indicating that the proposed method improves these metrics in comparison to previous methods.
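The paired comparison above can be reproduced with SciPy's Wilcoxon signed-rank test, as in the sketch below; the arrays are placeholder values for illustration, not results from the paper.

```python
from scipy.stats import wilcoxon

# Placeholder paired samples: the same metric (e.g., collision rate) measured for
# the proposed method and a baseline over the same set of test configurations.
proposed = [0.02, 0.01, 0.03, 0.00, 0.02, 0.01]
baseline = [0.05, 0.03, 0.08, 0.04, 0.09, 0.02]

# One-sided test ("less"): is the proposed method's metric significantly smaller?
stat, p_value = wilcoxon(proposed, baseline, alternative="less")
print(f"W = {stat:.3f}, p = {p_value:.4f}")  # p < 0.05 -> reject the null hypothesis
```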

Table 5 Statistical analysis of quantitative evaluation results using Wilcoxon signed-rank test
Fig. 9 Performance comparison of SARL\((R_1)\), SARL\((R_2)\), SARL\((R_\textrm{wz})\) in an empty scenario in Gazebo-ROS. The red dots denote the path that the robot must follow. The purple dots show the trajectory performed by the robot

Implementation and experimental setup

Since the robot considers the dynamic warning zones of all surrounding humans, the proposed reward function reduces the risk of the robot freezing. Based on the prior training results, the best outcomes were obtained with the SARL method, owing to its modeling of the collective importance of the crowd. Therefore, to assess the effectiveness of SARL with the proposed method, Gazebo simulations and real-time experiments are conducted.

Table 6 Navigation time (seconds) in empty, static obstacles, and dynamic obstacles scenario on Gazebo-ROS simulator

The Turtlebot 3 is used for the experimental tests, and it has some hardware limitations to consider. These include the limited computational power of a Raspberry Pi 3, which constrains how quickly the robot's actions can be processed. To address this limitation, the Raspberry Pi 3 is connected to a remote PC that processes in real time the observable states of the humans and the full state of the robot. Since the maximum velocity of the Turtlebot 3 is 0.23 m/s, the robot may be unable to act promptly and avoid obstructions if a human's velocity surpasses its own. To overcome this constraint, the robot was trained under the assumption that the maximum velocity a human can reach is 1.8 m/s. The Turtlebot 3 is originally equipped with a 2D laser distance sensor (LDS) with a range of 3.5 m. To extend the sensing range, it was fitted with a Hokuyo LIDAR with a detection distance of 5.6 m and a 1% margin of error.

Gazebo-ROS simulation

The reward functions \(R_\textrm{wz}\), \(R_{1}\), and \(R_{2}\) from equations (11), (12), and (13), respectively, are evaluated in Gazebo simulations using ROS within three scenarios: empty space, static obstacles, and dynamic obstacles. The following packages are used in the simulations: GMapping [49] for SLAM, providing a 2D occupancy grid map of the environment from a 2D laser distance sensor (LDS); the AMCL package [50] for the robot's localization; and the leg-detector package [51] for estimating the position and velocity of the dynamic obstacles.

Figure 9 shows the robot navigation in an empty scenario with a goal distance of 5 m. The red dots denote the path that the robot must follow, and the purple dots show the trajectory performed by the robot. Table 6 compares the SARL approach with the reward functions \(R_1\), \(R_2\), and \(R_\textrm{wz}\) for goal distances of 5, 10, and 15 m. All three models navigate at a similar speed, and the navigation time increases with the goal distance. SARL\((R_1)\) and SARL\((R_2)\) are slower to follow the path and their courses become irregular, unlike SARL\((R_\textrm{wz})\), which maintains steady navigation without changing direction.

Fig. 10 Performance comparison of SARL\((R_1)\), SARL\((R_2)\), SARL\((R_\textrm{wz})\) in static obstacles scenario in Gazebo-ROS. The red dots denote the path that the robot must follow. The purple dots show the trajectory performed by the robot

Figure 10 shows the comparison of the SARL approach with the three reward functions in a static obstacle scenario. The goal distance in this case is 10 m. The red dots denote the path that the robot must follow. The purple dots show the trajectory performed by the robot. Table 6 shows the results with 5, 10, and 15 m of goal distances. SARL\((R_1)\) and SARL\((R_2)\) take a longer time to reach the goal as compared to SARL\((R_\textrm{wz})\). The performance of SARL\((R_\textrm{wz})\) is better even if the goal is farther away. Figure 10 shows that SARL\((R_1)\) and SARL\((R_2)\) change their trajectory, indicating that the models do not follow the global planner of the Dijkstra algorithm, while the SARL(\(R_\textrm{wz}\)) follows the proposed short-distance goal method. Accordingly, SARL\((R_\textrm{wz})\) demonstrates high performance by achieving the most efficient results in terms of time, path planning, and obstacle avoidance.

Fig. 11 Performance comparison of SARL\((R_1)\), SARL\((R_2)\), SARL\((R_\textrm{wz})\) in dynamic obstacles scenario in Gazebo-ROS. The red dots denote the path that the robot must follow. The purple dots show the trajectory performed by the robot

Fig. 12 Real-world experiments in three scenarios: empty, static obstacles, and with humans

Figure 11 shows the robot navigation with ten dynamic obstacles and a goal distance of 15 m. The red dots denote the path that the robot must follow, and the purple dots show the trajectory performed by the robot. Table 6 shows the navigation results of the three scenarios for goal distances of 5, 10, and 15 m. To avoid the dynamic obstacles, SARL\((R_1)\) makes wide turns or moves continuously in the direction opposite the goal when it perceives an approaching obstacle, as shown in Fig. 11a. The SARL\((R_2)\) model frequently freezes, requiring the robot to rotate in place to find an opportunity to escape, as shown in Fig. 11b. The \(R_\textrm{wz}\) formulation allows the robot to react from a greater distance, preventing it from crossing the circular sector around a dynamic obstacle. As a result, SARL\((R_\textrm{wz})\) demonstrates remarkable performance, avoiding dynamic obstacles while following the global path of the Dijkstra planner.

Table 7 Experimental results in empty, static obstacles and human scenario

Real-time experiments

This section presents the real-time environment for the experimental tests, the robot, and the sensors, along with the packages used to validate the DRL-based autonomous navigation. The robot tries to reach a goal 8 m away in three scenarios: empty space, static obstacles, and humans. The Turtlebot 3, with a maximum velocity of 0.23 m/s, is used with a Hokuyo 2D scanner sensor with a 360-degree resolution for mapping and obstacle detection. Odometry is used to localize the robot's position. The GMapping package [49] is used for mapping, the AMCL package [50] is used in conjunction with the LiDAR and odometry for the robot's localization, and the obstacle-detector package [52] is used to identify static and dynamic obstacles.

In the initial deployment, the robot with SARL\((R_1)\) takes a long detour to the right, which causes it to veer off course and take longer to reach its destination, as shown in Fig. 12a. In comparison, the robot using the proposed SARL\((R_\textrm{wz})\) formulation follows the route directly without detours, resulting in a quicker arrival at the goal.

In the static obstacle scenario shown in Fig. 12b, SARL\((R_1)\) collides with the obstacles and takes longer to reach the goal, whereas with SARL\((R_\textrm{wz})\) the robot avoids the obstacles and takes a collision-free path to its goal.

Figure 12c shows the scenario with humans. The robot with SARL\((R_1)\) is able to respect a comfortable distance and avoid colliding with people. However, on some occasions the robot rotates in place and appears to lose its path, resulting in a long time to reach the goal. Table 7 shows the experimental results in the three scenarios, where SARL\((R_\textrm{wz})\) has a significantly lower arrival time, smoother trajectories, and fewer oscillations than SARL\((R_1)\). SARL\((R_\textrm{wz})\) is the model that performs best in the human–robot coexistence scenario. When the robot perceives a human ahead, it stops and quickly turns in another direction to continue along the desired route. In this way, the robot anticipates collisions and navigates safely, reaching its destination faster.

Conclusions

This manuscript proposes a safe navigation method for mobile robots using deep reinforcement learning. To teach the robot to maintain comfortable distances from humans, the proposed model formulates the reward function for warning zones based on the size and velocity of humans or dynamic obstacles. This allows the robot to anticipate and predict human motion while navigating in a socially acceptable manner. Furthermore, the distance between the robot’s current position and the goal is evaluated and a reward system is implemented that penalizes the robot for moving away from the goal and rewards the robot for moving closer to the goal. The robot is therefore able to reach the goal in a shorter distance, resulting in smoother trajectories and fewer oscillations. The proposed reward function is tested on three state-of-the-art methods, collision avoidance with deep reinforcement learning, long short-term memory, and social attention with reinforcement learning, and also tested in the Gazebo simulator and the real world with robot operating system (ROS) in three scenarios. The results of extensive experiments were evaluated both quantitatively and qualitatively. Compared to previous formulations, the results demonstrated better overall performance in all scenarios, increasing the robot’s ability to reach the goal with the lowest collision rate, maintain safety, and minimize its arrival time. To improve the robot’s navigation policy in the real world, our future study will exploit and improve the policy with more complicated settings of the environment such as dynamic human groups in a high-density crowd. We aim to encode complex interactions using memory-based neural networks to learn navigation policies that help to achieve collision-free and socially appropriate navigation in crowds of human groups.