Abstract
The growing development of autonomous systems is driving the deployment of mobile robots in crowded environments. These scenarios often require robots to satisfy multiple conflicting objectives with different relative preferences, such as work efficiency, safety, and smoothness, which makes it difficult for robots to explore policies that optimize several performance criteria at once. In this paper, we propose a multi-objective deep reinforcement learning framework for crowd-aware robot navigation that learns policies over multiple competing objectives whose relative importance to the robot is dynamic. First, a two-stream structure is introduced to separately extract the spatial and temporal features of pedestrian motion. Second, to learn navigation policies for every possible preference, a multi-objective deep reinforcement learning method is proposed that maximizes a weighted-sum scalarization of the objective functions. We consider path planning and path tracking tasks, which involve the conflicting objectives of collision avoidance, target reaching, and path following. Experimental results demonstrate that our method can effectively navigate through crowds in simulated environments while satisfying different task requirements.
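The weighted-sum scalarization mentioned above reduces a vector of per-objective values to a single score per action using a preference vector. A minimal sketch of this idea in Python follows; the function and variable names are illustrative, not the paper's implementation, and the weights are assumed non-negative and normalized.

```python
import numpy as np

def scalarized_action(q_values, omega):
    """Pick the action maximizing the weighted sum omega^T Q(s, a).

    q_values: (num_actions, num_objectives) multi-objective Q estimates.
    omega:    (num_objectives,) human preference weights, assumed
              non-negative and summing to one.
    Both names are illustrative placeholders, not the paper's API.
    """
    scores = q_values @ omega        # weighted-sum score for each action
    return int(np.argmax(scores))    # greedy action under this preference

# Example: two actions scored against three objectives
# (pedestrian avoidance, target reaching, path following).
Q = np.array([[0.9, 0.1, 0.2],
              [0.2, 0.8, 0.7]])
omega_safety = np.array([0.8, 0.1, 0.1])   # safety-dominant preference
omega_goal = np.array([0.1, 0.8, 0.1])     # goal-dominant preference
```

Changing the preference vector alone changes which action is greedy, which is what lets a single policy adapt to dynamic preferences.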
Data availability
The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request.
Abbreviations
- t, k: Time indices
- \({\mathcal {S}}\): State space
- \({\mathcal {A}}\): Action space
- \({\mathcal {P}}\): State transition function
- \(\varvec{ r }_{\rm{t}}\): Vectorized reward function
- \(\gamma\): Discount factor
- \(\Omega\): Preference space
- \(f_{\mathbf {\Omega }}\): Preference function
- \(\varvec{ \omega }\): Human preference
- l: Number of objectives
- s: System state
- a: Control action
- \(\varvec{ R }_{\rm{t}}\): Return of the MOMDP
- \(\pi\): Policy function
- \({\varvec{V}}^\pi\): Value function with policy \(\pi\)
- \(\Pi\): Set of all possible policies
- \(h_{\rm{t}}\): Spatiotemporal observation of the lidar
- \(o_{\rm{t}}\): Spatial observation
- \(o_{\rm{max}}\): Maximum detection range of the lidar
- \(z_{\rm{t}}\): Temporal observation
- \(x_{\rm{t}}\): Cartesian horizontal coordinate of the robot
- \(y_{\rm{t}}\): Cartesian vertical coordinate of the robot
- \(\theta _{\rm{t}}\): Orientation angle of the robot
- \(v_{\rm{t}}\): Linear velocity of the robot
- \(\phi _{\rm{t}}\): Angular velocity of the robot
- \(\alpha\): Attenuation coefficient of the linear velocity
- \(\beta\): Attenuation coefficient of the angular velocity
- \(a_\text{v}\): Control action of the linear velocity
- \(a_\phi\): Control action of the angular velocity
- \(\varvec{ d }_g\): Distance between the robot and the destination
- \(r_\text{p}\): Reward function of the pedestrian avoidance objective
- \(c_\text{p}^1, c_\text{p}^2\): Given negative constants in \(r_\text{p}\)
- \(d_\text{p}\): Distance between the robot and the closest pedestrian
- \(d_{\text{p},\text{min}}\): Unsafe distance between the robot and the closest pedestrian
- \(d_{\text{p},\text{max}}\): Safe distance between the robot and the closest pedestrian
- \(r_\text{s}\): Reward function of the static obstacle avoidance objective
- \(d_\text{s}\): Distance between the robot and the closest obstacle
- \(d_{\text{s},\text{min}}\): Minimum braking distance for the robot
- \(r_\text{g}\): Reward function of the destination-reaching objective
- \(c_\text{g}^1\): Given positive constant in \(r_\text{g}\)
- \(c_\text{g}^2, c_\text{g}^3\): Given negative constants in \(r_\text{g}\)
- \(\Delta d_\text{g}\): Difference of \(d_g\) between two successive steps
- \(d_{\text{g},\text{g}}\): Success distance for reaching the destination
- \(\varvec{ \omega }_{\text{pp}}\): Human preference in the path planning task
- \(\varvec{ r }_{\text{pp}}\): Total vectorized reward function of the path planning task
- \(p_1\): Start point of the guidance path in the path tracking task
- \(p_2\): End point of the guidance path in the path tracking task
- \(\varphi _\text{e}\): Angle difference between \(v\) and the guidance path
- \(\Delta \varphi _\text{e}\): Difference of \(\varphi _\text{e}\) between two successive steps
- \(v_\text{c}\): Cross-track linear velocity
- \(v_\text{a}\): Along-track linear velocity
- \(d_\text{e}\): Cross-track error between the guidance path and the robot
- \(\Delta d_\text{e}\): Difference of \(d_\text{e}\) between two successive steps
- \(r_\text{f}\): Part of the reward function of the path tracking objective
- \(r_\text{a}\): Total reward function of the path tracking objective
- \(c_\text{f}^1, c_\text{f}^2\): Given negative constants in \(r_\text{f}\)
- \(d_{\text{e},\text{r}}\): Maximum allowable distance from the guidance path
- \(\varvec{ \omega }_{\text{pt}}\): Human preference in the path tracking task
- \(\varvec{ r }_\text{pt}\): Total vectorized reward function of the path tracking task
- \({\varvec{Q}}^\pi\): Multi-objective Q-function with policy \(\pi\)
- \({\varvec{Q}}^*\): Optimal multi-objective Q-function
- \({\varvec{T}}\): Multi-objective Bellman optimality operator
- \({\varvec{H}}\): Multi-objective Bellman optimality filter
- \(D_{\varvec{\omega }}\): Human preference distribution
- \(D_{\tau }\): Replay buffer
- N: Size of the replay buffer
- \(\xi\): Parameters of the Q-function neural network
- \({\hat{\xi }}\): Parameters of the target Q-function neural network
- M: Size of the mini-batch of transitions
- K: Size of the mini-batch of human preferences
- W: Mini-batch of human preferences
- B: Target Q-function neural network updating interval
- PP: Path planning
- PT: Path tracking
- MODRL: Multi-objective deep reinforcement learning
- MDP: Markov decision process
- MOMDP: Multi-objective Markov decision process
- RL: Reinforcement learning
- CCS: Convex coverage set
- FOV: Field of view
- CNN: Convolutional neural network
- LSTM: Long short-term memory
- HIRL: Human interactive reinforcement learning
- DQN: Deep Q-network
- SFM: Social force model
- SR: Success rate
- CR: Collision rate
- FR: Fail rate
- MT: Mean time
- DF: Discomfort frequency
- ME: Mean error
- STDV: Standard deviation of speed
- ADA: Average difference of angle
References
Su H, Lallo AD, Murphy RR, Taylor RH, Krieger A (2021) Physical human-robot interaction for clinical care in infectious environments. Nat Mach Intell 3(3):184–186
Chen C, Liu Y, Kreiss S, Alahi A (2019) Crowd-robot interaction: Crowd-aware robot navigation with attention-based deep reinforcement learning. In: International conference on robotics and automation, pp 6015–6022
Fan T, Cheng X, Pan J, Long P, Liu W, Yang R, Manocha D (2019) Getting robots unfrozen and unlost in dense pedestrian crowds. IEEE Robot Autom Lett 4(2):1178–1185
Sathyamoorthy AJ, Patel U, Guan T, Manocha D (2020) Frozone: freezing-free, pedestrian-friendly navigation in human crowds. IEEE Robot Autom Lett 5(3):4352–4359
Trautman P, Krause A (2010) Unfreezing the robot: navigation in dense, interacting crowds. In: 2010 IEEE/RSJ international conference on intelligent robots and systems, pp 797–803
Kayukawa S, Higuchi K, Guerreiro J, Morishima S, Sato Y, Kitani K, Asakawa C (2019) Bbeep: A sonic collision avoidance system for blind travellers and nearby pedestrians. In: CHI conference on human factors in computing systems, pp 1–12
Watanabe A, Ikeda T, Morales Y, Shinozawa K, Miyashita T, Hagita N (2015) Communicating robotic navigational intentions. In: 2015 IEEE/RSJ international conference on intelligent robots and systems, pp 5763–5769
Ferrer G, Zulueta AG, Cotarelo FH, Sanfeliu A (2017) Robot social-aware navigation framework to accompany people walking side-by-side. Auton Robot 41(4):775–793
Van den Berg J, Lin M, Manocha D (2008) Reciprocal velocity obstacles for real-time multi-agent navigation. In: IEEE international conference on robotics and automation, pp 1928–1935
Van Den Berg J, Guy SJ, Lin M, Manocha D (2011) Reciprocal n-body collision avoidance. Robot Res 1:3–19
Trautman P, Ma J, Murray RM, Krause A (2013) Robot navigation in dense human crowds: the case for cooperation. In: IEEE international conference on robotics and automation, pp 2153–2160
Yao X, Wang X, Zhang L, Jiang X (2020) Model predictive and adaptive neural sliding mode control for three-dimensional path following of autonomous underwater vehicle with input saturation. Neural Comput Appl 32(22):16875–16889
Wei J, Zhu B (2022) Model predictive control for trajectory-tracking and formation of wheeled mobile robots. Neural Comput Appl 1:1–15
Chen YF, Liu M, Everett M, How JP (2017) Decentralized non-communicating multiagent collision avoidance with deep reinforcement learning. In: 2017 IEEE international conference on robotics and automation, pp 285–292
Everett M, Chen YF, How JP (2018) Motion planning among dynamic, decision-making agents with deep reinforcement learning. In: 2018 IEEE/RSJ international conference on intelligent robots and systems, pp 3052–3059
Chen Y, Liu C, Shi BE, Liu M (2020) Robot navigation in crowds by graph convolutional networks with attention learned from human gaze. IEEE Robot Autom Lett 5(2):2754–2761
Sathyamoorthy AJ, Patel U, Guan T, Manocha D (2020) Frozone: freezing-free, pedestrian-friendly navigation in human crowds. IEEE Robot Autom Lett 5(3):4352–4359
Samsani SS, Muhammad MS (2021) Socially compliant robot navigation in crowded environment by human behavior resemblance using deep reinforcement learning. IEEE Robot Autom Lett 6(3):5223–5230
Nishimura M, Yonetani R (2020) L2b: learning to balance the safety-efficiency trade-off in interactive crowd-aware robot navigation. In: 2020 IEEE/RSJ international conference on intelligent robots and systems, pp 11004–11010
Jain A, Chen D, Bansal D, Scheele S, Kishore M, Sapra H, Kent D, Ravichandar H, Chernova S (2020) Anticipatory human-robot collaboration via multi-objective trajectory optimization. In: 2020 IEEE/RSJ international conference on intelligent robots and systems, pp 11052–11057
Vamplew P, Foale C, Dazeley R (2022) The impact of environmental stochasticity on value-based multiobjective reinforcement learning. Neural Comput Appl 34(3):1783–1799
Xu J, Tian Y, Ma P, Rus D, Sueda S, Matusik W (2020) Prediction-guided multi-objective reinforcement learning for continuous robot control. In: International conference on machine learning, pp 10607–10616
Ferrer G, Sanfeliu A (2019) Anticipative kinodynamic planning: multi-objective robot navigation in urban and dynamic environments. Auton Robot 43(6):1473–1488
Meyer E, Robinson H, Rasheed A, San O (2020) Taming an autonomous surface vehicle for path following and collision avoidance using deep reinforcement learning. IEEE Access 8:41466–41481
Mannor S, Shimkin N (2001) The steering approach for multi-criteria reinforcement learning. In: Advances in neural information processing systems, pp 1563–1570
Natarajan S, Tadepalli P (2005) Dynamic preferences in multi-criteria reinforcement learning. In: International conference on machine learning, pp 601–608
Van Moffaert K, Drugan MM, Nowé A (2013) Scalarized multi-objective reinforcement learning: Novel design techniques. In: 2013 IEEE symposium on adaptive dynamic programming and reinforcement learning (ADPRL), pp 191–199
Mossalam H, Assael YM, Roijers DM, Whiteson S (2016) Multi-objective deep reinforcement learning. arXiv preprint arXiv:1610.02707
Abels A, Roijers D, Lenaerts T, Nowé A, Steckelmacher D (2019) Dynamic weights in multi-objective deep reinforcement learning. In: International conference on machine learning, pp 11–20
Yang R, Sun X, Narasimhan K (2019) A generalized algorithm for multi-objective reinforcement learning and policy adaptation. Adv Neural Inf Process Syst 32:1
Roijers DM, Vamplew P, Whiteson S, Dazeley R (2013) A survey of multi-objective sequential decision-making. J Artif Intell Res 48:67–113
Lopez VG, Lewis FL (2018) Dynamic multiobjective control for continuous-time systems using reinforcement learning. IEEE Trans Autom Control 64(7):2869–2874
Hayes CF, Rădulescu R, Bargiacchi E, Källström J, Macfarlane M, Reymond M, Verstraeten T, Zintgraf LM, Dazeley R, Heintz F et al (2021) A practical guide to multi-objective reinforcement learning and planning. arXiv preprint arXiv:2103.09568
Nishimura M, Yonetani R (2020) L2b: learning to balance the safety-efficiency trade-off in interactive crowd-aware robot navigation. In: 2020 IEEE/RSJ international conference on intelligent robots and systems, pp 11004–11010
Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. arXiv preprint arXiv:1406.2199
Wang Y, He H, Sun C (2018) Learning to navigate through complex dynamic environment with modular deep reinforcement learning. IEEE Trans Games 10(4):400–412
Goodale MA, Milner AD (1992) Separate visual pathways for perception and action. Trends Neurosci 15(1):20–25
Matveev AS, Teimoori H, Savkin AV (2011) Navigation of a unicycle-like mobile robot for environmental extremum seeking. Automatica 47(1):85–91
Chiang H-TL, Faust A, Fiser M, Francis A (2019) Learning navigation behaviors end-to-end with autorl. IEEE Robot Autom Lett 4(2):2007–2014
Schaul T, Quan J, Antonoglou I, Silver D (2015) Prioritized experience replay. arXiv preprint arXiv:1511.05952
Arzate Cruz C, Igarashi T (2020) A survey on interactive reinforcement learning: design principles and open challenges. In: Proceedings of the 2020 ACM designing interactive systems conference, pp 1195–1209
Thomaz AL, Hoffman G, Breazeal C (2005) Real-time interactive reinforcement learning for robots. In: AAAI 2005 workshop on human comprehensible machine learning, pp 9–13
Yu W, Johansson A (2007) Modeling crowd turbulence by many-particle simulations. Phys Rev E 76(4):046105
Helbing D, Buzna L, Johansson A, Werner T (2005) Self-organized pedestrian crowd dynamics: experiments, simulations, and design solutions. Transp Sci 39(1):1–24
Jiang C, Ni Z, Guo Y, He H (2017) Learning human-robot interaction for robot-assisted pedestrian flow optimization. IEEE Trans Syst Man Cybern: Syst 49(4):797–813
Wan Z, Jiang C, Fahad M, Ni Z, Guo Y, He H (2018) Robot-assisted pedestrian regulation based on deep reinforcement learning. IEEE Trans Cybern 50(4):1669–1682
Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529–533
Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980
Funding
The funding was provided by National Natural Science Foundation of China (Grant Nos. 61921004, 62236002, 62173251, 62103104, 62136008), the "Zhishan" Scholars Programs of Southeast University, and the Fundamental Research Funds for the Central Universities.
Author information
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix: Social force model
In the experiments, we use the Social Force Model (SFM) to simulate the motion dynamics of pedestrians under human-robot interaction (HRI), implemented based on existing literature [43,44,45,46]. In the simulated SFM, the factors that influence a pedestrian's motion include his/her internal motivation and the external forces exerted on him/her. Bold variables denote vectors. The motion dynamics of pedestrian i can be expressed as
$$\begin{aligned} m_i\frac{\text {d}{\textbf{v}}_i(t)}{\text {d}t}={\textbf{f}}_i(t), \end{aligned}$$(16)
where \(m_i\) is the mass of pedestrian i, \({\textbf{v}}_i\) is the current velocity, and \({\textbf{f}}_i\) is the acceleration force of the pedestrian. \({\textbf{f}}_i\) is expressed as
$$\begin{aligned} {\textbf{f}}_i(t)={\textbf{f}}_{id}(t)+\sum _{j\ne i}{\textbf{f}}_{ij}(t)+{\textbf{f}}_{iw}(t)+{\textbf{f}}_{ir}(t), \end{aligned}$$
where \({\textbf{f}}_{id}\) denotes the self-driving force, and \({\textbf{f}}_{ij}\), \({\textbf{f}}_{iw}\), and \({\textbf{f}}_{ir}\) are the external forces exerted on the pedestrian by other pedestrians, the walls, and the robot, respectively. The details are as follows:
-
1.
The self-driving force: This force depends on the pedestrian’s internal motivation, which reflects the intention of adjusting his/her direction and velocity to arrive at the destination. Suppose the desired direction is denoted as \({\textbf{e}}_i\), which points from the current position to the destination, and the desired velocity is \(v_i^0\). Then, the self-driving force is given by [43]
$$\begin{aligned} {\textbf{f}}_{id}(t)=\frac{m_i}{\tau }(v_i^0{\textbf{e}}_i-{\textbf{v}}_i), \end{aligned}$$(17)where \(\tau\) denotes the relaxation time within which the pedestrian adapts his/her current velocity toward the desired velocity.
-
2.
The social force exerted by other pedestrians: This force stems from the repulsive force among pedestrians, which represents their desire to keep a safe distance from nearby humans and obtain more space in crowded environments. It is expressed as [43]
$$\begin{aligned} {\textbf{f}}_{ij}(t)=F\Theta _{ij}\exp [-d_{ij}/D_0+(D_1/d_{ij})^k]{\textbf{e}}_{ij}, \end{aligned}$$(18)where F denotes the maximum repulsive force, \(d_{ij}\) is the distance between pedestrian i and pedestrian j, and \({\textbf{e}}_{ij}\) is the normalized vector pointing from pedestrian i to pedestrian j. \(D_0\), \(D_1\), and k are constant parameters. \(\Theta _{ij}\) reflects the anisotropic character of the repulsive force due to the limited field of view of each pedestrian and is expressed as
$$\begin{aligned} \Theta _{ij} = \lambda _i+(1-\lambda _i)\frac{1+\cos (\phi _{ij})}{2}, \end{aligned}$$(19)where \(\lambda _i\) is a constant parameter; with \(\lambda _i<1\), we can model the situation that pedestrians react much more strongly to things happening in front of them than behind them. \(\phi _{ij}\) is the angle between the desired direction \({\textbf{e}}_i\) and the relative direction \({\textbf{e}}_{ij}\).
-
3.
The social force exerted by walls: This force reflects that pedestrians want to keep a safe distance from walls in crowded places. This repulsive force is expressed as [44]
$$\begin{aligned} {\textbf{f}}_{iw}(t) = A_{iw}\exp [(r_i-d_{iw})/B_{iw}]{\textbf{n}}_{iw}, \end{aligned}$$(20)where \(d_{iw}\) denotes the nearest distance between the pedestrian and the wall, and \({\textbf{n}}_{iw}\) is the direction pointing from the pedestrian's position to the nearest point of the wall. \(A_{iw}\) and \(B_{iw}\) denote the strength and range of the respective interaction force, and \(r_i\) is the radius of the pedestrian.
-
4.
The interaction force exerted by robot: This force reflects the human-robot interaction force which is modeled from the perspective of social force. It is expressed as [45]
$$\begin{aligned} {\textbf{f}}_{ir}(t) = A_{ir}\exp [(r_{ir}-d_{ir})/B_{ir}]{\textbf{n}}_{ir}\Theta _{ir}, \end{aligned}$$(21)where \(d_{ir}\) is the distance between the pedestrian and the robot, and \({\textbf{n}}_{ir}\) is the vector pointing from the pedestrian to the robot. \(A_{ir}\) and \(B_{ir}\) denote the strength and range of the respective human-robot interaction force, and \(r_{ir}\) is the sum of the pedestrian radius \(r_i\) and the robot radius \(r_r\).
The parameters of our implemented model are shown in Table 6 and are chosen based on [46]. \({\mathcal {N}}(\mu ,\sigma ^2)\) denotes the Gaussian distribution with mean \(\mu\) and standard deviation \(\sigma\).
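As a concrete illustration, one Euler step of the pedestrian dynamics above can be sketched in Python. This is a simplified sketch, not the paper's implementation: only the self-driving force (Eq. 17) and the pedestrian-pedestrian repulsion (Eqs. 18-19) are included, the wall and robot forces (Eqs. 20-21) share the same exponential form, and all parameter values below are illustrative placeholders rather than the values of Table 6.

```python
import numpy as np

def social_force_step(pos_i, vel_i, goal_i, others, dt=0.1,
                      m=80.0, tau=0.5, v0=1.3,
                      F=2000.0, D0=0.3, D1=0.2, k=2.0, lam=0.3):
    """One Euler step of a simplified Social Force Model pedestrian.

    pos_i, vel_i, goal_i: (2,) arrays for the pedestrian's position,
    velocity, and destination; others: list of (2,) positions of the
    remaining pedestrians. Parameter values are illustrative only.
    """
    # Eq. (17): self-driving force toward the goal at desired speed v0.
    e_i = goal_i - pos_i
    e_i = e_i / np.linalg.norm(e_i)
    f = m / tau * (v0 * e_i - vel_i)

    # Eqs. (18)-(19): anisotropic repulsion from each other pedestrian.
    for pos_j in others:
        d_vec = pos_j - pos_i                   # points from i to j
        d_ij = np.linalg.norm(d_vec)
        e_ij = d_vec / d_ij
        cos_phi = float(np.dot(e_i, e_ij))      # cos of angle between e_i, e_ij
        theta = lam + (1.0 - lam) * (1.0 + cos_phi) / 2.0       # Eq. (19)
        mag = F * theta * np.exp(-d_ij / D0 + (D1 / d_ij) ** k)  # Eq. (18)
        f -= mag * e_ij                         # repulsion pushes i away from j

    # Eq. (16): integrate the acceleration f / m over one time step.
    vel_next = vel_i + f / m * dt
    pos_next = pos_i + vel_next * dt
    return pos_next, vel_next
```

With a distant neighbor the repulsion term is negligible and the pedestrian simply accelerates toward the goal, which matches the relaxation behavior of Eq. (17).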
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Cheng, G., Wang, Y., Dong, L. et al. Multi-objective deep reinforcement learning for crowd-aware robot navigation with dynamic human preference. Neural Comput & Applic 35, 16247–16265 (2023). https://doi.org/10.1007/s00521-023-08385-4