Multi-objective deep reinforcement learning for crowd-aware robot navigation with dynamic human preference

  • Original Article

Abstract

The growing development of autonomous systems is driving the application of mobile robots in crowded environments. These scenarios often require robots to satisfy multiple conflicting objectives with different relative preferences, such as work efficiency, safety, and smoothness, which makes it difficult for robots to explore policies that optimize several performance criteria at once. In this paper, we propose a multi-objective deep reinforcement learning framework for crowd-aware robot navigation that learns policies over multiple competing objectives whose relative importance, expressed as a human preference, may change dynamically. First, a two-stream structure is introduced to separately extract the spatial and temporal features of pedestrian motion. Second, to learn navigation policies for every possible preference, a multi-objective deep reinforcement learning method is proposed that maximizes a weighted-sum scalarization of the objective functions. We consider path planning and path tracking tasks, which involve the conflicting objectives of collision avoidance, target reaching, and path following. Experimental results demonstrate that our method effectively navigates through crowds in simulated environments while satisfying different task requirements.
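For concreteness, the short Python sketch below illustrates the weighted-sum scalarization of a vectorized reward under a sampled human preference. It is an illustrative sketch only, not the paper's implementation: the objective ordering, the reward values, and the Dirichlet sampling of preferences are assumptions made for this example.

```python
import numpy as np

def scalarize(reward_vec: np.ndarray, preference: np.ndarray) -> float:
    """Scalarized reward omega^T r used to compare actions under a preference."""
    return float(preference @ reward_vec)

def sample_preference(n_objectives: int, rng=None) -> np.ndarray:
    """Sample a preference vector omega from the simplex (illustrative choice)."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.dirichlet(np.ones(n_objectives))

# Example with three objectives: collision avoidance, target reaching, path following.
r_t = np.array([-0.25, 0.10, 0.05])   # vectorized reward r_t (assumed values)
omega = np.array([0.6, 0.3, 0.1])     # human preference omega (sums to 1)
print(scalarize(r_t, omega))          # -> -0.115
```

Under a different preference vector, the same vectorized reward yields a different scalar signal, which is what lets a single learned Q-function be conditioned on the preference.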



Data availability

The datasets generated and analyzed during the current study are available from the corresponding author on reasonable request.

Abbreviations

t, k :

Time indexes

\({\mathcal {S}}\) :

State space

\({\mathcal {A}}\) :

Action space

\({\mathcal {P}}\) :

State transition function

\(\varvec{ r }_{\rm{t}}\) :

Vectorized reward function

\(\mathcal {\gamma }\) :

Discount factor

\(\Omega\) :

Preference space

\(f_{\mathbf {\Omega }}\) :

Preference function

\(\varvec{ \omega }\) :

Human preference

l :

Number of objectives

s :

System state

a :

Control action

\(\varvec{ R }_{\rm{t}}\) :

Return of the MOMDP

\(\pi\) :

Policy function

\({\varvec{V}}^\pi\) :

Value function with policy \(\pi\)

\(\Pi\) :

Set of all possible policies

\(h_{\rm{t}}\) :

Spatiotemporal observation of the lidar

\(o_{\rm{t}}\) :

Spatial observation

\(o_{\rm{max}}\) :

Maximum detecting range of the lidar

\(z_{\rm{t}}\) :

Temporal observation

\(x_{\rm{t}}\) :

Cartesian horizontal coordinate of the robot

\(y_{\rm{t}}\) :

Cartesian vertical coordinate of the robot

\(\theta _{\rm{t}}\) :

Orientation angle of the robot

\(v_{\rm{t}}\) :

Linear velocity of the robot

\(\phi _{\rm{t}}\) :

Angular velocity of the robot

\(\alpha\) :

Attenuation coefficient of the linear velocity

\(\beta\) :

Attenuation coefficient of the angular velocity

\(a_\text{v}\) :

Control action of the linear velocity

\(a_\phi\) :

Control action of the angular velocity

\(\varvec{ d }_g\) :

Distance between the robot and the destination

\(r_\text{p}\) :

Reward function of pedestrian avoidance objective

\(c_\text{p}^1, c_\text{p}^2\) :

Given negative constants in \(r_\text{p}\)

\(d_\text{p}\) :

Distance between the robot and the closest pedestrian

\(d_{\text{p},\text{min}}\) :

Unsafe distance between the robot and the closest pedestrian

\(d_{\text{p},\text{max}}\) :

Safe distance between the robot and the closest pedestrian

\(r_\text{s}\) :

Reward function of static obstacle avoidance objective

\(d_\text{s}\) :

Distance between the robot and the closest obstacle

\(d_{\text{s},\text{min}}\) :

Minimum braking distance for the robot

\(r_\text{g}\) :

Reward function of reaching the destination objective

\(c_\text{g}^1\) :

Given positive constant in \(r_\text{g}\)

\(c_\text{g}^2,c_\text{g}^3\) :

Given negative constants in \(r_\text{g}\)

\(\Delta d_\text{g}\) :

Difference of \(d_\text{g}\) between two successive steps

\(d_{\text{g},\text{g}}\) :

Distance threshold for successfully reaching the destination

\(\varvec{ \omega }_{\text{pp}}\) :

Human preference in the path planning task

\(\varvec{ r }_{\text{pp}}\) :

Total vectorized reward function of the path planning task

\(p_1\) :

Start point of the guidance path in the path tracking task

\(p_2\) :

End point of the guidance path in the path tracking task

\(\varphi _\text{e}\) :

Angle difference between the robot velocity v and the guidance path

\(\Delta \varphi _\text{e}\) :

Difference of \(\varphi _\text{e}\) between two successive steps

\(v_\text{c}\) :

Cross-track linear velocity

\(v_\text{a}\) :

Along-track linear velocity

\(d_\text{e}\) :

Cross-track error between the guidance path and the robot

\(\Delta d_\text{e}\) :

Difference of \(d_\text{e}\) between two successive steps

\(r_\text{f}\) :

Part of reward function of path tracking objective

\(r_\text{a}\) :

Total reward function of path tracking objective

\(c_\text{f}^1,c_\text{f}^2\) :

Given negative constants in \(r_\text{f}\)

\(d_{\text{e},\text{r}}\) :

Maximum allowable distance from the guidance path

\(\varvec{ \omega }_{\text{pt}}\) :

Human preference in the path tracking task

\(\varvec{ r }_\text{pt}\) :

Total vectorized reward function of the path tracking task

\({\varvec{Q}}^\pi\) :

Multi-objective Q-function with policy \(\pi\)

\({\varvec{Q}}^*\) :

Optimal multi-objective Q-function

\({\varvec{T}}\) :

Multi-objective Bellman optimality operator

\({\varvec{H}}\) :

Multi-objective Bellman optimality filter

\(D_{\varvec{\omega }}\) :

Human preference distribution

\(D_{\tau }\) :

Replay buffer

N :

Size of replay buffer

\(\xi\) :

Parameters of the Q-function neural network

\({\hat{\xi }}\) :

Parameters of the target Q-function neural network

M :

Size of the mini-batch of transitions

K :

Size of the mini-batch of human preferences

W :

Mini-batch of human preferences

B :

Target Q-function neural network updating interval

PP:

Path planning

PT:

Path tracking

MODRL:

Multi-objective deep reinforcement learning

MDP:

Markov decision process

MOMDP:

Multi-objective Markov decision process

RL:

Reinforcement learning

CCS:

Convex coverage set

FOV:

Field of view

CNN:

Convolutional neural network

LSTM:

Long short-term memory

HIRL:

Human interactive reinforcement learning

DQN:

Deep Q-network

SFM:

Social force model

SR:

Success rate

CR:

Collision rate

FR:

Fail rate

MT:

Mean time

DF:

Discomfort frequency

ME:

Mean error

STDV:

Standard deviation of speed

ADA:

Average difference of angle

References

  1. Su H, Lallo AD, Murphy RR, Taylor RH, Krieger A (2021) Physical human-robot interaction for clinical care in infectious environments. Nat Mach Intell 3(3):184–186

  2. Chen C, Liu Y, Kreiss S, Alahi A (2019) Crowd-robot interaction: Crowd-aware robot navigation with attention-based deep reinforcement learning. In: International conference on robotics and automation, pp 6015–6022

  3. Fan T, Cheng X, Pan J, Long P, Liu W, Yang R, Manocha D (2019) Getting robots unfrozen and unlost in dense pedestrian crowds. IEEE Robot Autom Lett 4(2):1178–1185

  4. Sathyamoorthy AJ, Patel U, Guan T, Manocha D (2020) Frozone: freezing-free, pedestrian-friendly navigation in human crowds. IEEE Robot Autom Lett 5(3):4352–4359

  5. Trautman P, Krause A (2010) Unfreezing the robot: navigation in dense, interacting crowds. In: 2010 IEEE/RSJ international conference on intelligent robots and systems, pp 797–803

  6. Kayukawa S, Higuchi K, Guerreiro J, Morishima S, Sato Y, Kitani K, Asakawa C (2019) Bbeep: A sonic collision avoidance system for blind travellers and nearby pedestrians. In: CHI conference on human factors in computing systems, pp 1–12

  7. Watanabe A, Ikeda T, Morales Y, Shinozawa K, Miyashita T, Hagita N (2015) Communicating robotic navigational intentions. In: 2015 IEEE/RSJ international conference on intelligent robots and systems, pp 5763–5769

  8. Ferrer G, Zulueta AG, Cotarelo FH, Sanfeliu A (2017) Robot social-aware navigation framework to accompany people walking side-by-side. Auton Robot 41(4):775–793

  9. Van den Berg J, Lin M, Manocha D (2008) Reciprocal velocity obstacles for real-time multi-agent navigation. In: IEEE international conference on robotics and automation, pp 1928–1935

  10. Van den Berg J, Guy SJ, Lin M, Manocha D (2011) Reciprocal n-body collision avoidance. Robot Res 1:3–19

  11. Trautman P, Ma J, Murray RM, Krause A (2013) Robot navigation in dense human crowds: the case for cooperation. In: IEEE international conference on robotics and automation, pp 2153–2160

  12. Yao X, Wang X, Zhang L, Jiang X (2020) Model predictive and adaptive neural sliding mode control for three-dimensional path following of autonomous underwater vehicle with input saturation. Neural Comput Appl 32(22):16875–16889

  13. Wei J, Zhu B (2022) Model predictive control for trajectory-tracking and formation of wheeled mobile robots. Neural Comput Appl 1:1–15

  14. Chen YF, Liu M, Everett M, How JP (2017) Decentralized non-communicating multiagent collision avoidance with deep reinforcement learning. In: 2017 IEEE international conference on robotics and automation, pp 285–292

  15. Everett M, Chen YF, How JP (2018) Motion planning among dynamic, decision-making agents with deep reinforcement learning. In: 2018 IEEE/RSJ international conference on intelligent robots and systems, pp 3052–3059

  16. Chen Y, Liu C, Shi BE, Liu M (2020) Robot navigation in crowds by graph convolutional networks with attention learned from human gaze. IEEE Robot Autom Lett 5(2):2754–2761

  17. Sathyamoorthy AJ, Patel U, Guan T, Manocha D (2020) Frozone: freezing-free, pedestrian-friendly navigation in human crowds. IEEE Robot Autom Lett 5(3):4352–4359

  18. Samsani SS, Muhammad MS (2021) Socially compliant robot navigation in crowded environment by human behavior resemblance using deep reinforcement learning. IEEE Robot Autom Lett 6(3):5223–5230

  19. Nishimura M, Yonetani R (2020) L2b: learning to balance the safety-efficiency trade-off in interactive crowd-aware robot navigation. In: 2020 IEEE/RSJ international conference on intelligent robots and systems, pp 11004–11010

  20. Jain A, Chen D, Bansal D, Scheele S, Kishore M, Sapra H, Kent D, Ravichandar H, Chernova S (2020) Anticipatory human-robot collaboration via multi-objective trajectory optimization. In: 2020 IEEE/RSJ international conference on intelligent robots and systems, pp 11052–11057

  21. Vamplew P, Foale C, Dazeley R (2022) The impact of environmental stochasticity on value-based multiobjective reinforcement learning. Neural Comput Appl 34(3):1783–1799

  22. Xu J, Tian Y, Ma P, Rus D, Sueda S, Matusik W (2020) Prediction-guided multi-objective reinforcement learning for continuous robot control. In: International conference on machine learning, pp 10607–10616

  23. Ferrer G, Sanfeliu A (2019) Anticipative kinodynamic planning: multi-objective robot navigation in urban and dynamic environments. Auton Robot 43(6):1473–1488

  24. Meyer E, Robinson H, Rasheed A, San O (2020) Taming an autonomous surface vehicle for path following and collision avoidance using deep reinforcement learning. IEEE Access 8:41466–41481

  25. Mannor S, Shimkin N (2001) The steering approach for multi-criteria reinforcement learning. In: Advances in neural information processing systems, pp 1563–1570

  26. Natarajan S, Tadepalli P (2005) Dynamic preferences in multi-criteria reinforcement learning. In: International conference on machine learning, pp 601–608

  27. Van Moffaert K, Drugan MM, Nowé A (2013) Scalarized multi-objective reinforcement learning: Novel design techniques. In: 2013 IEEE symposium on adaptive dynamic programming and reinforcement learning (ADPRL), pp 191–199

  28. Mossalam H, Assael YM, Roijers DM, Whiteson S (2016) Multi-objective deep reinforcement learning. arXiv preprint arXiv:1610.02707

  29. Abels A, Roijers D, Lenaerts T, Nowé A, Steckelmacher D (2019) Dynamic weights in multi-objective deep reinforcement learning. In: International conference on machine learning, pp 11–20

  30. Yang R, Sun X, Narasimhan K (2019) A generalized algorithm for multi-objective reinforcement learning and policy adaptation. Adv Neural Inf Process Syst 32:1

  31. Roijers DM, Vamplew P, Whiteson S, Dazeley R (2013) A survey of multi-objective sequential decision-making. J Artif Intell Res 48:67–113

  32. Lopez VG, Lewis FL (2018) Dynamic multiobjective control for continuous-time systems using reinforcement learning. IEEE Trans Autom Control 64(7):2869–2874

  33. Hayes CF, Rădulescu R, Bargiacchi E, Källström J, Macfarlane M, Reymond M, Verstraeten T, Zintgraf LM, Dazeley R, Heintz F et al (2021) A practical guide to multi-objective reinforcement learning and planning. arXiv preprint arXiv:2103.09568

  34. Nishimura M, Yonetani R (2020) L2b: learning to balance the safety-efficiency trade-off in interactive crowd-aware robot navigation. In: 2020 IEEE/RSJ international conference on intelligent robots and systems, pp 11004–11010

  35. Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. arXiv preprint arXiv:1406.2199

  36. Wang Y, He H, Sun C (2018) Learning to navigate through complex dynamic environment with modular deep reinforcement learning. IEEE Trans Games 10(4):400–412

  37. Goodale MA, Milner AD (1992) Separate visual pathways for perception and action. Trends Neurosci 15(1):20–25

  38. Matveev AS, Teimoori H, Savkin AV (2011) Navigation of a unicycle-like mobile robot for environmental extremum seeking. Automatica 47(1):85–91

  39. Chiang H-TL, Faust A, Fiser M, Francis A (2019) Learning navigation behaviors end-to-end with autorl. IEEE Robot Autom Lett 4(2):2007–2014

  40. Schaul T, Quan J, Antonoglou I, Silver D (2015) Prioritized experience replay. arXiv preprint arXiv:1511.05952

  41. Arzate Cruz C, Igarashi T (2020) A survey on interactive reinforcement learning: design principles and open challenges. In: Proceedings of the 2020 ACM designing interactive systems conference, pp 1195–1209

  42. Thomaz AL, Hoffman G, Breazeal C (2005) Real-time interactive reinforcement learning for robots. In: AAAI 2005 workshop on human comprehensible machine learning, pp 9–13

  43. Yu W, Johansson A (2007) Modeling crowd turbulence by many-particle simulations. Phys Rev E 76(4):046105

  44. Helbing D, Buzna L, Johansson A, Werner T (2005) Self-organized pedestrian crowd dynamics: experiments, simulations, and design solutions. Transp Sci 39(1):1–24

  45. Jiang C, Ni Z, Guo Y, He H (2017) Learning human-robot interaction for robot-assisted pedestrian flow optimization. IEEE Trans Syst Man Cybern: Syst 49(4):797–813

  46. Wan Z, Jiang C, Fahad M, Ni Z, Guo Y, He H (2018) Robot-assisted pedestrian regulation based on deep reinforcement learning. IEEE Trans Cybern 50(4):1669–1682

  47. Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529–533

  48. Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980

Funding

The funding was provided by National Natural Science Foundation of China (Grant Nos. 61921004, 62236002, 62173251, 62103104, 62136008), the "Zhishan" Scholars Programs of Southeast University, and the Fundamental Research Funds for the Central Universities.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Changyin Sun.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Social force model

In the experiments, we use the social force model (SFM) to simulate the motion dynamics of pedestrians under human-robot interaction (HRI); the implementation follows existing literature [43,44,45,46]. In the simulated SFM, a pedestrian's motion is driven by both internal motivation and the external forces exerted on the pedestrian. Bold variables denote vectors. The motion dynamics of pedestrian i can be expressed as

$$\begin{aligned} m_i\frac{d{\textbf{v}}_i}{dt}={\textbf{f}}_i(t), \end{aligned}$$
(15)

where \(m_i\) is the mass of pedestrian i, \({\textbf{v}}_i\) is its current velocity, and \({\textbf{f}}_i\) is the resultant force acting on the pedestrian. \({\textbf{f}}_i\) is expressed as

$$\begin{aligned} {\textbf{f}}_i(t) = {\textbf{f}}_{id}(t)+\sum _{j\ne i}{\textbf{f}}_{ij}(t)+\sum _{w}{\textbf{f}}_{iw}(t)+{\textbf{f}}_{ir}(t), \end{aligned}$$
(16)

where \({\textbf{f}}_{id}\) denotes the self-driving force, \({\textbf{f}}_{ij}\), \({\textbf{f}}_{iw}\), and \({\textbf{f}}_{ir}\) are the external forces exerted on the pedestrian from other pedestrians, the walls, and the robot, respectively. The details are expressed as follows:

  1.

    The self-driving force: This force depends on the pedestrian’s internal motivation, which reflects the intention of adjusting his/her direction and velocity to arrive at the destination. Suppose the desired direction is denoted as \({\textbf{e}}_i\), which points from the current position to the destination, and the desired velocity is \(v_i^0\). Then, the self-driving force is given by [43]

    $$\begin{aligned} {\textbf{f}}_{id}(t)=\frac{m_i}{\tau }(v_i^0{\textbf{e}}_i-{\textbf{v}}_i), \end{aligned}$$
    (17)

    where \(\tau\) denotes the relaxation time over which the discrepancy between the desired velocity and the current velocity is reduced.

  2.

    The social force exerted by other pedestrians: This force stems from the repulsive force among pedestrians, which represents their desire to keep a safe distance from nearby humans and obtain more space in crowded environments. It is expressed as [43]

    $$\begin{aligned} {\textbf{f}}_{ij}(t)=F\Theta _{ij}\exp [-d_{ij}/D_0+(D_1/d_{ij})^k]{\textbf{e}}_{ij}, \end{aligned}$$
    (18)

    where F denotes the maximum repulsive force, \(d_{ij}\) is the distance between pedestrian i and pedestrian j, and \({\textbf{e}}_{ij}\) is the normalized vector pointing from pedestrian i to pedestrian j. \(D_0\), \(D_1\), and k are related constant parameters. \(\Theta _{ij}\) reflects the anisotropic character of the repulsive force due to the limited field of each pedestrian and is expressed as

    $$\begin{aligned} \Theta _{ij} = \lambda _i+(1-\lambda _i)\frac{1+\cos (\phi _{ij})}{2}, \end{aligned}$$
    (19)

    where \(\lambda _i\) is a constant parameter; with \(\lambda _i<1\), we can model the fact that pedestrians react much more strongly to things happening in front of them than behind them. \(\phi _{ij}\) is the angle between the desired direction \({\textbf{e}}_i\) and the vector \({\textbf{e}}_{ij}\).

  3.

    The social force exerted by walls: This force reflects that pedestrians want to keep a safe distance from walls in crowded places. This repulsive force is expressed as [44]

    $$\begin{aligned} {\textbf{f}}_{iw}(t) = A_{iw}\exp [(r_i-d_{iw})/B_{iw}]{\textbf{n}}_{iw}, \end{aligned}$$
    (20)

    where \(d_{iw}\) denotes the nearest distance between the pedestrian and the wall, and \({\textbf{n}}_{iw}\) is the direction pointing from the position of the pedestrian to the nearest point of the wall. \(A_{iw}\) and \(B_{iw}\) denote the strength and the range of the respective interaction force. \(r_i\) is the radius of the pedestrian.

  4.

    The interaction force exerted by robot: This force reflects the human-robot interaction force which is modeled from the perspective of social force. It is expressed as [45]

    $$\begin{aligned} {\textbf{f}}_{ir}(t) = A_{ir}\exp [(r_{ir}-d_{ir})/B_{ir}]{\textbf{n}}_{ir}\Theta _{ir}, \end{aligned}$$
    (21)

    where \(d_{ir}\) is the distance between the pedestrian and the robot, and \({\textbf{n}}_{ir}\) is the vector pointing from the pedestrian to the robot. \(A_{ir}\) and \(B_{ir}\) denote the strength and the range of the respective human-robot interaction force. \(r_{ir}\) is the sum of the pedestrian radius \(r_i\) and the robot radius \(r_r\).

The parameters of our implemented model are listed in Table 6 and are chosen based on [46]. \({\mathcal {N}}(\mu ,\sigma ^2)\) denotes the Gaussian distribution with mean \(\mu\) and standard deviation \(\sigma\). A simplified sketch of the resulting pedestrian update is given after the table.

Table 6 List of social force model parameters
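
The minimal Python sketch below assembles Eqs. (15)-(21) into a single explicit Euler update for one pedestrian. It is an illustrative reconstruction under stated assumptions, not the authors' simulator: the numerical parameters are placeholders rather than the Table 6 values, and the unit vectors of the interaction terms are taken in the repulsive direction (away from the neighbouring pedestrian, the wall, and the robot).

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

def anisotropy(e_i, toward, lam=0.3):
    """Theta in Eq. (19): stronger reaction to stimuli in front of the pedestrian.
    `toward` is the unit vector from the pedestrian toward the stimulus."""
    return lam + (1.0 - lam) * (1.0 + np.dot(e_i, toward)) / 2.0

def total_force(p_i, v_i, goal, p_j, p_wall, p_robot,
                m=80.0, tau=0.5, v0=1.2,                 # self-driving term
                F=2000.0, D0=0.31, D1=0.45, k=1.5,       # pedestrian-pedestrian
                A_w=2000.0, B_w=0.08,                    # pedestrian-wall
                A_r=2000.0, B_r=0.08, r_i=0.3, r_r=0.3): # pedestrian-robot
    e_i = unit(goal - p_i)                               # desired direction
    f_drive = m / tau * (v0 * e_i - v_i)                               # Eq. (17)

    d_ij, toward_j = np.linalg.norm(p_j - p_i), unit(p_j - p_i)
    f_ped = (F * anisotropy(e_i, toward_j)
             * np.exp(-d_ij / D0 + (D1 / d_ij) ** k) * (-toward_j))    # Eq. (18)

    d_iw = np.linalg.norm(p_wall - p_i)
    f_wall = A_w * np.exp((r_i - d_iw) / B_w) * unit(p_i - p_wall)     # Eq. (20)

    d_ir, toward_r = np.linalg.norm(p_robot - p_i), unit(p_robot - p_i)
    f_rob = (A_r * np.exp((r_i + r_r - d_ir) / B_r)
             * anisotropy(e_i, toward_r) * (-toward_r))                # Eq. (21)

    return f_drive + f_ped + f_wall + f_rob                            # Eq. (16)

# One explicit Euler step of Eq. (15): m dv/dt = f.
p, v, dt, m = np.zeros(2), np.array([0.8, 0.0]), 0.1, 80.0
f = total_force(p, v, goal=np.array([10.0, 0.0]), p_j=np.array([1.5, 0.5]),
                p_wall=np.array([0.0, -2.0]), p_robot=np.array([2.0, -0.5]))
v = v + dt * f / m
p = p + dt * v
```

In a full simulation this update would be applied to every pedestrian at each time step, with the robot position supplied by the learned navigation policy.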

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Cheng, G., Wang, Y., Dong, L. et al. Multi-objective deep reinforcement learning for crowd-aware robot navigation with dynamic human preference. Neural Comput & Applic 35, 16247–16265 (2023). https://doi.org/10.1007/s00521-023-08385-4


Keywords

Navigation