Offline policy reuse-guided anytime online collective multiagent planning and its application to mobility-on-demand systems

Published in Autonomous Agents and Multi-Agent Systems (2024).

Abstract

The popularity of mobility-on-demand (MoD) systems has spurred interest in online collective multiagent planning (Online_CMP), where spatially distributed service agents are planned to meet dynamically arriving demands. For city-scale MoD systems with large fleets of agents, Online_CMP methods must trade off computation time (i.e., real-time operation) against solution quality (i.e., the number of demands served). Directly executing an offline policy guarantees real-time operation, but the policy cannot adapt to the actual agent and demand distributions. Search-based online planning methods are adaptive, but they are computationally expensive and do not scale. In this paper, we propose a principled Online_CMP method that reuses and improves the offline policy in an anytime manner. We first model MoD systems as a collective Markov decision process (\({\mathbb {C}}\)-MDP), in which the collective behavior of agents determines the joint reward. Given the \({\mathbb {C}}\)-MDP model, we propose a novel state value function for evaluating a policy, and a gradient ascent (GA) technique for improving it. We further show that offline GA-based policy iteration (GA-PI) converges to the global optimum of the \({\mathbb {C}}\)-MDP under certain conditions. Finally, given real-time information, the offline policy serves as the default plan, and GA-PI improves on it to generate an online plan. Experimental results show that our offline policy reuse-guided Online_CMP method significantly outperforms standard online multiagent planning methods on MoD systems such as ride-sharing and security traffic patrolling, in terms of both computation time and solution quality.
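To give the abstract's pipeline a concrete shape, the following minimal sketch runs GA-based policy iteration on a toy collective MDP. Everything here is an illustrative assumption rather than the authors' implementation (see the code repository referenced in the notes for that): the zone counts and demands are invented, a mean-field rollout stands in for the paper's collective state value function, a finite-difference gradient stands in for whatever gradient form the paper derives, and a uniform policy stands in for the offline default policy.

```python
import numpy as np

# Toy collective MDP: a fleet of anonymous, homogeneous agents spread over
# Z zones must cover a time-varying demand over a horizon of H steps.
Z, H = 4, 3
demand = np.array([[5.0, 1.0, 2.0, 8.0]] * H)  # demand per zone, per step

def rollout(pi, n0):
    """Expected number of demands served by policy pi from counts n0.

    pi[t, z] is a distribution over target zones for agents located in
    zone z at step t; because agents are interchangeable, the collective
    state evolves as a flow of counts (a mean-field approximation).
    """
    n, served = n0.astype(float), 0.0
    for t in range(H):
        n = n @ pi[t]                              # redistribute agents
        served += np.minimum(n, demand[t]).sum()   # demands actually met
    return served

def renormalize(pi):
    """Keep every row of pi a valid probability distribution."""
    pi = np.clip(pi, 1e-9, None)
    return pi / pi.sum(axis=-1, keepdims=True)

def ga_pi(n0, iters=150, lr=0.05, eps=1e-4):
    """Gradient-ascent policy iteration: evaluate, ascend, renormalize."""
    pi = np.ones((H, Z, Z)) / Z                    # stand-in default policy
    for _ in range(iters):
        base = rollout(pi, n0)
        grad = np.zeros_like(pi)
        for idx in np.ndindex(*pi.shape):          # finite-difference grad
            pert = pi.copy()
            pert[idx] += eps
            grad[idx] = (rollout(renormalize(pert), n0) - base) / eps
        pi = renormalize(pi + lr * grad)           # improve the policy
    return pi

n0 = np.array([4.0, 4.0, 4.0, 4.0])               # 16 agents over 4 zones
pi = ga_pi(n0)
print("expected demands served:", rollout(pi, n0))
```

In the online setting the abstract describes, the uniform initialization above would be replaced by the precomputed offline policy, and the GA-PI loop would run under an anytime computation budget, returning the best plan found when time runs out.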


Availability of data and materials

The dataset used in this work will be open-sourced once the paper is accepted.

Notes

  1. Our method can be easily extended to general settings in which agents have a randomized distribution over states.

  2. From here on, we assume no discounting, although for completeness the algorithm retains the possibility of discounting; a discounted value backup is sketched after these notes.

  3. https://github.com/TrafficRun.
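
As a concrete reading of note 2, here is the policy-evaluation backup in generic MDP notation with an optional discount factor \(\gamma \); this notation is an illustrative assumption, and the paper's collective state value function differs in detail:

\[ V^{\pi }(s_t) = r\bigl (s_t, \pi (s_t)\bigr ) + \gamma \, {\mathbb {E}}_{s_{t+1}}\bigl [ V^{\pi }(s_{t+1}) \bigr ]. \]

Setting \(\gamma = 1\) recovers the undiscounted case assumed in the discussion, while \(\gamma < 1\) gives the discounted variant that the algorithm retains.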


Acknowledgements

This research was supported in part by the Key Research and Development Projects in Jiangsu Province under Grant BE2021001-2, and in part by the National Natural Science Foundation of China (62076060, 62072099, 61932007, 61806053).

Author information

Contributions

Wanyuan Wang and Qian Che designed the algorithm and implemented the experiments; Yifeng Zhou, Bo An, and Yichuan Jiang wrote the main manuscript text; and Weiwei Wu analyzed the algorithm.

Corresponding author

Correspondence to Qian Che.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest related to this work.

Ethical approval

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wang, W., Che, Q., Zhou, Y. et al. Offline policy reuse-guided anytime online collective multiagent planning and its application to mobility-on-demand systems. Auton Agent Multi-Agent Syst 38, 19 (2024). https://doi.org/10.1007/s10458-024-09650-z
