Planning and Learning in Multi-Agent Path Finding

Multi-agent path finding arises, on the one hand, in numerous applied areas. A classical example is automated warehouses with a large number of mobile goods-sorting robots operating simultaneously. On the other hand, for this problem, there are no universal solution methods that simultaneously satisfy numerous (often contradictory) requirements. Examples of such requirements are a guarantee of finding optimal solutions, high-speed operation, the possibility of operation in partially observable environments, etc. This paper provides a survey of modern methods for multi-agent path finding. Special attention is given to various settings of the problem. The differences between learnable and nonlearnable solution methods and their applicability are discussed. Experimental programming environments necessary for implementing learnable approaches are analyzed separately.


INTRODUCTION
In the general form, the problem of multi-agent path finding is stated as follows. A group of mobile agents (e.g., mobile robots or virtual persons) operates in common space. Each agent is to move to a known goal position, avoiding collisions with the other agents or static and stochastic obstacles. Recently, interest in methods for solving this problem has increased significantly, mainly due to their applications in warehouse and service robotics [1] and in intelligent transportation systems [2].
Various assumptions made at the stage of problem formalization have a significant effect on the choice of solution methods. For example, one of the most widespread and actively studied formalizations is classical multi-agent path finding (classical MAPF). In this problem, it is assumed that there is a centralized controller that has complete information on the state of the environment and all agents (full observability). Time is assumed to be discrete, i.e., at every time step, an agent can perform a single action, namely, a move action or a wait action. The space is discretized in the form of a graph, i.e., the agents are assumed to move only along edges of an a priori given graph and to perform wait actions only at its nodes. Four-connected graphs (grids) are usually used in practice [3]. There are numerous variations of this graph-based centralized setting of the problem. For example, a variant in which the agents' goals are not fixed, i.e., the distribution of agents over goal positions is part of the problem solution, is considered in [4]. In [5], it is assumed that each agent can have several goals and has to visit them sequentially. A lifelong problem is considered in [6]: after an agent reaches its goal, it is immediately assigned another (previously unknown) goal. Overall, despite the differences in formulation details, centralized variants of multi-agent path finding are usually solved by applying classical nonlearnable algorithms based either on heuristic search in the state space (in some form) [8][9][10] or on the reduction of the original problem to classical ones in computer science, for example, to the satisfiability of Boolean formulas (SAT) [11] or a network flow problem [12].
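To make the classical setting described above concrete, the following minimal Python sketch checks a set of paths on a four-connected grid for the two conflict types commonly considered in classical MAPF: vertex conflicts (two agents in the same cell at the same time step) and edge conflicts (two agents swapping cells along one edge). The names position_at and find_conflicts, the (row, col) cell representation, and the convention that an agent waits at its last cell after finishing are assumptions made for illustration; they are not taken from the cited works.

```python
from itertools import combinations


def position_at(path, t):
    """Cell occupied at time step t (assumption: the agent waits at its last cell)."""
    return path[min(t, len(path) - 1)]


def find_conflicts(paths):
    """Return a list of (kind, agent_i, agent_j, time_step) tuples.

    paths: one path per agent; each path is a list of (row, col) cells,
    one cell per discrete time step.
    """
    conflicts = []
    horizon = max(len(p) for p in paths)
    for i, j in combinations(range(len(paths)), 2):
        for t in range(horizon):
            a_now, b_now = position_at(paths[i], t), position_at(paths[j], t)
            if a_now == b_now:
                conflicts.append(("vertex", i, j, t))
            if t + 1 < horizon:
                a_next = position_at(paths[i], t + 1)
                b_next = position_at(paths[j], t + 1)
                # Edge (swapping) conflict: the agents exchange cells.
                if a_now != b_now and a_now == b_next and b_now == a_next:
                    conflicts.append(("edge", i, j, t))
    return conflicts
```

A set of paths is a valid solution in this simplified sense when find_conflicts returns an empty list.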
In addition to multi-agent path finding problems that assume full observability and centralized control, an alternative setting is also of interest, including in applications: there is no centralized controller, and agents can observe the environment (including other agents) only within a certain radius around them (so-called partial observability). This problem is naturally formalized as sequential decision making, in which at every time step each agent chooses a single action relying on the current observation (and, possibly, on the history of observations and interactions with the environment). Accordingly, the problem in this setting is usually solved by applying reinforcement learning methods [13]. In what follows, methods for solving both classes of multi-agent path finding problems are considered in more detail.
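As an illustration of partial observability, the sketch below builds the local observation of one agent on a grid map. It is a minimal example under stated assumptions: the observation is a square window of a given radius centered on the agent, cells outside the map are treated as obstacles, and the observation consists of two binary matrices (obstacles and other agents). The function name local_observation and this encoding are hypothetical and do not reproduce the observation format of any particular method discussed here.

```python
def local_observation(grid, positions, agent, radius):
    """Return (obstacle_view, agent_view) for one agent.

    grid: list of rows with 0 (free) or 1 (obstacle);
    positions: list of (row, col) cells, one per agent;
    agent: index of the observing agent;
    radius: half-size of the square observation window.
    """
    rows, cols = len(grid), len(grid[0])
    r0, c0 = positions[agent]
    others = {p for k, p in enumerate(positions) if k != agent}
    obstacle_view, agent_view = [], []
    for r in range(r0 - radius, r0 + radius + 1):
        obstacle_row, agent_row = [], []
        for c in range(c0 - radius, c0 + radius + 1):
            inside = 0 <= r < rows and 0 <= c < cols
            # Assumption: everything beyond the map border looks like an obstacle.
            obstacle_row.append(grid[r][c] if inside else 1)
            agent_row.append(1 if (r, c) in others else 0)
        obstacle_view.append(obstacle_row)
        agent_view.append(agent_row)
    return obstacle_view, agent_view
```

Agents outside the window simply do not appear in agent_view, which is what distinguishes this setting from the fully observable one.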

NONLEARNABLE (CLASSICAL) METHODS
Nonlearnable methods for multi-agent path finding are usually used in the case of full observability, a centralized controller, and graph discretization of the agents' working space. The task is to construct a set of non-conflicting trajectories, namely, paths on a graph, including possible wait actions at vertices. It is well known that, on the one hand, in the case of an undirected graph, this problem can be solved in polynomial time [14]. On the other hand, obtaining optimal solutions is NP-hard [15]. If the graph is directed, then even obtaining a nonoptimal solution is an NP-hard problem [16].
There are solution methods based on reducing this problem to other well-known problems in computer science. For example, multi-agent path finding (MAPF) is reduced to SAT in [11], to an integer programming problem in [17], and to a network flow problem in [12]. Among these methods, those reducing MAPF to SAT are the most widespread. The likely cause is that numerous efficient solvers are available for SAT; as a result, the speed of solving the original problem is also fairly high. The following analogy is worth mentioning. Under certain assumptions, MAPF can be treated as a generalization of the 15-puzzle. This view is used in modern algorithms, for example, in Push and Rotate [18], designed for quickly obtaining nonoptimal solutions.
Another approach to the solution of MAPF is based on algorithms involving direct search on a graph. Heuristic versions of search are used to improve its efficiency. A classical heuristic search algorithm is A* from [7]. With certain modifications, it can be used to find optimal MAPF solutions [8], but overall this approach is not very efficient, since, in fact, it treats all agents as a single meta-agent and carries out search in a combined space with a branching factor that grows exponentially with the number of agents. To avoid a combinatorial explosion, various decoupled search techniques are applied. Examples are the algorithms CBS [9] and M* [10]. Both guarantee the optimality of found solutions and have a variety of modifications, including ones aimed at improving computational efficiency while preserving the optimality of the solution [19,20]; modifications that trade off optimality against computational efficiency [21]; and modifications solving MAPF under milder constraints, for example, in continuous time [22].
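The growth of the branching factor in the combined space can be seen in a small sketch. On a four-connected grid each agent has at most five actions (wait and four moves), so a joint state of n agents has up to 5^n successors before conflicting ones are discarded. The code below is only an illustration of this joint expansion and not the search procedure of [8]; the names MOVES, agent_successors, and joint_successors are assumptions, and the agents in a joint state are assumed to occupy distinct cells.

```python
from itertools import product

MOVES = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]  # wait, up, down, left, right


def agent_successors(grid, cell):
    """Cells reachable by one agent in one time step (including waiting)."""
    rows, cols = len(grid), len(grid[0])
    result = []
    for dr, dc in MOVES:
        r, c = cell[0] + dr, cell[1] + dc
        if 0 <= r < rows and 0 <= c < cols and grid[r][c] == 0:
            result.append((r, c))
    return result


def joint_successors(grid, joint_state):
    """Expand a joint state (tuple of cells, one per agent) as a meta-agent would."""
    per_agent = [agent_successors(grid, cell) for cell in joint_state]
    n = len(joint_state)
    successors = []
    for candidate in product(*per_agent):
        if len(set(candidate)) < n:
            continue  # vertex conflict: two agents end in the same cell
        swap = any(
            candidate[i] == joint_state[j] and candidate[j] == joint_state[i]
            for i in range(n)
            for j in range(i + 1, n)
        )
        if not swap:  # edge conflict: two agents exchange cells
            successors.append(candidate)
    return successors
```

For agents that are far apart on an empty map, the number of successors is exactly 5, 25, 125, and so on as agents are added, which is the combinatorial explosion that decoupled techniques try to avoid.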
Another approach to MAPF solution based on heuristic search is prioritized planning [23]. In this case, each agent is assigned a priority, and individual paths are then planned one agent at a time in priority order. All earlier planned trajectories are considered unchangeable (in other words, they act as dynamic obstacles for the current agent). Theoretically, this approach does not guarantee optimality; moreover, it does not even guarantee that a solution of the problem will be found if it exists. Nevertheless, such a guarantee can be given for a certain class of problems [24]. Moreover, in practice, prioritized algorithms find close-to-optimal solutions in numerous instances, while spending far fewer computational resources. That is why algorithms of this class are often used in robotics [25].
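The idea of prioritized planning can be sketched as follows. This is a minimal illustration and not the algorithm of [23]: the agent index is used as the priority, each agent is planned by A* in the space-time domain with a Manhattan heuristic, and the paths of higher-priority agents are stored as reservations. The names plan_single and prioritized_planning, the reservation data structures, and the max_time cutoff are assumptions made for this example.

```python
import heapq

MOVES = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]  # wait, up, down, left, right


def plan_single(grid, start, goal, cells, edges, parked, max_time):
    """Space-time A* for one agent that treats earlier plans as dynamic obstacles.

    cells: set of (cell, t) occupied by higher-priority agents;
    edges: set of (from_cell, to_cell, t) moves made between t and t + 1;
    parked: dict mapping a goal cell to the time from which it stays occupied.
    """
    def h(cell):
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    def free(cell, t):
        return (cell, t) not in cells and not (cell in parked and t >= parked[cell])

    # The agent may stop at its goal only after nobody else needs that cell.
    last_use = max((t for (cell, t) in cells if cell == goal), default=-1)
    open_list = [(h(start), 0, start, [start])]
    seen = set()
    while open_list:
        _, t, cell, path = heapq.heappop(open_list)
        if cell == goal and t > last_use:
            return path
        if (cell, t) in seen or t >= max_time:
            continue
        seen.add((cell, t))
        for dr, dc in MOVES:
            nxt = (cell[0] + dr, cell[1] + dc)
            inside = 0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])
            if not inside or grid[nxt[0]][nxt[1]] == 1:
                continue
            if not free(nxt, t + 1) or (nxt, cell, t) in edges:
                continue  # vertex conflict or swap with an earlier plan
            heapq.heappush(open_list, (t + 1 + h(nxt), t + 1, nxt, path + [nxt]))
    return None


def prioritized_planning(grid, starts, goals, max_time=100):
    """Plan agents one by one; returns a list of paths or None on failure."""
    cells, edges, parked, paths = set(), set(), {}, []
    for start, goal in zip(starts, goals):
        path = plan_single(grid, start, goal, cells, edges, parked, max_time)
        if path is None:
            return None  # no path found under the current priority order
        for t, cell in enumerate(path):
            cells.add((cell, t))
            if t + 1 < len(path):
                edges.add((cell, path[t + 1], t))
        parked[goal] = len(path) - 1  # the agent stays at its goal afterwards
        paths.append(path)
    return paths
```

On an empty 3 by 3 map with two agents exchanging the corners of the top row, the second agent is forced to take a detour around the path of the first one, which shows why the result is generally not optimal.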

LEARNABLE METHODS
There are several variants of using machine learning methods in the context of MAPF with full observability and a centralized controller. First, these methods can be used to select the MAPF algorithm most suitable for a particular problem (map, positions of agents) [26,27]. Second, machine learning methods can be used to learn various heuristic selection rules involved in classical MAPF-solving algorithms [28,29]. In recent years, reinforcement learning methods have become widespread. They are able to solve MAPF in decentralized and partially observable settings. One of the first works in this direction was [30], where a learning strategy called PRIMAL was presented. Later, it was improved and generalized to lifelong path finding [31], in which an agent, after reaching its goal, does not finish the current episode but receives a new task. Both algorithms used demonstration trajectories generated by the ODrM* search algorithm [32]. Algorithms of the PRIMAL family use a complicated reward function and make a large number of specific assumptions concerning conditions and maps (domain knowledge), for example, an additional penalty for conflicts or the assumption that a local observation includes not only the positions of other agents, but also their goals. Similar assumptions were used in [33], which proposes another learnable algorithm for MAPF, but in the case of more complicated dynamic models of agents (e.g., quadcopters). Learnable methods that use complete information on static elements of the environment (global information on the positions of other agents is not available to them) were proposed in [34,35].
In addition to algorithms developed specially for MAPF, there are universal approaches of multi-agent reinforcement learning that can be used to solve MAPF. Among the wide variety of classical single-agent learning algorithms (applied to each agent separately, which is called independent learning), a well-established choice for partially observable multi-agent problems is the policy gradient approach, a popular implementation of which is known as proximal policy optimization (PPO) [36][37][38]. Another direction is centralized training of cooperative policies. Algorithms of this type are usually trained in a centralized fashion, using global information on the environment, while their execution at test time is decentralized. For example, QMIX [39] uses hyper-networks for training individual policies via a mixing utility network. In training, each agent network receives only a local observation, and it is optimized through a hyper-network, to which the global state is available. In the learning algorithms MADDPG [40] (off-policy learning) and MAPPO [41] (on-policy learning), the critic is a centralized network. This is a general learning approach in which the critic uses the global state of the environment to better train a utility function approximator. A policy-determining actor receives, as input, only a partial observation, but implicitly uses global information through the critic's estimates. The FACMAC algorithm [42] is a combination of MADDPG and QMIX, so it can be used both for discrete actions (using the Gumbel-Softmax trick) and for continuous actions.
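The mixing scheme mentioned above for QMIX can be illustrated by a small forward-pass sketch. In QMIX the mixing weights are produced by hyper-networks from the global state and are constrained to be nonnegative, so that the joint value never decreases when an individual agent value increases. The code below shows only this constraint, in plain Python with random, untrained weights; the class name MonotonicMixer, the layer sizes, and the use of single linear maps as hyper-networks are simplifying assumptions and do not reproduce the architecture or the training procedure of [39].

```python
import math
import random


def elu(x):
    """ELU activation, a monotonically increasing nonlinearity."""
    return x if x > 0 else math.exp(x) - 1.0


def linear(weights, vector):
    """Matrix-vector product; weights is a list of rows."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


class MonotonicMixer:
    """Two-layer mixer whose weights are generated from the global state."""

    def __init__(self, n_agents, state_dim, hidden=4, seed=0):
        rng = random.Random(seed)

        def rand(rows, cols):
            return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

        self.n_agents, self.hidden = n_agents, hidden
        # Hyper-networks (assumption: plain linear maps of the global state).
        self.hyper_w1 = rand(n_agents * hidden, state_dim)
        self.hyper_b1 = rand(hidden, state_dim)
        self.hyper_w2 = rand(hidden, state_dim)
        self.hyper_b2 = rand(1, state_dim)

    def q_total(self, agent_qs, state):
        """Mix per-agent values into a joint value given the global state."""
        # Absolute values keep the mixing weights nonnegative.
        w1 = [abs(w) for w in linear(self.hyper_w1, state)]
        b1 = linear(self.hyper_b1, state)
        w2 = [abs(w) for w in linear(self.hyper_w2, state)]
        b2 = linear(self.hyper_b2, state)[0]
        hidden = [
            elu(sum(agent_qs[i] * w1[i * self.hidden + k]
                    for i in range(self.n_agents)) + b1[k])
            for k in range(self.hidden)
        ]
        return sum(h * w for h, w in zip(hidden, w2)) + b2
```

Because of this monotonicity, each agent can pick the action that maximizes its own value from its local observation at execution time, while the global state is needed only during training.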
MARL algorithms are rather strongly optimized for a number of environments, which have become classic for testing, for example, SMAC [43], which uses the StarCraft 2 game. In contrast to single-agent reinforcement learning, ready-for-use implementations in open access are much fewer, and available ones are suitable only for rather simple problems. The main cause is that they represent slow implementations without parallelization intended for only several millions of steps in the environment and for simple fully connected architectures as approximators. As a result, most researchers prefer using well-known decentralized approaches. However, this leads to another extreme, namely, the proposed algorithms exploit subject area knowledge, which limits their applicability to a broad class of MAPF problems.
A promising direction in the development of more advanced methods for MAPF can be model-based reinforcement learning [44]. Predicting the other agents' policies and taking this model into account when constructing an agent's own policy can be especially useful in a heterogeneous group of agents, where each agent can have its own policy [45]. Another open niche in MARL is the use of demonstrations in learning. Indeed, demonstrations can significantly accelerate the learning process and can also allow using modern transformer and diffusion models. This is especially important for MAPF problems, for which there are strong planning algorithms.

EXPERIMENTAL ENVIRONMENTS FOR TESTING ALGORITHMS
Experimental online environments are hardly used in nonlearnable approaches to the solution of MAPF, since these approaches do not assume learning via the interaction with the environment. An opposite situation occurs in the reinforcement learning community, where there are numerous environments, but most of them are intended for games and are characterized by numerous additional features that have nothing to do with MAPF (e.g., stocks, counteracting opponents, etc.) (see Fig. 1).
An example of a game environment is NeuralMMO [46], which is a simplified version of a multiplayer network game with a group of agents solving the task of survival and resource accumulation. A team of eight agents competes with 15 other teams on a procedurally generated map of 128 by 128 cells. The environment is partially observable, but the agents can communicate with each other. Although this problem is rather complicated, it is far from practical application and requires mainly reactive choices of actions based on a set of rules, rather than planning or path finding.
A well-known environment designed specifically for MAPF is Flatland [48], which is a simplified, yet realistic environment for scheduling railway networks. Here, agents are trains, which are to move from one station to another in a single-track railway, avoiding conflicts with each other. Within the framework of this problem, several competitions have been conducted in order to research reinforcement learning algorithms. However, it turned out that access to the full state of the environment provides a significant advantage for planning and replanning approaches [49]. Another shortcoming of this environment is that it works very slowly (about 200 steps per second for small maps) in the regime of observations intended for learnable algorithms.
MAGENT is a set of environments from the PettingZoo library [47]. It is designed for modeling the role behavior of agents capable of moving from one place to another and interacting with each other in various ways. The implementation in C++ significantly accelerates the interaction, but this environment possesses a limited set of scenarios (map types) and has no interface for testing solutions based on approaches other than reinforcement learning.
The most suitable environment for MAPF problems is POGEMA [50], which was specially designed for problems in the partially observable setting on grid maps. The authors emphasize that agents receive information only from a bounded space around them and cannot transfer information to each other, which considerably complicates the problem for both planning and learnable algorithms. The main advantages of this environment are its flexibility and speed. POGEMA allows for any user-created maps of obstacles and supports three regimes determined by what happens after an agent reaches its goal: the agent receives a new goal (lifelong path finding), the agent disappears, or the agent remains on the map until the end of the episode.
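The three regimes described above can be summarized by a toy sketch. It does not use or reproduce the POGEMA interface: the enumeration OnGoal, the function handle_goal_reached, and the dictionary-based agent record are hypothetical names introduced only to show how the behavior after reaching a goal differs between the regimes.

```python
import random
from enum import Enum


class OnGoal(Enum):
    NEW_GOAL = "new_goal"    # lifelong setting: the agent is given another goal
    DISAPPEAR = "disappear"  # the agent is removed from the map
    STAY = "stay"            # the agent remains on the map until the episode ends


def handle_goal_reached(agent, regime, free_cells, rng):
    """Update a toy agent record (a dict) once its position equals its goal."""
    if agent["position"] != agent["goal"]:
        return agent  # the goal has not been reached yet
    if regime is OnGoal.NEW_GOAL:
        candidates = [c for c in free_cells if c != agent["position"]]
        agent["goal"] = rng.choice(candidates)
    elif regime is OnGoal.DISAPPEAR:
        agent["active"] = False
        agent["position"] = None  # the cell becomes free for other agents
    # OnGoal.STAY: nothing changes; the agent keeps occupying its goal cell.
    return agent
```

In the last regime, a finished agent continues to act as an obstacle for the others, which is the reason why the choice of regime affects the difficulty of the problem.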

CONCLUSIONS
In recent years, methods for MAPF have been actively developed, motivated by their applications in various practical areas (warehouse robotics, transportation systems, etc.). Under centralized control and full observability, the solution approach usually involves nonlearnable methods based on heuristic search or on reducing MAPF to other classical problems in computer science (SAT, network flow, etc.). In the case when centralized control is absent and/or full information on the environment is not available to agents, one often applies reinforcement learning methods based either on the adaptation of well-known learning strategies for individual agents or on the "centralized training-decentralized execution" paradigm. In our view, the most promising (and least studied) would be a combined approach involving both reinforcement learning and classical planning methods (heuristic search, etc.).

CONFLICT OF INTEREST
The authors declare that they have no conflicts of interest.

OPEN ACCESS
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.