A Communication Schema for Parallel and Distributed Multi-agent Systems Based on MPI
Interest in Multi-Agent Systems (MAS) is growing rapidly, especially for simulating and observing large and complex systems. Centralized machines, however, do not offer enough capacity to simulate large models, and parallel clusters can overcome these limits. Nevertheless, the use of parallel clusters implies constraints such as a mono-threaded process of execution, reproducibility and coherency. In this paper, our contribution is an MPI-based communication schema for Parallel and Distributed MAS (PDMAS) that fits the requirements of High Performance Computing (HPC) on clusters. Our communication schema integrates agent migration between processes and guarantees message delivery in case of agent migration.
Keywords: Multi-Agent simulation · Parallelism · Communication schema
1 Introduction

Interest in Multi-Agent Systems (MAS) is growing rapidly, especially for simulating and observing large and complex systems. Centralized machines, such as desktop computers, do not however offer enough capacity to simulate the expected models: they lack memory or their processors are not powerful enough. Parallel machines such as clusters or networks of workstations can overcome these limits. Nevertheless, using a cluster or a network of workstations implies managing constraints, such as distribution, load balancing, migration, coherency or inter-processor communications, that do not arise on a single workstation. Efficiently using these platforms, to get good performance, also requires relying on adapted software stacks. This clearly means that the MPI interface must be used as a base for parallelism and communication.
Using MPI, with its mono-threaded process execution model, to run MAS is however a challenge. The contribution of this paper is an MPI-based communication schema for Parallel and Distributed MAS (PDMAS) that fits the requirements of High Performance Computing (HPC) on clusters. Our communication schema integrates agent migration between processes and guarantees message delivery in case of agent migration.
This article is organized as follows. In Sect. 2, we detail the related work on Multi-Agent Systems and parallel execution contexts. Then, in Sect. 3, we identify some gaps and limits in existing Parallel and Distributed Multi-Agent platforms. In Sect. 4, we give an overview of our proposition, which we detail in Sect. 5 for the communication schema and in Sect. 6 for the proxy system used to follow mobile agents. In Sect. 7, we present the performance results obtained with our proposition, based on the same model used in the survey on PDMAS. We finish the paper with conclusions and future work.
2 Related Work
Multi-Agent Systems are platforms which provide support to run simulations based on several autonomous agents. Among the best-known platforms we can cite NetLogo, MadKit, Mason and Gama. For large models, these platforms are sometimes no longer sufficient to run simulations, in terms of memory and computation power. This is, for example, the case when simulating the individual behaviour of urban mobility in a large city. In some cases, increasing the size or the precision of models is however necessary to find emergent behaviours that we would never expect or never have seen otherwise. For this reason, several Parallel and Distributed Multi-Agent Platforms exist, such as RepastHPC, D-Mason, Pandora, Flame and JADE. These platforms provide native support for parallel execution of models, that is to say, support for collaboration between executions on several physical nodes, distribution of agents, communication between agents and so on. All existing PDMAS platforms propose mechanisms for agents to communicate during the simulation, but only with agents executed on the same process or in the buffer zone, the zone shared between two adjacent processes. For example, RepastHPC proposes mechanisms to request a copy of a remote agent from another process, but if the copied agent is modified, the modifications are never reported to the remote agent. In other words, there are neither synchronization mechanisms to apply modifications nor communication mechanisms to communicate directly with remote agents. Only the Flame platform allows communication with remote agents executed on other processes. To perform inter-process communications, Flame uses its own MPI-based communication library, the Message Board Library (libmboard). Each process which participates in the simulation has an instance of libmboard in order to perform synchronizations and communications.
One of the advantages of libmboard is that sending and receiving messages is a non-blocking process. It thus allows much of the communication time to be overlapped with computation. The Flame platform however offers low performance on clusters compared to other platforms (RepastHPC, D-MASON...), as shown in . In addition, Flame uses a proprietary programming paradigm (X-Machine) that cannot easily be adopted by modellers accustomed to standard languages such as C or Java.
3 Implementing PDMAS on HPC Platforms
Targeting high performance computing implies some constraints on MAS implementation. Usually, PDMAS platforms are implemented using the Single Program Multiple Data (SPMD) programming paradigm in order to provide scalability. MAS simulations generally involve several tens of thousands of agents which potentially communicate with each other at each simulation time step. Communication is thus a key bottleneck as, in a parallel context, the running time is affected by these frequent communications. As the de facto standard communication infrastructure on HPC clusters is the Message Passing Interface (MPI), it is important to take care of the properties of the communication primitives in order to reduce the communication overhead. This constraint combines with another: having only a single mono-threaded process of execution on each allocated core, as imposed by most batch systems such as SGE or SLURM. Using a single process of execution implies that we cannot use mechanisms like listeners, onEvent or onMessage to communicate, because we cannot dedicate one thread to waiting for messages while the agent set executes. Messages must thus either be received by issuing non-blocking receives during the execution, or at the end of the execution with a blocking receive. This illustrates the complexity of using a mono-threaded execution model to implement asynchronous communications.
Note however that, even when delivering messages at time step \(n+1\), the stochastic nature of MAS makes it difficult to provide an absolute guarantee during the simulation, as illustrated in Figs. 1 and 2. Figure 1 shows a case of indeterminism in message receive order. Let \(p_1\), \(p_2\) and \(p_3\) be three processes executing a part of a simulation. If both \(p_2\) and \(p_3\) send a message at the same time to \(p_1\), we cannot know in which order they must be applied on \(p_1\). On a centralized system these messages could be stamped with a clock value that differentiates them; in a parallel or distributed context we cannot rely on such a clock value. Figure 2 illustrates the need for receive phases between time steps. Agent A1 is scheduled at the end of the time step and sends a message to agent A2 at that time. If the message is delayed on the network and agent A2 is scheduled at the beginning of the next time step, then agent A2 may miss its message at time step \(n+1\). For this reason, it is important to define receive phases at the end of each time step.
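One common way to restore a reproducible order, sketched below as an assumption rather than as the paper's algorithm, is to buffer all messages of a step and sort them by a deterministic key, e.g. (sender rank, per-sender sequence number), before applying them:

```python
def deterministic_order(messages):
    """Sort buffered messages into a reproducible order.

    Each message is a tuple (sender_rank, seq_no, payload), where seq_no
    is a per-sender counter. The key (sender_rank, seq_no) is unique, so
    the resulting order is independent of network arrival order.
    """
    return sorted(messages, key=lambda m: (m[0], m[1]))


# Two different arrival orders of the same messages...
arrival_a = [(2, 0, "x"), (1, 0, "y"), (1, 1, "z")]
arrival_b = [(1, 1, "z"), (2, 0, "x"), (1, 0, "y")]
# ...yield the same application order on every process.
assert deterministic_order(arrival_a) == deterministic_order(arrival_b)
```

This trades the (unreliable) physical clock for purely logical information carried by the message itself.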
As underlined previously, the full functionality of being able to communicate with every process of the simulation is only supported by Flame, while other PDMAS limit it to the local process or neighbour processes. We advocate inter-agent remote communications in PDMAS for two reasons. First, in models focusing on individual motion, as in city/urban mobility, agents may need to keep communicating with their contacts while moving. Due to the distribution of the simulation, agents can move anywhere in the environment, onto different processes, and thus must be able to communicate with every process. Second, in graph-based models, limiting agent communication to the neighbourhood leads to complex mapping constraints: non-planar graphs cannot easily be mapped onto grids while keeping neighbourhood constraints.
4 Proposition Overview

From these reflections, we propose a communication schema for PDMAS that allows local and distant communications between agents without paying attention to agent location. Our communication schema allows reproducibility and guarantees that each message sent at time step \(n\) is received at the beginning of time step \(n+1\).
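The delivery guarantee can be pictured as a per-step mailbox: messages received during step \(n\) are buffered and only handed to agents at the beginning of step \(n+1\). The following minimal Python sketch illustrates the idea; the class and method names are ours, not the platform's API:

```python
from collections import defaultdict

class StepMailbox:
    """Buffers messages received during step n, delivers them at step n+1."""

    def __init__(self):
        self._pending = defaultdict(list)   # recipient agent GID -> messages

    def post(self, recipient_gid, payload):
        """Called during step n whenever a message arrives."""
        self._pending[recipient_gid].append(payload)

    def deliver(self):
        """Called at the beginning of step n+1: hand over the buffer and reset."""
        delivered, self._pending = self._pending, defaultdict(list)
        return delivered
```

Because agents never see a message in the step it was sent, the result of step \(n\) cannot depend on network timing, which is what makes reproducibility possible.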
In Multi-Agent Simulations, agents must be identifiable. In PDMAS, agents must be identifiable regardless of their process. For this reason, we associate with each agent a System ID, inspired by RepastHPC and presented in Fig. 3. This System ID is composed of four values: a globally unique ID (GID), the ID of the process which created the agent (OwnProc), the ID of the process on which the agent is currently executed (ExeProc) and the ID of the agent type (Type). With this System ID, we can know at any time where an agent was created and where it is currently executed.
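A minimal sketch of such a System ID, mirroring the four fields of Fig. 3 (the `migrated` helper is our own illustration of how only ExeProc changes when an agent moves):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SystemID:
    """Identifier of an agent in a distributed simulation (cf. Fig. 3).

    gid      -- globally unique agent ID (GID)
    own_proc -- rank of the process that created the agent (OwnProc)
    exe_proc -- rank of the process currently executing it (ExeProc)
    type_id  -- numeric agent type (Type)
    """
    gid: int
    own_proc: int
    exe_proc: int
    type_id: int

    def migrated(self, new_proc: int) -> "SystemID":
        """Return the ID after a migration: only exe_proc changes."""
        return SystemID(self.gid, self.own_proc, new_proc, self.type_id)
```

Keeping OwnProc immutable is what later allows the proxy system (Sect. 6) to always find the process that holds an agent's trace.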
ACs (agent containers) execute four different phases at each time step to implement all the communication requirements: (1) run agent behaviours, (2) receive messages, (3) migrate agents, and (4) update agents. In this paper we focus only on phases 2 and 4, that is to say sending and receiving messages, and the agent updates that allow communication even when agents move in the environment. Phase 1 does not differ from other MAS and phase 3 is not necessary for understanding the contribution. The communication schema, the core of the contribution, is presented in the next two sections: Sect. 5 details the communication schema between agents and Sect. 6 details remote communication with mobile agents.
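The four phases above can be sketched as a fixed per-step loop; the hook names below are assumptions about a container's interface, not the platform's real API:

```python
def run_time_step(container):
    """Execute one simulation time step of an agent container (AC).

    The four phases must run in this fixed order on every process so
    that all messages of step n are in place before step n+1 starts.
    """
    container.run_behaviours()      # phase 1: execute agent behaviours
    container.receive_messages()    # phase 2: termination-bounded receive
    container.migrate_agents()      # phase 3: move agents between processes
    container.update_agents()       # phase 4: update proxies of moved agents
```

Because every AC runs the same loop (SPMD style), the receive phase of step \(n\) acts as the delivery point for all messages sent during that step.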
5 Communication Schema (Receive Message Phase)
As said previously, for coherency and reproducibility reasons, every message sent at time step \(n\) must be received at the beginning of time step \(n+1\). Due to the stochastic nature of agents, we cannot however know how many messages must be received, and from how many processes, at the end of a time step, and so we cannot know when processes can proceed to the next time step. For this reason we must use a termination algorithm. MPI proposes synchronization mechanisms like the barrier, a synchronization point, but they do not solve this problem. If we use a barrier to bound time steps, some faster processes will reach the synchronization point and then block until the last process reaches it. Processes that have not yet reached the barrier could however send new messages that will not be processed by the recipient processes, as these are blocked on the barrier. Thus messages are lost. This is the reason why the mechanisms proposed by MPI are not sufficient in this case.
To overcome this problem, we use a termination algorithm to reach an agreement between processes so that they switch to the next time step at the right time: when all processes have finished processing the current time step. Our termination algorithm is based on a bi-directional ring with a coordinator. We use the bi-directional ring to check that all processes have finished receiving their messages and that all processes can proceed to the next time step. We chose a bi-directional ring instead of a single ring for efficiency: the path is divided into two halves traversed in parallel.
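As an illustration of the halved path, here is a sequential simulation of the coordinator-side decision (not an MPI implementation, and not the paper's Algorithm 1): the coordinator, rank 0, sends one token clockwise and one counter-clockwise, and each token aggregates the "done" flags of half the ring.

```python
def ring_terminated(done_flags):
    """Simulate the coordinator's check on a bi-directional ring.

    done_flags[i] is True once process i has finished its receive
    phase. The clockwise token visits ranks 1..n//2, the
    counter-clockwise token visits ranks n//2+1..n-1, so the longest
    token path is roughly half of a single ring's.
    """
    n = len(done_flags)
    half = n // 2
    clockwise = all(done_flags[1:half + 1])    # first half of the ring
    counter = all(done_flags[half + 1:])       # second half of the ring
    return done_flags[0] and clockwise and counter
```

The time step advances only when both tokens return with every flag set; a single unfinished process keeps the whole ring in the current step.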
In this way, time steps are bounded by the message receive phase. This algorithm works fine for agents without mobility. In case of agent mobility, an AC needs to know where the target agent runs in order to deliver a message, as agents may move from one process to another. We explain the algorithm used to overcome this problem in the next section.
6 Proxy System (Agents Update Phase)
Multi-Agent Systems often use mobile agents: agents that are not fixed in the environment but move. This is, for instance, the case of the wolves and sheep in the classical prey-predator model. In PDMAS it is necessary to distribute the environment over several processes. With agent mobility, agents may move from one process (or AC) to another to keep the continuity of the environment and thus perform their behaviour. If we want to send a message to an agent, we need to know on which process the agent now runs. To respect the single-threaded process constraint of the HPC context, we use a Proxy System (PS) which consists in keeping a trace of each agent on the process which created it at the beginning of the simulation. This trace is updated during the simulation.
Algorithm 2 details how the proxy for agents is updated. We update the PS when an agent moves from one process (or AC) to another. Each AC contains a hashmap of proxies for the agents it created at the beginning of the simulation. When an agent is going to move, the container checks, thanks to the agent's System ID, whether it created this agent itself. If so, the AC changes, in its proxy hashmap, the process on which the agent is currently executed to the process onto which the agent will move. Otherwise, the AC sends a message (containing the process onto which the agent will move) to the creator of this agent, in order to inform that AC that one of its agents is moving from one process (or AC) to another.
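The two branches of this update can be sketched as follows; the class, the `send` callback and the message format are our own assumptions, used only to make the control flow of Algorithm 2 concrete:

```python
class ProxySystem:
    """Per-container trace of where each created agent currently executes."""

    def __init__(self, my_rank, send):
        self.my_rank = my_rank
        self.send = send          # send(dest_rank, message) callback (assumed)
        self.proxies = {}         # agent GID -> rank currently executing it

    def register(self, gid):
        """Record an agent created on this container at simulation start."""
        self.proxies[gid] = self.my_rank

    def on_migration(self, gid, own_proc, new_rank):
        """Called when agent (gid, own_proc) is about to move to new_rank."""
        if own_proc == self.my_rank:
            # We created this agent: update our own proxy entry.
            self.proxies[gid] = new_rank
        else:
            # Another container created it: notify the creator.
            self.send(own_proc, ("moved", gid, new_rank))

    def handle(self, message):
        """Apply a 'moved' notification received from another container."""
        tag, gid, new_rank = message
        if tag == "moved":
            self.proxies[gid] = new_rank
```

Since OwnProc never changes, any AC that wants to reach an agent can always query the agent's creator, whose proxy entry is kept up to date by these notifications.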
7 Performance Evaluation

In this section we present some results obtained with our communication schema for Parallel and Distributed Multi-Agent simulation. We have implemented the communication schema in the Parallel and Distributed Multi-Agent Platform called FractalPMAS that we are developing. To assess the performance of the communication schema, we have implemented a reference model defined, and already implemented in the best-known PDMAS platforms, in . This model respects the main properties that can be found in Multi-Agent Systems: perception, communication and mobility. In this model each agent executes three behaviours at each time step: a walk behaviour which moves the agent in a random direction in the environment, an interact behaviour which allows agents to interact and send messages to other agents, and finally a compute behaviour which generates a workload.
In this reference model we have also implemented a way to evaluate the performance of our remote communication schema between agents: instead of sending messages only to agents in its perception field, each agent sends messages to randomly chosen agents which run on other processes.
Regarding the HPC experimental settings, we ran the reference model on a 764-core cluster. Each node of the cluster is a bi-processor node with Xeon E5 processors (2 × 8 cores) running at 2.6 GHz and with 32 GB of memory. The nodes are connected through a non-blocking DDR InfiniBand network organized as a fat tree.
As we can see, both configurations scale well, even if local communication scales better; the difference between the two speedups is not clearly noticeable. Obviously, remote communication offers lower performance due to the intensive exchanges between processes. Figure 7 shows the running time of simulations over 200 time steps for local and remote communications.
(Fig. 7. Comparison of running time for the platforms studied in , with 10,000 agents: FPMAS (local com.) and FPMAS (remote com.).)
8 Conclusion and Perspectives
In this paper we have presented a communication schema for Parallel and Distributed Multi-Agent simulation that fits the constraints set by HPC systems. This communication schema is based on the MPI communication interface and allows communication with local and remote agents. Our contribution aims at proposing a communication schema which offers more efficiency while guaranteeing properties such as reproducibility and coherency.
In future work, we intend to examine more closely the efficiency of synchronization using our communication schema and to improve its scalability. More improvements could also be made to the implementation, which is only a proof of concept. We will then use this platform to assess load balancing in PDMAS.
Computations have been performed on the supercomputer facilities of the Mésocentre de calcul de Franche-Comté.
References
- 1. Angelotti, E.S., Scalabrin, E.E., Ávila, B.C.: Pandora: a multi-agent system using paraconsistent logic. In: Computational Intelligence and Multimedia Applications, ICCIMA 2001, pp. 352–356. IEEE (2001)
- 2. Bellifemine, F., Poggi, A., Rimassa, G.: JADE - a FIPA-compliant agent framework. In: Proceedings of PAAM, vol. 99, p. 33, London (1999)
- 3. Chipeaux, S., Bouquet, F., Lang, C., Marilleau, N.: Modelling of complex systems with AML as realized in the MIRO project. In: IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), vol. 3, pp. 159–162 (2011)
- 4. Coakley, S., Gheorghe, M., Holcombe, M., Chin, S., Worth, D., Greenough, C.: Exploitation of HPC in the FLAME agent-based simulation framework. In: Proceedings of the 2012 IEEE 14th International Conference on HPC and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems, HPCC 2012, pp. 538–545. IEEE Computer Society, Washington, DC (2012)
- 5. Collier, N., North, M.: Repast HPC: A Platform for Large-Scale Agent-based Modeling. Wiley, Hoboken (2011)
- 8. Gutknecht, O., Ferber, J.: MadKit: a generic multi-agent platform. In: Proceedings of the Fourth International Conference on Autonomous Agents, pp. 78–79. ACM (2000)
- 11. Rousset, A., Herrmann, B., Lang, C., Philippe, L.: A survey on parallel and distributed multi-agent systems. In: Lopes, L., et al. (eds.) Euro-Par 2014, Part I. LNCS, vol. 8805, pp. 371–382. Springer, Heidelberg (2014)
- 12. Taillandier, P., Vo, D.-A., Amouroux, E., Drogoul, A.: GAMA: a simulation platform that integrates geographical information data, agent-based modeling and multi-scale control. In: Desai, N., Liu, A., Winikoff, M. (eds.) PRIMA 2010. LNCS, vol. 7057, pp. 242–258. Springer, Heidelberg (2012)
- 13. Tisue, S., Wilensky, U.: NetLogo: design and implementation of a multi-agent modeling environment. Proc. Agent 2004, 7–9 (2004)