The PageRank algorithm as a method to optimize swarm behavior through local analysis
Abstract
This work proposes PageRank as a tool to evaluate and optimize the global performance of a swarm based on the analysis of the local behavior of a single robot. PageRank is a graph centrality measure that assesses the importance of nodes based on how likely they are to be reached when traversing a graph. We relate this, using a microscopic model, to a random robot in a swarm that transitions through local states by executing local actions. The PageRank centrality then becomes a measure of how likely it is, given a local policy, for a robot in the swarm to visit each local state. This is used to optimize a stochastic policy such that the robot is most likely to reach the local states that are “desirable,” based on the swarm’s global goal. The optimization is performed by an evolutionary algorithm, whereby the fitness function maximizes the PageRank score of these local states. The calculation of the PageRank score only scales with the size of the local state space and demands much less computation than swarm simulations would. The approach is applied to a consensus task, a pattern formation task, and an aggregation task. For each task, when all robots in the swarm execute the evolved policy, the swarm significantly outperforms a swarm that uses the baseline policy. When compared to globally optimized policies, the final performance achieved by the swarm is also shown to be comparable. As this new approach is based on a local model, it natively produces controllers that are flexible and robust to global parameters such as the number of robots in the swarm, the environment, and the initial conditions. Furthermore, as the wallclock time to evaluate the fitness function does not scale with the size of the swarm, it is possible to optimize for larger swarms at no additional computational expense.
Keywords
Swarm robotics Microscopic Micro–macro link Evolutionary algorithm Pattern formation Consensus Aggregation Local PageRank Centrality1 Introduction
 1.
The computational load required to execute the simulations increases with the size of the swarm;
 2.
It may take longer for the desired behavior to emerge, requiring a longer simulation time for each evaluation trial, especially in the initial stages of the evolution;
 3.
The evolved policy may be overfitted to the global parameters used during the simulation, such as the number of robots, the initial conditions, or the environment;
 4.
A solution needs to be simulated multiple times in order to reliably assess the expected performance of a given behavior (Trianni et al. 2006). Avoiding reevaluation may result in poor behaviors being erroneously assigned a higher fitness thanks to one lucky run, which may ultimately result in a performance drop (Di Mario et al. 2015a, b);^{1}
 5.
Consensus agreement (Sect. 4) In this task, we optimize the behavior of a swarm that must achieve consensus between multiple options. Each robot can sense the opinion of its neighbors and, based on a stochastic policy, decide whether it should change its opinion (and, if so, what to change its opinion to). Using PageRank, we optimize the stochastic policy so as to help the swarm achieve consensus as quickly as possible. The policy is optimized independently of the size of the swarm or its spatial configuration. We further optimize a more limited variant of this task whereby robots can not sense the opinion of their neighbors, but can only sense whether they are in agreement with their neighborhood or not.
Pattern formation (Sect. 5) For this task, we optimize the performance of a swarm of robots with the global goal of arranging into a desired spatial configuration. The robots in the swarm have very limited knowledge of their surroundings. This section extends our recent work published in Coppola and de Croon (2018), where we first attempted to optimize the behavior of a swarm in a pattern formation task, yet quickly encountered scalability problems for larger swarms/patterns. The scalability problems were a result of the fact that the swarm had to be simulated in order to assess the efficacy of a controller. This was infeasible for larger swarms, especially in early generations where performance is poor. Using the PageRank algorithm, we can now tackle these scalability issues, as it is no longer needed to simulate the swarm in order to assess the fitness of a behavior.
Aggregation (Sect. 6) In this task, we study a swarm of robots in a closed arena which should aggregate in groups of three or more. In comparison with the other two tasks, this optimization tunes a higherlevel policy featuring two subbehaviors (random walk and stop) with a probability that is dependent on the number of neighbors that a robot in the swarm can sense. The final achieved policy allows the swarm as a whole to aggregate successfully.
2 Context and related work
In state of the art, the problems of scalability discussed in the introduction have mostly been tackled in two ways. First, there are methods that try to deal with the broad solution space that comes as the number of robots increases. For example, Gomes et al. (2012) used novelty search to encourage a broader exploration of the solution space. The second way is to use global insights to aid the evolutionary process. For example, Duarte et al. (2016) partitioned complex swarm behavior into simpler subbehaviors. Hüttenrauch et al. (2017), with a focus on deep reinforcement learning, used global information to guide the learning process toward a solution. Alternatively, Trianni et al. (2006) and Ericksen et al. (2017) explored whether evolved behaviors for smaller swarms could generalize to larger swarms. In all cases, the need to simulate the swarm remains a bottleneck, for both evaluating the behavior and generalizing the behavior beyond the parameters used in simulation. In this work, we offer a different solution which discards simulation and exploits a micro–macro link based on evaluating the relative PageRank score between local states using only a local model of a single robot in the swarm. It extracts performance parameters without simulation or propagation from an initial condition and only relies on an analysis of the model for a given policy.
This approach differs from other swarm modeling and evaluation solutions found in the literature, such as the multilevel modeling framework introduced by Lerman et al. (2001), Martinoli and Easton (2003), and Martinoli et al. (2004). There, the idea is to model the evolution of a swarm via probabilistic final state machines propagated in time. At the microscopic level, one probabilistic finite state machine is propagated from an initial condition for each robot in the swarm. At the macroscopic level, which can be readily abstracted from the microscopic model, the finite state machine describes the mean transition of robots between states. The macroscopic model probabilistically describes, at the global level, how many robots are in each state at a given point in time. This can also be expressed in terms of rate equations. For each level, the relevant transition rates between states are extrapolated from an analysis of the policy, as well as from geometric reasoning (e.g., based on the size of the arena and the expected robot density that ensues) (Martinoli et al. 2004) or from empirical data (Berman et al. 2007). To predict the global evolution of the swarm, the models are propagated in time from an initial condition, essentially simulating the swarm at an abstract level. For certain tasks, these models have been shown to be highly effective in predicting the general behavior of a swarm (Lerman et al. 2005). However, their accuracy and applicability are limited by the validity of the global assumptions which define the transition rates. For instance, to estimate the probability of interaction between any two robots, one assumption is that the system is “well mixed,” meaning that the robots are all equally distributed within an area at all times and always have the same probability of encountering a neighbor. For reasons such as this, their use has been largely limited to the evaluation of tasks with few states and by swarms in bounded arenas. Examples where these models have been used are the modeling of collective decision making (Hamann et al. 2014; Reina et al. 2015), area exploration/foraging (Correll and Martinoli 2006; Campo and Dorigo 2007), or keeping an aggregate (Winfield et al. 2008).^{2} For other tasks where the level of detail required is higher and/or where the global goal is achieved by a combination of several states, rather than all robots being in one state, this approach does not provide a sufficient level of detail. One such example is pattern formation, as treated in this paper, whereby the global goal is to achieve a particular spatial arrangement which cannot be described by such models.
Another class of macroscopic models, focused on spatial movements, models a swarm by using a diffusion model based on the Fokker–Planck equation (Hamann and Wörn 2008). This approach macroscopically models the general motion of a swarm under the assumption of Brownian motion. The Fokker–Planck equation provides an estimate for the density of robots in the environment as a result of the robots’ motion. Prorok et al. (2011) explored the use of rate equations together with a diffusion model. This made it possible to study swarms in more complex environments where the swarm may not be well mixed, which otherwise causes drift errors (Correll and Martinoli 2006). However, there were still limitations on how to approximate more complex behaviors by the robots, which, by the nature of the assumption of Brownian motion, were limited to random walks.
The use of macroscopic models to optimize the behavior of robots in a swarm was explored by Berman et al. (2007, 2009, 2011) using models similar to those of the rate equations by Martinoli and Easton (2003). In these models, the fraction of robots in each state is modeled together with the rates at which the robots will transition between these states. It is then possible to optimize the transition rates such that the macroscopic model, on average, shows that the swarm settles in certain states of interest. In (Berman et al. 2009) this is a done for the problem of task allocation, where it is also proven that all robots in the swarm will settle to a desired global equilibrium. However, as for rate equations, this approach is limited by the global level assumptions that are being taken in order to describe the mean of the swarm (Berman et al. 2011). Moreover, the outcome from such approaches can also be dependent on the initial condition of the swarm (Hsieh et al. 2008).
Macroscopic approaches thus make it possible to approximate how a swarm can evolve in time. They purposely refrain from incorporating the detailed experience of a single robot and rather approximate the mean evolution of the entire swarm. This makes them effective prediction tools, but they are limited in their ability to describe interactions to a higher level of detail. In contrast to multilevel models, the PageRank framework captures state transitions probabilistically, rather than temporally. We can apply this to analyze the impact of the actions by a single robot in a dynamic environment. This makes it able to tackle more specific tasks such as consensus agreement or pattern formation, with results that are found to be scalable, flexible, and robust to initial conditions or the number of robots.
3 PageRank centrality as a micro–macro link for swarm robotics
Centrality measures assess the relative importance of nodes in a graph based on the graph’s topology. Several of these measures exist, which capture centrality from different perspectives. For example, degree centrality computes the importance of nodes based on the number of edges connected to them. Alternatively, closeness centrality measures the average shortest path between a node and all other nodes.^{3} This paper deals with PageRank centrality. This is a graph centrality measure for directed graphs that measures the importance of each node recursively. This means that the PageRank centrality of a node is a function of the centrality of the nodes pointing to it, the centrality of the nodes pointing to those nodes, and so forth. This recursiveness indirectly accounts for the topology of the entire network, and it models how likely the node is to be reached when traversing the graph.
In this section, we detail how the PageRank centrality algorithm works (Sect. 3.1) and then explain how it can be used to microscopically model the possible state transitions of a robot in a swarm (Sect. 3.2). In Sect. 3.3, we then introduce a PageRankbased fitness function. This fitness function assesses, at a local level, a swarm’s ability to achieve a desired global goal. It will be used to optimize the behavior for all tasks treated in this paper.
3.1 A review of PageRank centrality
\(\mathbf {H}\) is the adjacency matrix of graph G. \(\mathbf {H}\) is a substochastic transition probability matrix that describes the probability of transitioning between the nodes of graph G. For the example of the World Wide Web, \(\mathbf {H}\) holds the hyperlink structure of the Web and it models how a Web surfer can navigate the Web by using hyperlinks.
\(\mathbf {D}\) is the dangling node matrix. The dangling nodes of graph G are nodes with no outgoing edges (e.g., no hyperlinks), which result in empty rows in \(\mathbf {H}\). For the World Wide Web, if a user reaches a Webpage with no hyperlinks, then the user will resort to writing the name of a Webpage in the address bar. \(\mathbf {D}\) describes these transitions and the probabilities thereof. The combined matrix \(\mathbf {S}=\mathbf {H}+\mathbf {D}\) is a stochastic matrix.
\(\mathbf {E}\) is known as the teleportation matrix. This is an additional element that describes random transitions between nodes that are not captured by the topology of the graph G. For the World Wide Web, these transitions model how a Web surfer may choose, at any moment and on any Webpage, to manually type the address of a Webpage in the address bar instead of clicking through hyperlinks. In this case, the user will “teleport” to another Webpage regardless of whether a hyperlink to that Webpage exists. Note that matrices \(\mathbf {D}\) and \(\mathbf {E}\) have a nonnull intersection. The information in \(\mathbf {D}\) is included in \(\mathbf {E}\). Both matrices describe transitions that occur as a result of a user typing the name of a Webpage in the address bar, except that \(\mathbf {D}\) only holds those transitions for the cases where the user, having reached a dangling node, has no other option than to teleport.
\(\alpha \) is known as the expansion factor, where \(0\le \alpha \le 1\), which models the probability that a user follows hyperlinks as opposed to accessing a Webpage directly via the address bar. Note how Eq. 3 consists of two complementary terms. The first term models the transitions via hyperlinks (unless no hyperlinks exist as in the case of dangling nodes). The second term models “teleportation,” described via \(\mathbf {E}\). \(\alpha \) describes the relative probability between these two transition types. If \(\alpha =1\), the user always only follows hyperlinks (when they are available). If \(\alpha =0\), the user never follows hyperlinks and only teleports through the Web by typing Webpages in the address bar. Brin and Page (1998) advised to set \(\alpha =0.85\). Note that the simplified version of PageRank from Eq. 1 featured \(\alpha =1\).
3.2 Using PageRank to model swarms
Let \({\mathcal {S}}\) be the local state space of a robot. This is the set of local, discretized states that a robot can observe using its onboard sensors. The local states are going to be analogous to Webpages.
Let \({\mathcal {A}}\) be the set of all discrete local actions that a robot can take.
Let \({\Pi }\) be a stochastic policy (a stochastic map between states \({\mathcal {S}}\) and actions \({\mathcal {A}}\)) that the robot follows. This stochastic policy is analogous to the hyperlinks in the Web, as it allows robots to travel through local states.
\(G_{\mathcal {S}}^{a}\), whereby the superscript a stands for “active,” holds all state transitions that a robot could go through by an action of its own based on the stochastic policy \({\Pi }\). The edges of this graph are weighted based on the relative probabilities in the stochastic policy \({\Pi }\) and thus represent how likely it is that the robot will take the action when in a given local state.
\(G_{\mathcal {S}}^{p}\), whereby the superscript p stands for “passive,” holds all state transitions that a robot could go through because of changes in its environment, independently from \({\Pi }\).
 \(\mathbf {H}\) shall model how a robot navigates in its environment by its own actions, describing the possible local state transitions that may take place when the robot executes these actions. This is analogous to a user navigating through Webpages via hyperlinks. We thus define:where \(\mathrm {adj}(G)\) denotes the weighted adjacency matrix of a graph G.$$\begin{aligned} \mathbf {H} = \mathrm {adj}(G_{{\mathcal {S}}}^{a}) , \end{aligned}$$(4)
 \(\mathbf {E}\) models the “environment.” These are state transitions that can happen to the robot even when it is not taking an action. For example, a neighbor could move away and cease to be a neighbor. This is analogous to a Web user who, instead of following hyperlinks through the Web, navigates to another Webpage via the address bar. In this case, the user instigates a state transition that is not described by the hyperlink structure (i.e., the “policy”) of the World Wide Web. The matrix \(\mathbf {E}\) is thus given by:$$\begin{aligned} \mathbf {E} = \mathrm {adj}(G_{{\mathcal {S}}}^{p}) . \end{aligned}$$(5)
 \(\mathbf {D}\) shall model what can happen to a robot as a result of the changing environment around the robot, if and only if the robot reaches a state wherein it cannot (or will not) take any actions. These states are states that the policy \({\Pi }\) does not map to any actions. We group these states in the set \({\mathcal {S}}_{static}\subseteq {\mathcal {S}}\). Reaching such a state is analogous to reaching a Webpage with no hyperlinks. We define \(\mathbf {D}\) as:$$\begin{aligned} \mathbf {D} = \mathrm {adj}(G_{{\mathcal {S}}}^{p} ({S}_{static})). \end{aligned}$$(6)
The expansion factor \(\alpha \) remains to be defined, which models the probability that a robot will take an action over being subject to changes in the environment, as modeled by \(\mathbf {E}\). One may choose to keep \(\alpha \) constant, as originally suggested by Page et al. (1999). For instance, \(\alpha \) could be a constant value that is a function of the robot density. However, this would be a global parameter and would thus not be representative of the local experience of the robots, unlike the remainder of the model. Instead, we propose to make \(\alpha \) a function of the local state that the robot is in. For instance, this may reflect that, if the robot is in a local state whereby it is surrounded by several neighbors, the probability that it undergoes passive transitions due to actions of its neighbors may be higher than if it is in a state whereby it is not surrounded by neighbors at all. While exploring the three tasks analyzed in this paper, we will define different approaches to express \(\alpha \) in more detail.
3.3 Using PageRank to evaluate the performance of the swarm
Once a microscopic model of the swarm has been constructed using the framework described in Sect. 3.2, we can evaluate the PageRank centrality of all states in \({\mathcal {S}}\). This will inform us on how likely it is for a robot in the swarm to reach each local state. In turn, this will be used to evaluate the performance of the whole swarm.
This is where we devise the micro–macro link based on PageRank centrality. The true metric that we wish to optimize is the efficiency of the swarm in achieving a given global goal. However, directly assessing this parameter would require us to simulate the swarm (which may lead to the scalability problems introduced at the beginning of this paper), or use a macroscopic model of the swarm (which may fail to capture the details of the policy). Therefore, instead of directly optimizing the stochastic policy \({\Pi }\) against the global performance metric, we optimize the likelihood that a robot in the swarm will end up in certain local states of interest, i.e., local states that we know should be reached for the global goal to happen. For example, in the aggregation task, we will wish to achieve (and stay in) local states with neighbors over local states without neighbors. Alternatively, in the consensus task, the robot will aim to be in a state of agreement with its neighbors. Because all robots in the swarm are trying to achieve these “desired” local states as efficiently as possible, the global effect will then be that the swarm achieves the global goal more efficiently as well.
4 Sample task 1: consensus agreement
In this first task, a swarm of robots needs to achieve a consensus between a certain set of options. We will use the fitness function described in Sect. 3.3 to optimize the probability with which a robot should change its opinion such that the entire swarm achieves consensus as quickly as possible.
4.1 Task description and setting
Consider a set of N robots that must collectively choose between M options. As an example, these options could be possible sites to visit together with the rest of the swarm. We focus on the case where all options are of equal value, meaning that the robots do not have a preference for any particular option. Let \({\mathcal {C}}\) be the set of M options that the swarm considers. The global goal of the swarm is for all robots to settle on a choice \(c\in {\mathcal {C}}\). We shall assume that the robots are in a static arbitrary connected configuration P.
 1.
The internal opinion of \({\mathcal {R}}_i\), denoted \(c_i\).
 2.
The number of neighbors of \({\mathcal {R}}_i\) with opinion A, denoted \(n_{iA}\).
 3.
The number of neighbors of \({\mathcal {R}}_i\) with opinion B, denoted \(n_{iB}\).
Based on its local state \(s_i\), a robot can choose to keep its opinion or change it. We set the action space \({\mathcal {A}}\) to \({\mathcal {A}}={\mathcal {C}}\). Each robot follows a stochastic policy \({\Pi }\), dictating the probability of choosing an opinion \(c\in {\mathcal {C}}\) (including its current one) for each state \(s\in {\mathcal {S}}\). We then define the set \({\mathcal {S}}_{des}\) to be all states wherein the robot has the same opinion as all of its neighbors, for which we know a priori that there is no need to change opinion, as all robots are locally in agreement. These states are excluded from the stochastic policy \({\Pi }\), creating a global convergence point for the swarm. If all robots are in one of these local state, then it follows that consensus (which is the global goal) has been achieved. The size of the stochastic policy is \({\Pi }  = \left( {\mathcal {S}}   {\mathcal {S}}_{des}  \right) * {\mathcal {A}} \).
4.2 PageRank model
Let us begin by defining \(G_{\mathcal {S}}^a\). In this case, whenever a robot \({\mathcal {R}}_i\) changes its opinion between the M options that are available, only its internal choice changes, while the choice that its neighbors hold is not (directly) impacted by the action. Thus, for the example of \(M=2\), the only parameter of its state that changes is \(c_i\), while \(n_{iA}\) and \(n_{iB}\) remain constant. \({G}_{S}^a\) is thus formed by a set of disconnected subgraphs, each where the robot is only capable of changing its own internal state, but all other parameters stay constant. A sample subgraph of \(G_{\mathcal {S}}^a\) for \(M=2\) is depicted in Fig. 1a.
\({G}_{S}^p\) follows the same philosophy as \({G}_{S}^a\), but from the opposite perspective. The internal choice \(c_i\) of a robot \({\mathcal {R}}_i\) does not change when its neighbors change opinions. What changes is the number of neighbors with a certain opinion. The model assumes that the robot will always notice every time one of its neighbors changes opinion. \({G}_{S}^p\) thus also takes the form of several disconnected subgraphs. An example of a subgraph for \(M=2\) is depicted in Fig. 1b for a robot with three neighbors. The matrices \(\mathbf {H}\), \(\mathbf {E}\), and \(\mathbf {D}\) are extracted from \(G_{\mathcal {S}}^a\) and \(G_{\mathcal {S}}^p\) using Eqs. 4, 5, and 6, respectively.
4.3 Genetic algorithm setup and results
The optimization in this work is done via a genetic algorithm (GA). The GA features a population of ten scalar genomes, where the size of each genome is equal to \({\Pi }\). Each gene in a genome holds a value \(0\le p \le 1\), indicating the probability of executing the corresponding state–action pair from \({\Pi }\). Each new generation is produced by elite members (30%), offspring (40%), and mutated members (30%). Offspring are generated from averaging two parents’ genomes. Mutation replaces 10% of a genome’s genes with random values from a uniform distribution. The initial generation was created by assigning random values between 0 and 1 to all genes in each genome, following a uniform distribution. Note that the cumulative probability of taking any of the actions when in a given state is always normalized, since all choices are part of the policy.
With this setup, we evolved behaviors for the case where \(M=2\) and \(M=3\). The results of the fitness increase over five evolutionary runs are shown in Fig. 2, where we show the best genome that was evolved for both the case where \(M=2\) (Fig. 2a) and the case where \(M=3\) (Fig. 2b).
We now test the results of the evolved behavior to examine the magnitude by which the global performance of the swarm has improved. To do so, the swarm is simulated in discrete time whereby, at each time step, a random robot in the swarm changes its opinion. This models random concurrency in the swarm. For simplicity, we assume that the robots are placed on a grid world. Each robot \({\mathcal {R}}_i\) is capable of sensing the opinion of any robot \({\mathcal {R}}_j\) that happens to be in the eight grid points that surround it. We measure the performance of the swarm as the cumulative number of times that the robots in the swarm take an “action” before consensus is achieved, whereby an “action” is whenever a robot renews its opinion. This is indicative of how quickly the swarm is capable of reaching a consensus to one of the options. The results are compared to a basic baseline behavior where the robots choose between options with equal probability (except for the states \({\mathcal {S}}_{des}\), in which case they are in full agreement with their neighborhood and remain idle). In the following, the baseline behavior is denoted \({\Pi }_0\) and the evolved behaviors are denoted \({\Pi }_1\).
4.4 Variant with limited binary cognition
 \(s_{A0}\):
Internal opinion A, but the robot is not in agreement with all neighbors.
 \(s_{A1}\):
Internal opinion A, and the robot is in agreement with all neighbors.
 \(s_{B0}\):
Internal opinion B, but the robot is not in agreement with all neighbors.
 \(s_{B1}\):
Internal opinion B, and the robot is in agreement with all neighbors.
 \(a_{A}\):
Select opinion A
 \(a_{B}\):
Select opinion B
Baseline policy and evolved policy for the binary variant of the consensus task for the case of \(M=2\)
\({\Pi }_0\)  \(a_A\)  \(a_B\) 

(a) Baseline policy \({\Pi }_0\)  
\(s_{A0}\)  0.5  0.5 
\(s_{B0}\)  0.5  0.5 
\({\Pi }_1\)  \(a_A\)  \(a_B\) 

(b) Evolved unbiased policy \({\Pi }_1\)  
\(s_{A0}\)  0.002  0.998 
\(s_{B0}\)  0.999  0.001 
4.5 Analysis
The small size of the solution space for the variant of the consensus task studied in Sect. 4.4 provides an opportunity to analyze the evolved policy and the global fitness landscape in detail. The baseline policy and the evolved policies are given in Table 1a, b, respectively. In Table 1b, it can be seen that our evolutionary procedure determined that the best strategy would be for the robots to, almost deterministically, switch their opinion.^{6} We investigated whether this was an optimum solution by simulating the system for different probabilities of switching opinion. This was done using the same simulation setup as all evaluations in this section, whereby each setting was evaluated and averaged over 100 runs.
A first set of results for these simulations is shown in Fig. 6a, which shows the performance of the swarm in relation to the probability of switching. Here, it is confirmed that the policy is correct. As the probability of switching choices increases, the swarm is capable of achieving consensus more efficiently. It can also be seen that the evolved policy scales well with the size of the swarm.
We further investigated the entire solution space for a swarm of ten robots. Here, it is revealed that the evolved policy \({\Pi }_1\), as given in Table 1b, is not the global optimum, but the unbiased global optimum. By unbiased, we mean that the robots are equally probable of selecting option A as they are of selecting option B. However, more optimal options exist if the robots become biased toward one of the two options. This result is actually reflected in the PageRankbased fitness function with which the behavior was evolved, as we had provided the set \({\mathcal {S}}_{des} = \{s_{A1}, s_{B1}\}\) and gave equal importance to both options. It then follows that this brought us to the unbiased optimum.
To test whether our framework could adapt to this, we also evolved the policy for the cases where \({\mathcal {S}}_{des} = \{s_{A1}\}\) and where \({\mathcal {S}}_{des} = \{s_{B1}\}\). Note that we kept \({\mathcal {S}}_{static} = \{s_{A1}, s_{B1}\}\), because we still want the robots to stop if they are in agreement with their neighborhood. The results of these two evolutions are given in Table 2a, b for \({\mathcal {S}}_{des} = \{s_{A1}\}\) and \({\mathcal {S}}_{des} = \{s_{B1}\}\), respectively. The evolution discovered the biased optimum for both options. This shows a flexibility on the part of the algorithm to adapt to the desired goals. Originally, the algorithm had evolved the unbiased policy because it had been told that both options had equal importance. When this constraint was altered to a new desired goal, the algorithm evolved a policy which reflected the new goal.
Evolved policies with a bias toward options A or B
\({\Pi }_{bA}\)  \(a_A\)  \(a_B\) 

(a) Policy \({\Pi }_{bA}\) with bias to option A  
\(s_{A0}\)  0.9982  0.0018 
\(s_{B0}\)  0.9982  0.0018 
\({\Pi }_{bB}\)  \(a_A\)  \(a_B\) 

(b) Policy \({\Pi }_{bB}\) with bias to option B  
\(s_{A0}\)  0.0004  0.9996 
\(s_{B0}\)  0.0042  0.9958 
5 Sample task 2: pattern formation
This task deals with a swarm of robots with low awareness of their surroundings that must arrange into a desired spatial configuration. The robots are homogeneous (identical and without hierarchy), anonymous (do not have identities), reactive (memoryless), cannot communicate, do not have global position information, do not know the goal of the swarm, and operate asynchronously (i.e., by local clocks) in an unbounded space. The only knowledge available to the robots is: (1) a common heading direction, such as north, and (2) the relative location of their neighbors within a maximum range. In previous work (Coppola et al. 2019), we have developed a local behavior with which such limited robots can always eventually arrange into a global desired pattern. Moreover, we provided a set of conditions to verify whether a given behavior would always eventually lead to the emergence of the desired pattern.
One issue that remained to be solved was that even simple patterns were found to take up to hundreds of actions (cumulatively by all robots) to emerge, and this number appeared to grow with the complexity of the pattern and the size of the swarm. Although this is to be expected, in light of the limited knowledge on the part of the robots, it is an issue that cannot be ignored if the system is to be used on real robots with limited battery life and other realworld time constraints. Solving this via classical approaches, wherein the swarm had to be ultimately simulated in order to find an optimum policy, proved unscalable (Coppola and de Croon 2018). This was because of all the issues listed in the introduction of this paper. In this section, we show how we can circumvent the scalability problem by tackling the optimization using the PageRank algorithm.
5.1 Description of approach to pattern formation
Such that this manuscript may be selfcontained, this section summarizes the pattern formation methodology from Coppola et al. (2019), wherein a more detailed explanation can be found. For the sake of brevity, in this work we will assume that the swarm operates in a grid world and in discrete time. However, as demonstrated in Coppola et al. (2019), the behavior can also be used in continuous time and space with robots operating with local clocks.
5.1.1 Setting
The goal of the swarm is to rearrange from its initial arbitrary pattern \(P_0\) into a desired pattern \(P_{des}\). This is achieved using the following principle. The local states that the robots are in when \(P_{des}\) is formed are extracted, and form a set of local desired states \({\mathcal {S}}_{des}\subseteq {\mathcal {S}}\), as depicted by the examples in Fig. 9. These local desired states are extracted as the observations of the robots once the pattern is formed, similarly to how puzzle pieces form a puzzle. If robot \({\mathcal {R}}_i\) finds itself in any state \(s_i\in {\mathcal {S}}_{des}\), then it is instructed to not move, because, from its perspective, the goal has been achieved. Given a \(P_{des}\) and the corresponding \({\mathcal {S}}_{des}\), it can be automatically (and sometimes even intuitively) verified whether the local desired states will uniquely form \(P_{des}\), or whether they can also can give rise to spurious global patterns. If spurious global patterns are not possible, then, until \(P_{des}\) is formed, at least one robot will be in a state \(s\not \in {\mathcal {S}}_{des}\) and will seek to amend the situation. The swarm will then keep reshuffling unless \(P_{des}\) forms (Coppola et al. 2019).
5.1.2 Baseline behavior of the robots
When a robot \({\mathcal {R}}_i\) is in a state \(s_i\not \in {\mathcal {S}}_{des}\), it should execute an action from \({\mathcal {A}}\). From the state space and action space, we can formulate a stochastic policy \({\Pi }={\mathcal {S}} \times {\mathcal {A}}\). However, not all actions should be allowed. The actions that: a) cause collisions and b) cause local separation of the swarm are eliminated from \({\Pi }\), because they are not “safe.” Moreover, as explained in Sect. 5.1.1, all states \(s\in {\mathcal {S}}_{des}\) do not take any actions. From this, we extract a final stochastic policy \({\Pi }_0\), where \({\Pi }_0\subseteq {\Pi }\). When a robot is in a state s, it will use the policy \({\Pi }_0\) to randomly select one possible action from the available state–action pairings. In Coppola et al. (2019), the final action was chosen from the available options based on a uniform probability distribution.
It is important to note that, from this pruning process, there also emerge additional local states that cannot take any actions. A robot in such a state will not be able to move or else it will either collide with other robots or possibly cause separation of the swarm. We refer to such states as blocked states. The set of blocked states is denoted \({\mathcal {S}}_{blocked}\). The states in \({\mathcal {S}}_{des}\) and \({\mathcal {S}}_{blocked}\) are functionally equivalent. In either case, a robot will not take any action. Together, they form the umbrella set \({\mathcal {S}}_{static} = {\mathcal {S}}_{des} \cup {\mathcal {S}}_{blocked}\).
Furthermore, conceptually in contrast to static states, there are states where a robot will be capable of moving away from and/or around its neighborhood without issues. We call these states simplicial. Simplicial states are characterized by only having one clique, where we define a clique as a connected set of neighbors.^{7} The set of simplicial states is denoted \({\mathcal {S}}_{simplicial}\). Figure 10 shows examples of blocked states (Fig. 10a, b), a simplicial state (Fig. 10c), and a nonsimplicial state (Fig. 10d).
5.1.3 Modeling the local experiences of the robots
Let us now construct the graph \(G_{\mathcal {S}} = (V,E)\) that models the local experiences of the robots. The nodes of \(G_{\mathcal {S}}\) are the local states that the robots can be in, such that \(V={\mathcal {S}}\). The edges of \(G_{\mathcal {S}}\) are all possible transitions between local states.
\(G_{\mathcal {S}}^1\) indicates all state transitions that a robot could go through by an action of its own, based on \({\Pi }_0\).
\(G_{\mathcal {S}}^2\) indicates all state transitions that a robot could go through by an action of its neighbors, which could also move out of view.
\(G_{\mathcal {S}}^3\) indicates all state transitions that a robot could go through if another robot, previously out of view, were to move into view and become a new neighbor.
5.1.4 Verifying that the pattern will always eventually emerge
In Coppola et al. (2019), it was shown that, by analyzing certain properties of \(G_{\mathcal {S}}^1\), \(G_{\mathcal {S}}^2\), and \(G_{\mathcal {S}}^3\), it can be verified that the pattern \(P_{des}\) will eventually form starting from any initial pattern \(P_0\). This way, we can assess whether a policy is such that the final pattern will always eventually emerge, or whether it may cause the swarm to reshuffle endlessly without ever settling into the pattern. (We refer to this situation as a livelock.)
 1.
\(G_{\mathcal {S}}^1 \cup G_{\mathcal {S}}^2\) shows that each state in \({\mathcal {S}}\) features a path to each state in \({\mathcal {S}}_{des}\).
 2.
For all states \(s\in {\mathcal {S}}_{static}\cap {\mathcal {S}}_{\lnot simplicial}  s_{surrounded}\), none of the cliques of each state can be formed only by robots that are in a state \(s\in {\mathcal {S}}_{des}\cap {\mathcal {S}}_{simplicial}\). Here, \(s_{surrounded}\) is the state that is surrounded by neighbors along all directions.
 3.
\(G_{\mathcal {S}}^{2r}\) shows that all static states with two neighbors can directly transition to an active state.
 4.
\(G_{\mathcal {S}}^1\) shows that any robot in state \(s\in {\mathcal {S}}_{active}\cap {\mathcal {S}}_{simplicial}\) can travel around all its local neighbors, as exemplified in Fig. 10c (with the exception of when a loop is formed or when it enters a state \(s\in {\mathcal {S}}_{static}\)).
 5.
In \(G_{\mathcal {S}}^3\), any state \(s\in {\mathcal {S}}_{static}\) only has outward edges toward states \(s\in {\mathcal {S}}_{active}\) (with the exception of a state that is fully surrounded along two or more perpendicular directions).
These conditions are local in nature. They analyze the local states of a robot based on its limited sensing range and the actions that the robot could take as a result. The advantage of this is that checking whether the conditions are met is independent of the size of the swarm, avoiding the combinatorial explosion that would otherwise ensue. Note that the conditions are sufficient, but not necessary. Fulfilling them means that the pattern will be achieved, but not fulfilling them does not mean the pattern will not be achieved.^{8} The first four patterns in Fig. 9 pass the proof conditions. The line, instead, does not fulfill all conditions. Moreover, it can be subject to spurious patterns (for example, a slanted H) as a result of \({\mathcal {S}}_{static}\). A more thorough discussion on the patterns that can be generated using this approach can be found in the original paper.
This verification procedure, combined with the fact that the behavior deals with very limited robots (anonymous, homogeneous, memoryless, with limited range sensing, and without needing any communication, global knowledge, or seed robots) moving in space, sets the work by Coppola et al. (2019) apart from other works such as the ones of Klavins (2007), Yamins and Nagpal (2008), or Rubenstein et al. (2014). Furthermore, we remind the reader that here we are dealing with robots that should move in space such that the swarm arranges into a desired spatial configuration. This is different in nature from altering the internal states of the robots into a particular repeating pattern, such as in the work of Yamins and Nagpal (2008).
5.2 PageRank model
Concerning \(\alpha \), we follow the same definition as for the consensus task (see Eq. 9), whereby we define it based on the number of neighbors that a robot senses in a given state. This models how, with more neighbors, the robot is more likely to be subject to its environment. We once again have \(0\le p_{action}(s) \le 1\), where \(p_{action}(s)\) is the cumulative probability of taking an action when in state s. If \(p_{action}(s) < 1\), then there is also a probability that the robot will not take an action and that it will remain idle. In the example shown in Fig. 11a, for instance, we would have \(p_{action}(s_i) = p_l + p_r \le 1\). Here, \(p_l\) is the probability of taking an action to the left, and \(p_r\) is the probability of taking an action to the right, according to the policy that is being used.
5.3 Optimization strategy
 Phase 1: state–action pair removal from\({\Pi }_0\) In Phase 1, state–action pairs are eliminated from the stochastic policy \({\Pi }_0\) with the goal of maximizing F. The input to this phase is the baseline policy that is extracted following the methods described in Sect. 5.1. The output of this first phase is a stochastic policy \({\Pi }_1\subseteq {\Pi }_0\), whereby state–action pairs in \({\Pi }_0\) that do not help to achieve the pattern efficiently are automatically removed, while maximizing F as per Eq. 7. The optimization of Phase 1 is subject to the following two constraints:
 1.
The conditions from Sect. 5.1.4must be respected This constraint checks that the pattern will always eventually be formed from any initial configuration. The conditions can be checked based from the local graphs \(G_{\mathcal {S}}^1\), \(G_{\mathcal {S}}^2\), and \(G_{\mathcal {S}}^3\), and therefore, the time required to check them scales with the local statespace size, and not the size of the swarm.
 2.
The final pattern must remain the unique emergent pattern As state–action pairs are removed from \({\Pi }_0\), it may be that additional states behave like states in \({\mathcal {S}}_{static}\) and will not move. However, an important axiom needed to guarantee that \(P_{des}\) will always form is that, for a swarm of N robots, N instances of the local states in \({\mathcal {S}}_{static}\), with repetition, must uniquely rearrange into \(P_{des}\). If this is not the case, another pattern could emerge where all robots are in a state \({\mathcal {S}}_{static}\) and do not move. It must therefore be verified that \({\Pi }_0\) is not affected in such a way that this can happen. The method to verify this is described at the end of this subsection.
 1.
Phase 2: probability optimization In Phase 2, the probability of executing individual the state–action pairs in \({\Pi }_1\) is altered so as to maximize the fitness F. This phase parallels the final optimization step of Coppola and de Croon (2018), with the key difference being that we now evaluate the performance of swarm using the PageRankbased fitness function in Eq. 7, rather than by simulating the swarm. The output of this second phase is a stochastic policy \({\Pi }_2\).
5.4 Phase 1: genetic algorithm setup and results
Genome size during Phase 1 for the four patterns tested
Pattern  Triangle (four robots)  Hexagon  Triangle (nine robots)  T (12 robots) 

\({\Pi }_0 \)  187  486  531  525 
\({\Pi }_1 \)  66  174  176  276 
With this setup, five evolutionary runs were performed for the first four patterns of Fig. 9.^{9} The results of the evolutionary runs are shown in Fig. 12. Within 2000 generations, the fitness of all patterns approached convergence for all evolutionary runs, characterized by a steep increase in the first generations.
The performance of all the evolved policies was evaluated in simulation. The simulator was a direct implementation of the swarm on a grid as described in Sect. 5.1.1. Each policy was evaluated 100 times with the swarm starting from random initial configurations, with the only requirement that they would begin in a connected topology. The global metric of performance was the amount of actions taken cumulatively by all robots before the desired final pattern emerged. This was compared to the original unoptimized policy \({\Pi }_0\) of each pattern. The results are shown as histograms in Fig. 13. It can be seen that all policies are able to achieve the pattern and significantly outperform the original unoptimized policy by approximately one order of magnitude, showing a successful application of PageRank as an optimization parameter. The difference of the evolved performance from the original was confirmed by a bootstrap test with 10,000 repetitions, using the sample mean and the ratio of the variances as the reference statistics. All distributions were declared different over the 95% confidence interval. We now continue to Phase 2 in order to further improve the performance of the swarm.
5.5 Phase 2: genetic algorithm setup and results
 1.
The GA optimizes the probability of executing the state–action pairs from a stochastic policy \({\Pi }_1\), rather than determining which state–action pairs could be removed.
 2.
The constraints are no longer enforced (but, as none of the state–action pairs are removed, the provability is preserved).
The improvements achieved for the triangle with four robots and the hexagon appear marginal. For two policies evolved for triangle with four robots, a bootstrap test (based on the difference in sample means) could not reject the hypothesis that the distribution of the final performance is equal to the performance at the end of Phase 1. This was the case for also one of the policies evolved for the hexagon. However, this is a justified result, because for both the triangle with four robots and the hexagon, the performance after Phase 1 already approached the performance of a policy which was optimized with the fitness being evaluated through simulations, which we take to be a nearlimit performance. The evolutions of Phase 2 achieved policies that performed equivalently and, in some cases, seemingly even slightly better than the ones that were achieved in Coppola and de Croon (2018). In Fig. 15, it can be seen that a higher fraction of runs take fewer than ten actions and fewer than 100 actions for the triangle with four robots and the hexagon, respectively. These results show the power of PageRank’s framework in being able to describe the swarm up to the point that the achieved performance matches the performance of results that are achieved with a global evaluation procedure.
For the triangle with nine robots and the T with 12 robots, Phase 2 instead shows significant improvements over Phase 1. These are most noticeable for the T with 12 robots, which reaches an improvement in performance of one order of magnitude compared to the policies from Phase 1. Overall, this means that the policies following Phase 2 outperform the unoptimized policy by two orders of magnitude. We remind the reader that both of these patterns could never have been tackled if, instead of using PageRank, we had used simulation in order to assess the performance of the swarm. Simulations would have been infeasible in light of scalability issues, as was indeed the case in Coppola and de Croon (2018). Instead, these results have been achieved with the use of an entirely local evaluation.
5.6 Analysis
The results of Phase 1 and Phase 2 both show that the evolutions successfully lead to policies with high performance boosts, which even match the performance of policies that were optimized using a fitness function based on a global simulation of the swarm. In this section, we shall now aim to gain more insight into how the PageRankbased fitness function was correlated with the global performance of the swarm. As we are now dealing with a large policy with a large number of parameters (see Table 3), the entire fitness landscape cannot be studied. Instead, we evaluated the global performance at various stages of the Phase 1 and Phase 2 evolutions to show how the performance improvements are correlated with the PageRankbased fitness.
the hexagon, starting from around the 100th generation, for all runs.
the triangle with nine robots, where one of the evolutionary runs features a drastic spike in performance at approximately the 400th generation (after which it seems to recover).
 1.
The change in fitness is low compared to Phase 1 evolutions; hence, the effects of noise are more clearly noticeable.
 2.
The impact of changing probabilities in a stochastic policy is less direct than completely removing a state–action pair from the policy. Therefore, the changes have less impact on the results.
5.7 Further investigations
We conclude this section with a further investigation, whereby we explore whether placing more importance on certain desired states can be beneficial or destructive to the pattern formation task. This is relevant to patterns that feature a repeating desired state.
6 Sample task 3: aggregation
In this task, robots move within a closed arena with the goal of forming aggregates. Each robot can sense the location of neighbors within a range. This task is similar in nature to that of Correll and Martinoli (2011). Using PageRank, we optimize a stochastic policy that tells a robot whether it should perform a random walk or remain idle based on the number of neighbors that it can sense. The final policy is optimized without taking into account the lowerlevel implementation of the random walk, the number of robots in the swarm, the size of the environment, or the initial condition. Applying the framework to optimize a higherlevel policy such as this one provides further insight into the applicability and flexibility of the approach.
6.1 Task description
Consider N robots in a closed arena. The robots operate in continuous space (rather than a grid, as in previous tasks), and in continuous time, using local clocks. Each robot can sense the number of neighbors, m, within a limited range \(\rho _{\max }\). We assume that each robot is capable of sensing a maximum of M neighbors, after which the sensor saturates. Therefore, the local state space \({\mathcal {S}}\) for this task is given by all possible numbers of neighbors that can be sensed by the robot, namely \({\mathcal {S}} = \{0,1,\ldots ,M\}\). The sensor updates its reading with a time interval \(t_c\).
 1.
Perform/resume a random walk behavior. We denote this action as \(a_{walk}\).
 2.
Stop and remain idle. We denote this action as \(a_{idle}\).
6.2 PageRank model
To set up the PageRank framework, we begin by defining the active graph \(G_{\mathcal {S}}^a\), which describes the local state changes as a result of the actions of a robot.
Graph \(G_{\mathcal {S}}^p\) holds the transitions that a robot can experience regardless of its own choice to be idle or not. In this case, the robot can be in any state \(s\in {\mathcal {S}}\), at which point it could lose a neighbor or gain a neighbor at any point. The graph \(G_{\mathcal {S}}^p\) takes the form shown in Fig. 20b.
6.3 Genetic algorithm setup
We used the same GA setup as used for the consensus task in order to tune the probabilities of the policy \({\Pi }\) for this task. We considered the case where \(m_{\min }=2\), meaning that the robots have the goal of forming aggregates of three or more. Unlike other cases, we did not set the policy values for the states \({\mathcal {S}}_{des}\) to zero. Note that this is not necessary and actually makes the problem harder to solve, since more freedom is given to the optimizer with more parameters to be optimized. Setting the policy values for the states \(\mathcal {S}_{des}\) would have automatically forced a relative basic solution (i.e., “move when not in a group”) upon the behavior. Thus, we instead took this as an opportunity to see whether PageRank would be able to tune its weights to also inform the robots to remain in the states of \({\mathcal {S}}_{des}\), without us providing this knowledge a priori.
6.4 Simulation environment and evaluation results
When initiating a random walk, a robot could randomly select a direction \(\theta \) from a uniform distribution \(\left[ 0, ~2\pi \right] \) and pursue it with a reference speed v. The decision to continue the random walk or stop, commanded by the highlevel policy, was reassessed with an interval \(t_c\). If the robot chose to remain in a random walk, then it altered its current direction \(\theta \) by an angle \(\theta _d\). The angle \(\theta _d\) was extracted from a normal distribution with zero mean and a standard deviation \(\sigma _d\). In our implementation, we set \(v = 0.5\,\mathrm{m/s}\), \(t_c=2\,\mathrm{s}\), and \(\sigma _\theta = 0.3\,\mathrm{rad}\). The robots could sense their neighbors omnidirectionally up to a distance \(\rho _{\max } = 2\,\mathrm{m}\).
If a robot was at a distance \(d_{w}\) from a wall, it would set a velocity command away from the wall for a time \(t_w\). In our implementation, we set \(d_w = \rho _{\mathrm{max}} = 2\,\mathrm {m}\) and \(t_w = 2t_c\). Setting \(d_w = \rho _{\max }\) simulates the fact that the same sensor is used to sense both neighbors and the walls.
At all times, in direct addition to the velocity commands from the random walk behavior and wall avoidance, the robots were governed by attraction and repulsion forces. This handled collision avoidance. In our implementation, the attraction and repulsion behavior was tuned such that the equilibrium distance between robots was \(1.5\,\mathrm{m}\).
6.5 Analysis
As for the variant of the consensus task from Sects. 4.4 and 4.5, we took advantage of the small size of this policy in order to map and investigate the fitness landscape. The investigation was focused on testing the impact of the probability of moving within the arena when the robots did not have any neighbors and when the robot had one neighbor. Since the robots had the goal of being with two or more neighbors, these were the two main parameters of interest.
The simulation environment from Sect. 6.4 was used to evaluate the performance of the swarm against different policies, in order to construct the fitness landscape with respect to our global parameter of interest. The results were obtained by averaging 50 simulations for each setting, and repeated for swarms of 20 robots and 50 robots in a \(30\,\mathrm{m}\times 30\,\mathrm{m}\) arena. The maximum simulated time allowed for the task was set to 1000 s. The results are shown in Fig. 23a, c, respectively.
7 Discussion
In Sects. 5, 4, and 6, the PageRank framework has been applied to optimize the stochastic policy of robots in a swarm for three separate tasks. With the knowledge gained, this section summarizes the advantages of the proposed approach (Sect. 7.1), its current limitations, and possible solutions (Sect. 7.2) and proposes research directions that could lead to interesting extensions (Sect. 7.3).
7.1 Advantages and properties of the proposed approach
The number of robots in the swarm, which affects both the processing time required to simulate it and the time before which the global goal is achieved.
The size of the environment, which would alter the rate at which events occur. This would have been relevant for a task such as the aggregation task.
The initial conditions from which the swarm is simulated. To avoid over fitting to a particular initial condition, each controller would have had to be tested from several random initial conditions.
The number of robots used in the simulations. To avoid over fitting to swarms of a specific size, each controller would have had to be tested on swarms of different sizes.
The global environment. If evolving within a specific environment, the final solution may implicitly be fitted to it.
The wallclock time does not scale with the size of the swarm This can be appreciated directly from the results in this manuscript, such as for the pattern formation task, where we have managed to optimize the policy for patterns that we could not tackle using simulation to evaluate the fitness (Coppola and de Croon 2018). We have plotted the wallclock time required to evaluate the PageRankbased fitness for the pattern formation task in Fig. 24a. Each data point is averaged over 100 evaluations of PageRank for random instances of a stochastic policy. It can be seen that, once the maximum neighborhood size of 8 is fulfilled, the wallclock time stops scaling and it remains constant even if the number of robots in the swarm increases. Instead of scaling with global parameters, the wallclock time scales with the number of local states \({\mathcal {S}}\) and the possible transitions between them, as modeled by \(G_{\mathcal {S}}^a\) and \(G_{\mathcal {S}}^p\). To test the practical impact of this, we evaluated the wallclock time of the PageRankbased fitness function when varying the size of \({\mathcal {S}}\) (which also bring about more state transitions) for the separate tasks tested in this paper, averaged over 100 random policy iterations. The results are shown in Fig. 24b. It can be seen that the wallclock time grows with the size of \({\mathcal {S}}\). In comparison, the fitness wallclock time for simulations directly grows with the size of the swarm. This is shown for the pattern formation task from Fig. 25. Note that this is despite using the simple grid simulation environment as was used for the simulations in Sects. 5.4, 5.5, and 5.6. The performance of higher fidelity simulators would be worse.
The evolved policy scales well with the number of robots in the swarm The policy is evaluated from a graph for which the number of robots in the swarm is not modeled, only the probability that certain events in the neighborhood of a specific robot may take place. For example, for the consensus task, the same exact policy could be applied to swarms of different sizes (we tested with swarms of ten and 20 robots), from random initial configurations, and consistently provided efficient results orders of magnitude better than the baseline case.
The evolved policy is not tuned to any particular initial condition of the swarm The proposed PageRankbased fitness function (Eq. 7) evaluates the likelihood of achieving certain local states. However, as PageRank is evaluated for the entire graph simultaneously using the iterative procedure described in Sect. 3, the score does not assume any initial condition by the robot, and, by extension, the swarm. This makes it robust to the initial conditions. In all results shown in this paper, there was no initial condition for which the evolved solution made it impossible to achieve the final global goal or for which the performance did not improve.
The evolved policy is not tuned to any particular lowerlevel controller In all instances, a policy was evolved without any specification of lowerlevel parameters. This can be best appreciated for the aggregation task. Here, we evolved a highlevel policy without specifying the type of random walk that was going to be implemented, but only specifying the relative probability of encountering neighbors when a robot is moving. Had such a behavior been tuned in simulation, then it would have inherently adapted to the particular implementation of search behavior (e.g., wallfollowing, Lévy Flight, etc.), whereas this is not the case for our results, providing a more general controller. The relative probability of encountering neighbors subsumes a larger set of specific lowerlevel behaviors. Similarly, for the consensus task and the pattern formation task, the behavior is evolved without any particular specification of communication time, decision time, or motion controller. A simulationbased optimization might have adapted the probabilities to implicitly account for these specifications. Instead, the only concern for the PageRankbased optimization is with which probability it should transition between states.
The evolved policy is not tuned to any particular global environment PageRank’s framework only models the possible transitions that may happen in the environment, but not the entire environment itself. Therefore, as long as the possible local transitions that may take place in a global environment are sufficiently well represented, then the evolved policy remains robust to different environments. In some instances, the global environment may also be excluded altogether. This was the case for the consensus task and the pattern formation task.
7.2 Current limitations and potential solutions
Discrete state space and discrete action space The current version of the approach relies on a discrete set of local states and a discrete set of actions. This may be limiting for applications with continuous states. The limitation comes from the fact that we are using a tabular policy which maps discrete states to discrete actions, which then allows us to construct a graph with discrete nodes and edges. Although this is a limitation of the current form, future work can explore the possibilities of extending the approach to deal with continuous state spaces. A research direction that would be interesting to follow is the possibility to use basis functions to interpolate over continuous state spaces. This type of solution has been proposed in the past for other approaches that also natively rely on a tabular policy, such as Qlearning, in order to accommodate continuous state spaces and/or action spaces (Sutton et al. 1998; Busoniu et al. 2010).
“Missing out” on the global optimum Although the benefits of swarming are often credited as flexibility, robustness, and scalability (Hamann 2018), some applications may benefit from using policies which are tuned to a specific settings. For instance, if it is known a priori that a swarm of robots will always exclusively operate within a specific environment and be composed of a fixed number of robots, then it may be desired to find the global optimum for this particular setup. As an example, consider an aggregation task with a swarm of three robots. An optimal policy might feature that it is beneficial for two robots to wait for the third to come into their neighborhood. The number of robots would thus be implicitly encoded into the final policy. Instead, the locally evolved policy would not consider this option, as the number of robots would not be known nor modeled explicitly throughout the evolution. Although this means that the solution is likely more scalable, flexible, or robust, it does mean that it is less globally optimal for the specific case. Moreover, as the approach is local in nature, it is not possible to directly evaluate global swarm parameters such as the ideal number of robots for a given task.
The need to define desired local states A complexity and potential limitation of using our approach is the need to define the set \({\mathcal {S}}_{des}\). This is the set of locally desired states that form the global goal. For the three tasks shown in this paper, \({\mathcal {S}}_{des}\) was extracted with ease using intuition. Nevertheless, as task complexity grows, \({\mathcal {S}}_{des}\) may become more difficult to define. A potential solution to this problem could be to use an automatic method which is capable of inferring the local desired states by observing a global behavior. This is discussed further in Sect. 7.3.
Assumption of successful state transitions The microscopic model used to evaluate PageRank assumes that state transitions are successful in executing the policy. In reality, this may not always be the case in light of imperfect actuation or noisy sensors, which may cause unexpected state transitions that are not modeled. This limitation can be addressed at the robot level via proper filtering of sensory data as well as accurate lowerlevel control. From the policy optimization side, it would be interesting to see whether this issue can be tackled via an online refinement of the model, allowing the policy to adapt itself to the imperfections of the real world during deployment. We continue this discussion in Sect. 7.3.
7.3 Further potential extensions

Extension to online learning The current implementation uses an offline learning process, whereby the graphs \(G_{\mathcal {S}}^{a}\) and \(G_{\mathcal {S}}^{p}\) are devised based on a model of the robot and the task. There is also the possibility to extend the approach to online learning. The graphs \(G_{\mathcal {S}}^{a}\) and \(G_{\mathcal {S}}^{p}\) are based on the expected local states and transitions that the robot can experience. All these local states and all these transitions can be measured by the local sensors present on the robot. Therefore, this means that it is also possible for a robot to construct, expand, and/or tune the graphs \(G_{\mathcal {S}}^{a}\) and \(G_{\mathcal {S}}^{p}\) online directly based on its experiences during operation. In turn, it could then reoptimize its controller to suit these new graphs. For example, this could be done in order to automatically tune to the failure probability of certain actions using an online refinement of the transition rates. It can also be used to tune against “noisy” actuators which fail (with a given measurable probability) to follow through on an action, and either do not complete the action or create a different state transition than planned. Alternatively, it could adapt to systematic events in its environment, placing a higher probability on those transitions in the graph \(G_{\mathcal {S}}^{p}\). The parameter vector \({\varvec{\alpha }_\mathbf {v}}\) could also be estimated or refined online. Thanks to the generally low computational efforts required to run the optimization (notice the wallclock times in Fig. 24), the optimization could also be performed onboard of a robot without requiring excessive computational power.

Alternative fitness functions The fitness function proposed in this work (Eq. 7) focused on maximizing the probability that a robot in the swarm would be in a desired state. The hypothesis that this would lead to global level improvements was found to be successful for each of the tested scenarios. Further investigations led us to the discovery that there may also be correlations to other parameters stemming from PageRank. For instance, using the results of the analysis from the pattern formation task from Sect. 5.6, we have found that the variance of the centrality also appears to show a correlation with the global performance of the swarm. In future work, it would be interesting to see whether, and how, alternative fitness functions could be constructed in order to provide even larger global level benefits.

Extension to heterogeneous swarms This paper has tested the algorithm on homogeneous swarms of robots with one global goal. However, we wish to note that swarm homogeneity is not a direct limitation of the proposed algorithm. In fact, the “passive” graph \(G_{\mathcal {S}}^p\), which models the environment, already does not assume that the neighboring robots operate with the same policy. (Rather, it assumes that all possible changes in the environment are equally probable.) For a swarm of functionally heterogeneous robots, we anticipate the main difficulty to be in defining concurrent sets of desired states for each type of robot, which all combine toward a greater global goal.

Including transitions with a probability of failure An assumption that was taken for the tasks in this work was that, if a robot decides to perform an action, then it will always succeed. In reality, however, this may not always be the case, as certain state–action pairs may have a lower success rate than anticipated in the real world. However, this failure probability can be directly factored in within the transition probabilities of the graphs \(G_{\mathcal {S}}^a\) and \(G_{\mathcal {S}}^p\). On this note, it could also be possible for a robot to do this online while performing the task.

Automatic extraction of desired states As stated in Sect. 7.2, a limitation of the approach is the need to explicitly define the locally desired states. It can be postulated that this may not always be intuitively done by a swarm designer, particularly as task complexity increases. However, it can be argued that most (if not all) tasks require that some local states are more “desired” than others, and that these must be locally observable by the robot (even if not apparent, or understood, by the designer)—the challenge is to find them. In future work, this challenge could be solved by a multilayered learning approach, whereby a higherlevel learner is tasked with extracting the desired local states for a given task.
8 Conclusion
The framework proposed in this paper is based on the idea to microscopically model the local experiences of a robot in a swarm by means of graphs. It then becomes possible to evaluate the PageRank centrality of the local states and use this as a measure of the global behavior of the swarm by evaluating the relative PageRank of local states of interest. This can be used within an optimization procedure in order to optimize the global behavior of swarms while only requiring a local analysis.
This technique provides several advantages which natively include flexibility, robustness, and scalability, which are three key properties of swarm robotics. We have applied it to a pattern formation task, a consensus task, and an aggregation task. The time needed to perform the optimization did not scale with the size of the swarm, which allowed us to tackle swarm sizes that could not be tackled if simulation had been our means of assessment. Moreover, the results obtained were natively independent of global parameters such as the initial starting condition, the environment, or the number of robots in the swarm. The new insights presented in this paper offer a new strategy to evaluate the final behavior of groups from the perspective of the individual. We expect that the approach can generalize to several other tasks, whereby a robot in the swarm and its environment is modeled in analogy to how a user surfs the World Wide Web.
Footnotes
 1.
An alternative to reevaluation is to vary other parameters. For instance, one could simulate once but for a longer time (Di Mario and Martinoli 2014), although this is applicable to continuing task and not to tasks with a definite global goal.
 2.
In the work by Winfield et al. (2008), the swarm operated in an unbounded arena. Instead, it was assumed that neighbors of each robot would be equally dispersed within the robot’s sensing and communication range. This was shown to be reliable up to a certain extent, in part thanks to the behavior that was implemented.
 3.
The interested reader is referred to the work of Fornito et al. (2016), where a summary and comparison of several centrality measures are provided.
 4.
For more details, including sample code and even some humorous anecdotes, we refer the reader to the book “Google’s PageRank and Beyond: The Science of Search Engine Rankings” by Langville and Meyer (2006).
 5.
A stochastic matrix holds the probability of transitioning between nodes as it would be described by Markov chains. It follows that the sum of each row must be equal to 1. A matrix \(\mathbf {A}\) is primitive if \(\exists k\forall (i,j):A_{ij}^k > 0\). A primitive matrix is both irreducible and aperiodic. Irreducible means that any state in a Markov chain is reachable from any other state. Aperiodic means that there is no set period for returning to any given state. These properties allow the iterative algorithm to converge.
 6.
We attribute the fact that the solution is nearly deterministic, and not fully deterministic, to our GA implementation.
 7.
If the neighbors of a robot form only one clique, then this means that, if this robot were to disappear, it is guaranteed that its neighbors would all remain connected among each other.
 8.
To this point, Condition 4 was made slightly more lenient than from what is expressed in (Coppola et al. 2019). The original version checks that a robot can travel to all open positions surrounding all of its neighbors. This is a strongly restrictive rule that heavily limits optimization. Therefore, we limited to checking that a robot could travel about all its neighbors, but not to all open grid positions around all its neighbors.
 9.
As explained in Sect. 5.1.4, the line does not pass the verification procedure. It thus cannot be optimized in Phase 1. For this reason, it will be directly optimized using the GA of Phase 2, which does not directly enforce any constraints. The results for the line are shown separately in Sect. 5.6.
 10.
Unfortunately, we cannot show this analysis for the T with 12 robots. This is because evaluating early generations in simulation is too computationally expensive. This serves as another reminder of how infeasible it would have been to try and optimize the behavior of the swarm using simulations to assess the fitness. Only the results before and after the evolutions have been evaluated, which can be appreciated in Figs. 13d and 15d.
 11.
Note: creating a line along northeast is subject to both spurious patterns and livelocks. These, which happened on occasion, have been removed from the results shown in these figures, as they are outside the scope of this discussion.
Notes
References
 Berman, S., Halász, Á., Kumar, V., & Pratt, S. (2007). Algorithms for the analysis and synthesis of a bioinspired swarm robotic system. In E. Şahin, W. M. Spears, & A. F. T. Winfield (Eds.), Swarm robotics (pp. 56–70). Heidelberg: Springer.Google Scholar
 Berman, S., Halasz, A., Hsieh, M. A., & Kumar, V. (2009). Optimized stochastic policies for task allocation in swarms of robots. IEEE Transactions on Robotics, 25(4), 927–937.Google Scholar
 Berman, S., Nagpal, R., Halász, A. (2011). Optimization of stochastic strategies for spatially inhomogeneous robot swarms: A case study in commercial pollination. In 2011 IEEE/RSJ international conference on intelligent robots and systems (pp. 3923–3930).Google Scholar
 Brin, S., Page, L. (1998). The anatomy of a largescale hypertextual web search engine. In Seventh international worldwide web conference (WWW 1998).Google Scholar
 Busoniu, L., Babuska, R., Schutter, B. D., & Ernst, D. (2010). Reinforcement learning and dynamic programming using function approximators. Boca Raton: CRC Press Inc.zbMATHGoogle Scholar
 Campo, A., & Dorigo, M. (2007). Efficient multiforaging in swarm robotics. In F. Almeida e Costa, L. M. Rocha, E. Costa, I. Harvey, & A. Coutinho (Eds.), Advances in artificial life (pp. 696–705). Berlin: Springer.Google Scholar
 Coppola, M., & de Croon, G. C. H. E. (2018). Optimization of swarm behavior assisted by an automatic local proof for a pattern formation task. In M. Dorigo, M. Birattari, C. Blum, A. L. Christensen, A. Reina, & V. Trianni (Eds.), Swarm intelligence (pp. 123–134). Cham: Springer.Google Scholar
 Coppola, M., Guo, J., Gill, E., & de Croon, G. C. H. E. (2019). Provable selforganizing pattern formation by a swarm of robots with limited knowledge. Swarm Intelligence, 13(1), 59–94.Google Scholar
 Correll, N., & Martinoli, A. (2006). Collective inspection of regular structures using a swarm of miniature robots. In M. H. Ang & O. Khatib (Eds.), Experimental robotics IX: The 9th international symposium on experimental robotics (pp. 375–386). Heidelberg: Springer.Google Scholar
 Correll, N., & Martinoli, A. (2011). Modeling and designing selforganized aggregation in a swarm of miniature robots. The International Journal of Robotics Research, 30(5), 615–626.Google Scholar
 Di Mario, E., & Martinoli, A. (2014). Distributed particle swarm optimization for limitedtime adaptation with real robots. Robotica, 32(2), 193–208.Google Scholar
 Di Mario, E., Navarro, I., Martinoli, A. (2015a). A distributed noiseresistant particle swarm optimization algorithm for highdimensional multirobot learning. In 2015 IEEE international conference on robotics and automation (ICRA) (pp. 5970–5976).Google Scholar
 Di Mario, E., Navarro, I., Martinoli, A. (2015b). Distributed particle swarm optimization using optimal computing budget allocation for multirobot learning. In 2015 IEEE congress on evolutionary computation (CEC) (pp. 566–572).Google Scholar
 Duarte, M., Costa, V., Gomes, J., Rodrigues, T., Silva, F., Oliveira, S. M., et al. (2016). Evolution of collective behaviors for a real swarm of aquatic surface robots. PLOS One, 11, 1–25.Google Scholar
 Ericksen, J., Moses, M., Forrest, S. (2017). Automatically evolving a general controller for robot swarms. In 2017 IEEE symposium series on computational intelligence (SSCI) (pp. 1–8).Google Scholar
 Ferrante, E., Duéñez Guzmán, E., Turgut, A. E., Wenseleers, T. (2013). GESwarm: Grammatical evolution for the automatic synthesis of collective behaviors in swarm robotics. In Proceedings of the 15th annual conference on genetic and evolutionary computation (GECCO) (pp. 17–24). New York: ACM.Google Scholar
 Fornito, A., Zalesky, A., & Bullmore, E. T. (2016). Centrality and hubs (pp. 137–161). San Diego: Academic Press.Google Scholar
 Francesca, G., & Birattari, M. (2016). Automatic design of robot swarms: Achievements and challenges. Frontiers in Robotics and AI, 3, 29.Google Scholar
 Francesca, G., Brambilla, M., Brutschy, A., Garattoni, L., Miletitch, R., Podevijn, G., et al. (2015). Automodechocolate: Automatic design of control software for robot swarms. Swarm Intelligence, 9(2), 125–152.Google Scholar
 Gomes, J., Urbano, P., & Christensen, A. L. (2012). Introducing novelty search in evolutionary swarm robotics. In M. Dorigo, M. Birattari, C. Blum, A. L. Christensen, A. P. Engelbrecht, R. Groß, & T. Stützle (Eds.), Swarm intelligence (pp. 85–96). Heidelberg: Springer.Google Scholar
 Gomes, J., Urbano, P., & Christensen, A. L. (2013). Evolution of swarm robotics systems with novelty search. Swarm Intelligence, 7(2), 115–144.Google Scholar
 Hamann, H. (2018). Swarm robotics: A formal approach. Berlin: Springer.Google Scholar
 Hamann, H., & Wörn, H. (2008). A framework of spacetime continuous models for algorithm design in swarm robotics. Swarm Intelligence, 2(2), 209–239.Google Scholar
 Hamann, H., Valentini, G., Khaluf, Y., & Dorigo, M. (2014). Derivation of a micromacro link for collective decisionmaking systems. In T. BartzBeielstein, J. Branke, B. Filipič, & J. Smith (Eds.), Parallel problem solving from nature—PPSN XIII (pp. 181–190). Cham: Springer.Google Scholar
 Hsieh, M. A., Halász, Á., Berman, S., & Kumar, V. (2008). Biologically inspired redistribution of a swarm of robots among multiple sites. Swarm Intelligence, 2(2), 121–141.Google Scholar
 Hüttenrauch, M., Šošić, A., Neumann, G. (2017). Guided deep reinforcement learning for swarm systems. ArXiv:1709.06011.
 Izzo, D., Simões, L. F., & de Croon, G. C. H. E. (2014). An evolutionary robotics approach for the distributed control of satellite formations. Evolutionary Intelligence, 7(2), 107–118.Google Scholar
 Jones, S., Studley, M., Hauert, S., & Winfield, A. (2018). Evolving behaviour trees for swarm robotics (pp. 487–501). Cham: Springer.Google Scholar
 Klavins, E. (2007). Programmable selfassembly. IEEE Control Systems, 27(4), 43–56.Google Scholar
 Langville, A. N., & Meyer, C. D. (2006). Google’s PageRank and beyond: The science of search engine rankings. Princeton: Princeton University Press.zbMATHGoogle Scholar
 Lerman, K., Galstyan, A., Martinoli, A., & Ijspeert, A. (2001). A macroscopic analytical model of collaboration in distributed robotic systems. Artificial Life, 7(4), 375–393.Google Scholar
 Lerman, K., Martinoli, A., & Galstyan, A. (2005). A review of probabilistic macroscopic models for swarm robotic systems. In E. Şahin & W. M. Spears (Eds.), Swarm robotics (pp. 143–152). Heidelberg: Springer.Google Scholar
 Martinoli, A., & Easton, K. (2003). Modeling swarm robotic systems. In B. Siciliano & P. Dario (Eds.), Experimental robotics VIII (pp. 297–306). Heidelberg: Springer.Google Scholar
 Martinoli, A., Easton, K., & Agassounon, W. (2004). Modeling swarm robotic systems: A case study in collaborative distributed manipulation. The International Journal of Robotics Research, 23(4–5), 415–436.Google Scholar
 Nelson, A. L., Barlow, G. J., & Doitsidis, L. (2009). Fitness functions in evolutionary robotics: A survey and analysis. Robotics and Autonomous Systems, 57(4), 345–370.Google Scholar
 Nolfi, S. (2002). Power and the limits of reactive agents. Neurocomputing, 42(1–4), 119–145.zbMATHGoogle Scholar
 Page, L., Brin, S., Motwani, R., Winograd, T. (1999). The PageRank citation ranking: Bringing order to the web. Technical report 199966, Stanford InfoLab.Google Scholar
 Prorok, A., Correll, N., & Martinoli, A. (2011). Multilevel spatial modeling for stochastic distributed robotic systems. The International Journal of Robotics Research, 30(5), 574–589.Google Scholar
 Reina, A., Miletitch, R., Dorigo, M., & Trianni, V. (2015). A quantitative micromacro link for collective decisions: The shortest path discovery/selection example. Swarm Intelligence, 9(2), 75–102.Google Scholar
 Rubenstein, M., Cornejo, A., & Nagpal, R. (2014). Programmable selfassembly in a thousandrobot swarm. Science, 345(6198), 795–799.Google Scholar
 Scheper, K. Y. W., de Croon, G. C. H. E. (2016). Abstraction as a mechanism to cross the reality gap in evolutionary robotics. In E. Tuci , A. Giagkos, M. Wilson, J. Hallam (Eds.), From animals to animats 14: 14th international conference on simulation of adaptive behavior, SAB 2016, Aberystwyth, UK, August 23–26, 2016, proceedings (pp. 280–292). Cham: Springer.Google Scholar
 Scheper, K. Y. W., Tijmons, S., de Visser, C. C., & de Croon, G. C. H. E. (2016). Behavior trees for evolutionary robotics. Artificial Life, 22(1), 23–48.Google Scholar
 Silva, F., Duarte, M., Correia, L., Oliveira, S. M., & Christensen, A. L. (2016). Open issues in evolutionary robotics. Evolutionary Computation, 24(2), 205–236.Google Scholar
 Sutton, R. S., Barto, A. G., et al. (1998). Introduction to reinforcement learning (Vol. 135). Cambridge: MIT Press Cambridge.zbMATHGoogle Scholar
 Trianni, V., Nolfi, S., & Dorigo, M. (2006). Cooperative hole avoidance in a swarmbot. Robotics and Autonomous Systems, 54(2), 97–103.Google Scholar
 Winfield, A. F. T., Liu, W., Nembrini, J., & Martinoli, A. (2008). Modelling a wireless connected swarm of mobile robots. Swarm Intelligence, 2(2), 241–266.Google Scholar
 Yamins, D., Nagpal, R. (2008). Automated globaltolocal programming in 1D spatial multiagent systems. In Proceedings of the 7th international joint conference on autonomous agents and multiagent systems (AAMAS) (vol. 2, pp. 615–622). Richland: International Foundation for Autonomous Agents and Multiagent Systems.Google Scholar
Copyright information
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.