The PageRank algorithm as a method to optimize swarm behavior through local analysis

This work proposes PageRank as a tool to evaluate and optimize the global performance of a swarm based on the analysis of the local behavior of a single robot. PageRank is a graph centrality measure that assesses the importance of nodes based on how likely they are to be reached when traversing a graph. We relate this, using a microscopic model, to a random robot in a swarm that transitions through local states by executing local actions. The PageRank centrality then becomes a measure of how likely it is, given a local policy, for a robot in the swarm to visit each local state. This is used to optimize a stochastic policy such that the robot is most likely to reach the local states that are “desirable,” based on the swarm’s global goal. The optimization is performed by an evolutionary algorithm, whereby the ﬁtness function maximizes the PageRank score of these local states. The calculation of the PageRank score only scales with the size of the local state space and demands much less computation than swarm simulations would. The approach is applied to a consensus task, a pattern formation task, and an aggregation task. For each task, when all robots in the swarm executetheevolvedpolicy,theswarmsigniﬁcantlyoutperformsaswarmthatusesthebaseline policy.Whencomparedtogloballyoptimizedpolicies,theﬁnalperformanceachievedby theswarmisalsoshowntobecomparable.Asthisnewapproachisbasedonalocalmodel, itnativelyproducescontrollersthatareﬂexibleandrobusttoglobalparameterssuchasthe numberofrobotsintheswarm,theenvironment,andtheinitialconditions.Furthermore,as the wall-clock time to evaluate the ﬁtness function does not scale with the size of the swarm, it is possible to optimize for larger swarms at no additional computational expense.


Introduction
Machine learning techniques are a powerful approach to develop swarm behaviors.Evolutionary algorithms, for instance, can efficiently explore a solution space and extract viable local behaviors that fulfill a desired global goal (Nolfi 2002;Francesca and Birattari 2016).
B M. Coppola m.coppola@tudelft.nlExtended author information available on the last page of the article 123 They have been used on numerous architectures, including: neural networks (Izzo et al. 2014;Duarte et al. 2016), state machines (Francesca et al. 2015), behavior trees (Scheper et al. 2016;Jones et al. 2018), and grammar rules (Ferrante et al. 2013).A bottleneck of these algorithms is in the need to evaluate how the whole swarm performs against each controller that they generate.Because of the complexity of swarms and the difficulty in predicting the global outcome, a full simulation of the entire swarm is carried out each time.This is subject to scalability issues as the size of the swarm increases, for example: 1.The computational load required to execute the simulations increases with the size of the swarm; 2. It may take longer for the desired behavior to emerge, requiring a longer simulation time for each evaluation trial, especially in the initial stages of the evolution; 3. The evolved policy may be over-fitted to the global parameters used during the simulation, such as the number of robots, the initial conditions, or the environment; 4. A solution needs to be simulated multiple times in order to reliably assess the expected performance of a given behavior (Trianni et al. 2006).Avoiding re-evaluation may result in poor behaviors being erroneously assigned a higher fitness thanks to one lucky run, which may ultimately result in a performance drop (Di Mario et al. 2015a, b);1 5.The evolution may be subject to bootstrap issues (Silva et al. 2016;Gomes et al. 2013).
In this work, in order to tackle these scalability problems, we introduce a new approach to the field of swarm robotics: PageRank (Brin and Page 1998;Page et al. 1999).PageRank is a graph centrality and node ranking algorithm.It was originally developed by Sergey Brin and Larry Page as part of Google™.Its objective was to rank the importance of Webpages based on the hyperlink structure of the World Wide Web.PageRank's philosophy was to model the browsing behavior of a user who surfs the Web by randomly clicking through hyperlinks and to measure the value of Webpages based on how likely it would be for this user to visit them.In this paper, we port this idea to the world of swarm robotics.Here, a robot in a swarm becomes analogous to a Web surfer.The robot moves through local states by taking actions, much like a Web surfer navigates through Webpages by clicking hyperlinks.With PageRank centrality, we can then evaluate the relative likelihood with which the robot will end up in the local states.Then, with the knowledge that a desired global goal is more likely to be achieved when the robots are (or pass through) a given set of local states, we can efficiently quantify the global performance of the swarm in achieving its goal.More specifically, we propose a fitness function, based on PageRank, that can assess the global performance of a swarm while only evaluating the local model of a single robot in the swarm.This micro-macro link frees us from the need to simulate the swarm.Due to the local nature of this approach, the evaluation is independent from global parameters such as the size of the swarm, the initial condition, the environment, or lower-level controllers.The introduction of this method is the main contribution of this paper.We will showcase its potential by applying it to optimize the local behavior for three different swarming tasks: (1) consensus agreement, (2) pattern formation, and (3) aggregation.
We begin the paper by discussing related work in Sect. 2. Here, we place our contribution within the context of other solutions found in the literature which also had the aim of tackling scalability issues.We further compare our use of a PageRank-based microscopic model to other swarm modeling approaches.In Sect.3, we then detail how PageRank works and explain how it can be applied to model, assess, and optimize the performance of a robotic swarm.The approach is then directly applied to optimize the behavior of three swarming tasks.
Consensus agreement (Sect. 4) In this task, we optimize the behavior of a swarm that must achieve consensus between multiple options.Each robot can sense the opinion of its neighbors and, based on a stochastic policy, decide whether it should change its opinion (and, if so, what to change its opinion to).Using PageRank, we optimize the stochastic policy so as to help the swarm achieve consensus as quickly as possible.The policy is optimized independently of the size of the swarm or its spatial configuration.We further optimize a more limited variant of this task whereby robots can not sense the opinion of their neighbors, but can only sense whether they are in agreement with their neighborhood or not.Pattern formation (Sect.5) For this task, we optimize the performance of a swarm of robots with the global goal of arranging into a desired spatial configuration.The robots in the swarm have very limited knowledge of their surroundings.This section extends our recent work published in Coppola and de Croon (2018), where we first attempted to optimize the behavior of a swarm in a pattern formation task, yet quickly encountered scalability problems for larger swarms/patterns.The scalability problems were a result of the fact that the swarm had to be simulated in order to assess the efficacy of a controller.This was infeasible for larger swarms, especially in early generations where performance is poor.Using the PageRank algorithm, we can now tackle these scalability issues, as it is no longer needed to simulate the swarm in order to assess the fitness of a behavior.Aggregation (Sect. 6) In this task, we study a swarm of robots in a closed arena which should aggregate in groups of three or more.In comparison with the other two tasks, this optimization tunes a higher-level policy featuring two sub-behaviors (random walk and stop) with a probability that is dependent on the number of neighbors that a robot in the swarm can sense.The final achieved policy allows the swarm as a whole to aggregate successfully.
In Sect.7, we discuss our general findings, including an analysis of the strengths and the current limitations of using PageRank centrality as a behavior optimization tool for swarm robotics.Section 8 provides concluding remarks.

Context and related work
In state of the art, the problems of scalability discussed in the introduction have mostly been tackled in two ways.First, there are methods that try to deal with the broad solution space that comes as the number of robots increases.For example, Gomes et al. (2012) used novelty search to encourage a broader exploration of the solution space.The second way is to use global insights to aid the evolutionary process.For example, Duarte et al. (2016) partitioned complex swarm behavior into simpler sub-behaviors.Hüttenrauch et al. (2017), with a focus on deep reinforcement learning, used global information to guide the learning process toward a solution.Alternatively, Trianni et al. (2006) and Ericksen et al. (2017) explored whether evolved behaviors for smaller swarms could generalize to larger swarms.In all cases, the need to simulate the swarm remains a bottleneck, for both evaluating the behavior and generalizing the behavior beyond the parameters used in simulation.In this work, we offer a different solution which discards simulation and exploits a micro-macro link based on evaluating the relative PageRank score between local states using only a local model of a single robot in the swarm.It extracts performance parameters without simulation or propagation from an initial condition and only relies on an analysis of the model for a given policy.
This approach differs from other swarm modeling and evaluation solutions found in the literature, such as the multi-level modeling framework introduced by Lerman et al. (2001), Martinoli and Easton (2003), and Martinoli et al. (2004).There, the idea is to model the evolution of a swarm via probabilistic final state machines propagated in time.At the microscopic level, one probabilistic finite state machine is propagated from an initial condition for each robot in the swarm.At the macroscopic level, which can be readily abstracted from the microscopic model, the finite state machine describes the mean transition of robots between states.The macroscopic model probabilistically describes, at the global level, how many robots are in each state at a given point in time.This can also be expressed in terms of rate equations.For each level, the relevant transition rates between states are extrapolated from an analysis of the policy, as well as from geometric reasoning (e.g., based on the size of the arena and the expected robot density that ensues) (Martinoli et al. 2004) or from empirical data (Berman et al. 2007).To predict the global evolution of the swarm, the models are propagated in time from an initial condition, essentially simulating the swarm at an abstract level.For certain tasks, these models have been shown to be highly effective in predicting the general behavior of a swarm (Lerman et al. 2005).However, their accuracy and applicability are limited by the validity of the global assumptions which define the transition rates.For instance, to estimate the probability of interaction between any two robots, one assumption is that the system is "well mixed," meaning that the robots are all equally distributed within an area at all times and always have the same probability of encountering a neighbor.For reasons such as this, their use has been largely limited to the evaluation of tasks with few states and by swarms in bounded arenas.Examples where these models have been used are the modeling of collective decision making (Hamann et al. 2014;Reina et al. 2015), area exploration/foraging (Correll and Martinoli 2006;Campo and Dorigo 2007), or keeping an aggregate (Winfield et al. 2008). 2 For other tasks where the level of detail required is higher and/or where the global goal is achieved by a combination of several states, rather than all robots being in one state, this approach does not provide a sufficient level of detail.One such example is pattern formation, as treated in this paper, whereby the global goal is to achieve a particular spatial arrangement which cannot be described by such models.
Another class of macroscopic models, focused on spatial movements, models a swarm by using a diffusion model based on the Fokker-Planck equation (Hamann and Wörn 2008).This approach macroscopically models the general motion of a swarm under the assumption of Brownian motion.The Fokker-Planck equation provides an estimate for the density of robots in the environment as a result of the robots' motion.Prorok et al. (2011) explored the use of rate equations together with a diffusion model.This made it possible to study swarms in more complex environments where the swarm may not be well mixed, which otherwise causes drift errors (Correll and Martinoli 2006).However, there were still limitations on how to approximate more complex behaviors by the robots, which, by the nature of the assumption of Brownian motion, were limited to random walks.
The use of macroscopic models to optimize the behavior of robots in a swarm was explored by Berman et al. (2007Berman et al. ( , 2009Berman et al. ( , 2011) ) using models similar to those of the rate equations by Martinoli and Easton (2003).In these models, the fraction of robots in each state is modeled together with the rates at which the robots will transition between these states.It is then possible to optimize the transition rates such that the macroscopic model, on average, shows that the swarm settles in certain states of interest.In (Berman et al. 2009) this is a done for the problem of task allocation, where it is also proven that all robots in the swarm will settle to a desired global equilibrium.However, as for rate equations, this approach is limited by the global level assumptions that are being taken in order to describe the mean of the swarm (Berman et al. 2011).Moreover, the outcome from such approaches can also be dependent on the initial condition of the swarm (Hsieh et al. 2008).
Macroscopic approaches thus make it possible to approximate how a swarm can evolve in time.They purposely refrain from incorporating the detailed experience of a single robot and rather approximate the mean evolution of the entire swarm.This makes them effective prediction tools, but they are limited in their ability to describe interactions to a higher level of detail.In contrast to multi-level models, the PageRank framework captures state transitions probabilistically, rather than temporally.We can apply this to analyze the impact of the actions by a single robot in a dynamic environment.This makes it able to tackle more specific tasks such as consensus agreement or pattern formation, with results that are found to be scalable, flexible, and robust to initial conditions or the number of robots.

PageRank centrality as a micro-macro link for swarm robotics
Centrality measures assess the relative importance of nodes in a graph based on the graph's topology.Several of these measures exist, which capture centrality from different perspectives.For example, degree centrality computes the importance of nodes based on the number of edges connected to them.Alternatively, closeness centrality measures the average shortest path between a node and all other nodes.3This paper deals with PageRank centrality.This is a graph centrality measure for directed graphs that measures the importance of each node recursively.This means that the PageRank centrality of a node is a function of the centrality of the nodes pointing to it, the centrality of the nodes pointing to those nodes, and so forth.This recursiveness indirectly accounts for the topology of the entire network, and it models how likely the node is to be reached when traversing the graph.
In this section, we detail how the PageRank centrality algorithm works (Sect.3.1) and then explain how it can be used to microscopically model the possible state transitions of a robot in a swarm (Sect.3.2).In Sect.3.3, we then introduce a PageRank-based fitness function.This fitness function assesses, at a local level, a swarm's ability to achieve a desired global goal.It will be used to optimize the behavior for all tasks treated in this paper.

A review of PageRank centrality
Consider an arbitrary graph G = (V , E) with nodes V and edges E. Let u ∈ V be an arbitrary node in the graph.Following Page et al. (1999), a simplified expression for the PageRank R(u) of a node u ∈ V can be expressed as where B u is the set of all nodes pointing to u, N v is the number of outgoing edges of node v, and R(v) is the PageRank of node v. Equation 1 serves to show the basic concept behind PageRank: The PageRank of a node is a function of the PageRank of the nodes pointing to it.This means that being pointed to by a more important node will provide a node with a higher PageRank.In the case of the World Wide Web, for instance, this reflects how being linked to by a popular Webpage (whereby its popularity is also established in the same way) will then be evaluated as being more valuable than being linked to by a niche Webpage.PageRank can be calculated simultaneously for all nodes using an iterative procedure.Here, we briefly review its key elements. 4Let R be a vector that holds the PageRank of all nodes in V .
R is obtained once Eq. 2 converges such that where k is the iteration step and ε is a threshold (in this work we used ε = 10 −8 ).Equation 2can be shown to converge quickly provided that the matrix G is stochastic and primitive (Langville and Meyer 2006). 5G is known as the "Google matrix."In its full form, it is defined as where: -H is the adjacency matrix of graph G. H is a sub-stochastic transition probability matrix that describes the probability of transitioning between the nodes of graph G.For the example of the World Wide Web, H holds the hyperlink structure of the Web and it models how a Web surfer can navigate the Web by using hyperlinks.-D is the dangling node matrix.The dangling nodes of graph G are nodes with no outgoing edges (e.g., no hyperlinks), which result in empty rows in H.For the World Wide Web, if a user reaches a Webpage with no hyperlinks, then the user will resort to writing the name of a Webpage in the address bar.D describes these transitions and the probabilities thereof.The combined matrix S = H + D is a stochastic matrix.-E is known as the teleportation matrix.This is an additional element that describes random transitions between nodes that are not captured by the topology of the graph G.
For the World Wide Web, these transitions model how a Web surfer may choose, at any moment and on any Webpage, to manually type the address of a Webpage in the address bar instead of clicking through hyperlinks.In this case, the user will "teleport" to another Webpage regardless of whether a hyperlink to that Webpage exists.Note that matrices D and E have a non-null intersection.The information in D is included in E. Both matrices describe transitions that occur as a result of a user typing the name of a Webpage in the address bar, except that D only holds those transitions for the cases where the user, having reached a dangling node, has no other option than to teleport.-α is known as the expansion factor, where 0 ≤ α ≤ 1, which models the probability that a user follows hyperlinks as opposed to accessing a Webpage directly via the address bar.Note how Eq. 3 consists of two complementary terms.The first term models the transitions via hyperlinks (unless no hyperlinks exist as in the case of dangling nodes).The second term models "teleportation," described via E. α describes the relative probability between 4 For more details, including sample code and even some humorous anecdotes, we refer the reader to the book "Google's PageRank and Beyond: The Science of Search Engine Rankings" by Langville and Meyer (2006). 5A stochastic matrix holds the probability of transitioning between nodes as it would be described by Markov chains.It follows that the sum of each row must be equal to 1.A matrix A is primitive if ∃k∀(i, j) : A k i j > 0. A primitive matrix is both irreducible and aperiodic.Irreducible means that any state in a Markov chain is reachable from any other state.Aperiodic means that there is no set period for returning to any given state.These properties allow the iterative algorithm to converge.these two transition types.If α = 1, the user always only follows hyperlinks (when they are available).If α = 0, the user never follows hyperlinks and only teleports through the Web by typing Webpages in the address bar.Brin and Page (1998) advised to set α = 0.85.Note that the simplified version of PageRank from Eq. 1 featured α = 1.
In summary, the matrix G models how a user navigates the Web.H models the user's use of hyperlinks.D models what the user does when a Webpage with no hyperlinks is reached.E models the transitions that take place when the user chooses to not use hyperlinks, but directly go to a Webpage of choice.The matrices H, D, E, and the parameter α can then be tailored to a user's Web surfing behavior in order to produce more personalized results once the PageRank vector R is evaluated.

Using PageRank to model swarms
Just like Brin and Page used the Google matrix from Eq. 3 to model the behavior of a Web surfer, we can use it to model the behavior of a robot in the swarm.To do so, we must correlate the local experiences of a robot to a graph structure.We thus begin by defining the following discrete sets.
-Let S be the local state space of a robot.This is the set of local, discretized states that a robot can observe using its on-board sensors.The local states are going to be analogous to Webpages.-Let A be the set of all discrete local actions that a robot can take.
-Let be a stochastic policy (a stochastic map between states S and actions A) that the robot follows.This stochastic policy is analogous to the hyperlinks in the Web, as it allows robots to travel through local states.
Now consider a graph G S = (V , E).The graph G S is our microscopic local model of the robot in the swarm.The nodes of G S are the local states that the robots can be in, such that V = S.The edges of G S are all transitions between local states that could take place.The local transitions can be of two types: active or passive.That is, either they can happen as a result of an action by the robot (active), or they can happen because the environment around the robot changes (passive).More formally, G S is the union of two subgraphs: G a S and G p S .-G a S , whereby the superscript a stands for "active," holds all state transitions that a robot could go through by an action of its own based on the stochastic policy .The edges of this graph are weighted based on the relative probabilities in the stochastic policy and thus represent how likely it is that the robot will take the action when in a given local state.
-G p S , whereby the superscript p stands for "passive," holds all state transitions that a robot could go through because of changes in its environment, independently from .

The graphs G a
S and G p S model the robot and the effects of the environment on the robot, respectively.These models are specific to the type of robot that is being used, its state space, its action space, and the nature of the task.For each of the three tasks that we explore in this manuscript (consensus, pattern formation, and aggregation), we will show how to define G a S and G p S accordingly.Once properly defined, the graphs G a S and G p S can then be expressed as the matrices H, E, and D introduced in Sect.3.1 -H shall model how a robot navigates in its environment by its own actions, describing the possible local state transitions that may take place when the robot executes these actions.This is analogous to a user navigating through Webpages via hyperlinks.We thus define: where adj(G) denotes the weighted adjacency matrix of a graph G. -E models the "environment."These are state transitions that can happen to the robot even when it is not taking an action.For example, a neighbor could move away and cease to be a neighbor.This is analogous to a Web user who, instead of following hyperlinks through the Web, navigates to another Webpage via the address bar.In this case, the user instigates a state transition that is not described by the hyperlink structure (i.e., the "policy") of the World Wide Web.The matrix E is thus given by: -D shall model what can happen to a robot as a result of the changing environment around the robot, if and only if the robot reaches a state wherein it cannot (or will not) take any actions.These states are states that the policy does not map to any actions.We group these states in the set S static ⊆ S. Reaching such a state is analogous to reaching a Webpage with no hyperlinks.We define D as: Note that the matrices H, E, and D must be row normalized in order for the iterative procedure of Eq. 2 to converge.
The expansion factor α remains to be defined, which models the probability that a robot will take an action over being subject to changes in the environment, as modeled by E. One may choose to keep α constant, as originally suggested by Page et al. (1999).For instance, α could be a constant value that is a function of the robot density.However, this would be a global parameter and would thus not be representative of the local experience of the robots, unlike the remainder of the model.Instead, we propose to make α a function of the local state that the robot is in.For instance, this may reflect that, if the robot is in a local state whereby it is surrounded by several neighbors, the probability that it undergoes passive transitions due to actions of its neighbors may be higher than if it is in a state whereby it is not surrounded by neighbors at all.While exploring the three tasks analyzed in this paper, we will define different approaches to express α in more detail.

Using PageRank to evaluate the performance of the swarm
Once a microscopic model of the swarm has been constructed using the framework described in Sect.3.2, we can evaluate the PageRank centrality of all states in S.This will inform us on how likely it is for a robot in the swarm to reach each local state.In turn, this will be used to evaluate the performance of the whole swarm.
This is where we devise the micro-macro link based on PageRank centrality.The true metric that we wish to optimize is the efficiency of the swarm in achieving a given global goal.However, directly assessing this parameter would require us to simulate the swarm (which may lead to the scalability problems introduced at the beginning of this paper), or use a macroscopic model of the swarm (which may fail to capture the details of the policy).Therefore, instead of directly optimizing the stochastic policy against the global performance metric, we optimize the likelihood that a robot in the swarm will end up in certain local states of interest, i.e., local states that we know should be reached for the global goal to happen.For example, in the aggregation task, we will wish to achieve (and stay in) local states with neighbors over local states without neighbors.Alternatively, in the consensus task, the robot will aim to be in a state of agreement with its neighbors.Because all robots in the swarm are trying to achieve these "desired" local states as efficiently as possible, the global effect will then be that the swarm achieves the global goal more efficiently as well.
This concept can be formalized into a fitness function.Let S des ⊆ S be the set of local desired states pertaining to a global goal, and let R(s) be the PageRank centrality of a state s ∈ S. Based on this, we propose the following fitness function in order to evaluate the performance of the swarm: where R(s) is the PageRank of state s.It is extracted following the calculation of R from Eq. 2 with the model of Sect.3.2.This fitness function expresses the average PageRank of the states in S des in relation to the average PageRank of all states S. When used within an optimization strategy, our objective will be to alter so as to maximize F. This will maximize the average "popularity" of the states in S des and increase the likelihood for the individual robot to be in one of the states.At the global level, we expect that if all robots act such that they are likely to enter a state s ∈ S des , then the final global goal will also be more likely to emerge.Defining the set of desired states for a given task may appear troublesome.However, throughout this paper we present three different tasks and show that the desired states can often be intuitively extracted from a global goal.We will return to this discussion in Sect.7.

Sample task 1: consensus agreement
In this first task, a swarm of robots needs to achieve a consensus between a certain set of options.We will use the fitness function described in Sect.3.3 to optimize the probability with which a robot should change its opinion such that the entire swarm achieves consensus as quickly as possible.

Task description and setting
Consider a set of N robots that must collectively choose between M options.As an example, these options could be possible sites to visit together with the rest of the swarm.We focus on the case where all options are of equal value, meaning that the robots do not have a preference for any particular option.Let C be the set of M options that the swarm considers.The global goal of the swarm is for all robots to settle on a choice c ∈ C. We shall assume that the robots are in a static arbitrary connected configuration P.
The state s i of a robot R i holds the opinion of robot R i and the number of neighbors with each of the other opinions.As an example, consider a swarm that must choose between M = 2 options, whereby C = {A, B}.The local state space of a robot R i is then described by all possible combinations of these three variables: of the above form the local state space S for this task.If M > 2, then the S can be expanded accordingly to accommodate all other relevant combinations.
Based on its local state s i , a robot can choose to keep its opinion or change it.We set the action space A to A = C.Each robot follows a stochastic policy , dictating the probability of choosing an opinion c ∈ C (including its current one) for each state s ∈ S. We then define the set S des to be all states wherein the robot has the same opinion as all of its neighbors, for which we know a priori that there is no need to change opinion, as all robots are locally in agreement.These states are excluded from the stochastic policy , creating a global convergence point for the swarm.If all robots are in one of these local state, then it follows that consensus (which is the global goal) has been achieved.The size of the stochastic policy is

PageRank model
To define the PageRank model, we need to define G a S , which denotes the state transitions whenever a given robot takes an action, and G p S , which models the state transitions whenever the neighbors of the robot take actions.
Let us begin by defining G a S .In this case, whenever a robot R i changes its opinion between the M options that are available, only its internal choice changes, while the choice that its neighbors hold is not (directly) impacted by the action.Thus, for the example of M = 2, the only parameter of its state that changes is c i , while n i A and n i B remain constant.G a S is thus formed by a set of disconnected subgraphs, each where the robot is only capable of changing its own internal state, but all other parameters stay constant.A sample subgraph of G a S for M = 2 is depicted in Fig. 1a.G p S follows the same philosophy as G a S , but from the opposite perspective.The internal choice c i of a robot R i does not change when its neighbors change opinions.What changes is the number of neighbors with a certain opinion.The model assumes that the robot will always notice every time one of its neighbors changes opinion.G p S thus also takes the form of several disconnected subgraphs.An example of a subgraph for M = 2 is depicted in Fig. 1b for a robot with three neighbors.The matrices H, E, and D are extracted from G a S and G p S using Eqs.4, 5, and 6, respectively.Finally, we define the parameter α.For this task, we can reason that, if the robot is surrounded by more neighbors, then the likelihood that it will change opinion before one of its neighbors decreases.We can include this in our local model via a re-definition of α.To do so, we define a vector α v that holds a different value of α for each state s ∈ S, such that the equation for the Google matrix is modified to: where D α v is a diagonal matrix holding the vector α v .The entries of α v are α i ≤ 1 for i = 1, . . ., |S|.In this work, we model α i for a given state s i ∈ S using the following general definition: where p action (s i ) = a∈A P(a|s i ).Thus, p action (s i ) is the cumulative probability of taking an action (any action) from the stochastic policy when in state s i .All states s i ∈ S active feature 0 ≤ p action (s i ) ≤ 1.If p action (s i ) < 1, then there is also a probability that, when in state s i , the robot will not take an action but remain idle.The parameter n neighbors (s i ) is the number of neighbors at state s i .As this parameter increases, α i decreases.This same definition α will also be used for the pattern formation task in Sect. 5.
With the above, we have now fully defined a microscopic model of the swarm from the perspective of an individual robot while using PageRank's framework.We now proceed to optimize in order to maximize the fitness function expressed in Eq. 7.

Genetic algorithm setup and results
The optimization in this work is done via a genetic algorithm (GA).The GA features a population of ten scalar genomes, where the size of each genome is equal to .Each gene in a genome holds a value 0 ≤ p ≤ 1, indicating the probability of executing the corresponding state-action pair from .Each new generation is produced by elite members (30%), offspring (40%), and mutated members (30%).Offspring are generated from averaging two parents' genomes.Mutation replaces 10% of a genome's genes with random values from a uniform distribution.The initial generation was created by assigning random values between 0 and 1 to all genes in each genome, following a uniform distribution.Note that the cumulative probability of taking any of the actions when in a given state is always normalized, since all choices are part of the policy.
With this setup, we evolved behaviors for the case where M = 2 and M = 3.The results of the fitness increase over five evolutionary runs are shown in Fig. 2, where we show the best genome that was evolved for both the case where M = 2 (Fig. 2a) and the case where M = 3 (Fig. 2b).We now test the results of the evolved behavior to examine the magnitude by which the global performance of the swarm has improved.To do so, the swarm is simulated in discrete time whereby, at each time step, a random robot in the swarm changes its opinion.This models random concurrency in the swarm.For simplicity, we assume that the robots are placed on a grid world.Each robot R i is capable of sensing the opinion of any robot R j that happens to be in the eight grid points that surround it.We measure the performance of the swarm as the cumulative number of times that the robots in the swarm take an "action" before consensus is achieved, whereby an "action" is whenever a robot renews its opinion.This is indicative of how quickly the swarm is capable of reaching a consensus to one of the options.The results are compared to a basic baseline behavior where the robots choose between options with equal probability (except for the states S des , in which case they are in full agreement with their neighborhood and remain idle).In the following, the baseline behavior is denoted 0 and the evolved behaviors are denoted 1 .

(a)
The tests were performed for swarms of ten robots and 20 robots.Each swarm was evaluated 100 times, in random configurations and from random starting conditions.The results for a swarm that must choose between M = 2 options are shown in Fig. 3.The results for a swarm that must choose between M = 3 options are shown in Fig. 4. In all cases, the swarm is capable of achieving a consensus substantially faster than with the baseline behavior.Moreover, we see that the performance is robust to the number of robots in the swarm.These results show that the evolutionary procedure was capable of finding a local behavior that provides an efficient outcome at the global level and also adapts well to the number of robots, the initial conditions, and the spatial configuration of the robots.

Variant with limited binary cognition
To gain further insight into the adaptability of the framework, we consider a limited variant of the consensus task where robots are endowed with binary sensors.Each robot is only capable of sensing whether all of its neighbors agree with it or not, but is unable to discern the opinions of individual neighbors.Such a case reflects the impact of high noise or poor sensing abilities on the part of the robots.Consider the case where robots must choose between M = 2 options, with choices C = {A, B}.It follows that each robot in the swarm can be in one of four local states: s A0 : Internal opinion A, but the robot is not in agreement with all neighbors.s A1 : Internal opinion A, and the robot is in agreement with all neighbors.s B0 : Internal opinion B, but the robot is not in agreement with all neighbors.s B1 : Internal opinion B, and the robot is in agreement with all neighbors.
The local state space is S = {s A0 , s A1 , s B0 , s B1 }, and S static = S des = {s A1 , s B1 }.If in a state s / ∈ S des , a robot can choose between two actions.
The policy is evolved following the same evolutionary setup as in Sect.4.3.The only difference is that, as the robots are now incapable of sensing the number of robots in their neighborhood, we set α = 0.3 for all states, instead of using Eq. 9.The results of five evolutionary runs are shown in Fig. 5a.As expected, all evolutionary runs evolve to similar  1a, and the evolved policies, for all cases, are similar to the one given in Table 1b 123 In the original policy (denoted 0 ), the robots have equal probability of selecting either choice.In the evolved policy (denoted 1 ), the robots always switch their opinion whenever their neighborhood is not in agreement with their choice.For 1 , the reason that the values are not exactly 1 and 0 is attributed to the mutation strategy that was used during the evolution results.The performance improvement over a baseline case (whereby the robots alternate between states with equal probability) is shown in Fig. 5b, c for swarms of ten robots and 20 robots, respectively.

Analysis
The small size of the solution space for the variant of the consensus task studied in Sect.4.4 provides an opportunity to analyze the evolved policy and the global fitness landscape in detail.The baseline policy and the evolved policies are given in Table 1a, b, respectively.
In Table 1b, it can be seen that our evolutionary procedure determined that the best strategy would be for the robots to, almost deterministically, switch their opinion. 6We investigated whether this was an optimum solution by simulating the system for different probabilities of switching opinion.This was done using the same simulation setup as all evaluations in this section, whereby each setting was evaluated and averaged over 100 runs.
A first set of results for these simulations is shown in Fig. 6a, which shows the performance of the swarm in relation to the probability of switching.Here, it is confirmed that the policy is correct.As the probability of switching choices increases, the swarm is capable of achieving consensus more efficiently.It can also be seen that the evolved policy scales well with the size of the swarm.
We further investigated the entire solution space for a swarm of ten robots.Here, it is revealed that the evolved policy 1 , as given in Table 1b, is not the global optimum, but the unbiased global optimum.By unbiased, we mean that the robots are equally probable of selecting option A as they are of selecting option B. However, more optimal options exist if the robots become biased toward one of the two options.This result is actually reflected in the PageRank-based fitness function with which the behavior was evolved, as we had provided the set S des = {s A1 , s B1 } and gave equal importance to both options.It then follows that this brought us to the unbiased optimum.
To test whether our framework could adapt to this, we also evolved the policy for the cases where S des = {s A1 } and where S des = {s B1 }.Note that we kept S static = {s A1 , s B1 }, because Fig. 6 Depiction of global policy analysis for the probability of switching between two options A and B in the limited variant of the consensus task we still want the robots to stop if they are in agreement with their neighborhood.The results of these two evolutions are given in Table 2a, b for S des = {s A1 } and S des = {s B1 }, respectively.The evolution discovered the biased optimum for both options.This shows a flexibility on the part of the algorithm to adapt to the desired goals.Originally, the algorithm had evolved the unbiased policy because it had been told that both options had equal importance.When this constraint was altered to a new desired goal, the algorithm evolved a policy which reflected the new goal.
Additionally to the analysis above, the scalability of the evolved policies was analyzed in order to better understand their performance as swarm size increases.The comparison results showing the average performance of the policies are shown in Fig. 7a, b.This comparison is based on the results shown in the previous sections.It can be seen that the evolved policies scale well with the size of the swarm.The policy that was evolved with knowledge of the neighborhood scales gracefully, with only a minimal increase in the number of actions taken (per agent) as swarm size increases.A reason for this increase is the fact that, in a larger swarm, a robot is more likely to be surrounded by more neighbors, thus experiencing more uncertainty in its neighborhood.Furthermore, note that when a robot chooses to stay with its current choice, then this is also counted as an action.As expected, the limited variant evolved in Sect.4.4, by nature of the fact that the robots have less data on their local surroundings, begins to struggle more as the swarm size increases.

Sample task 2: pattern formation
This task deals with a swarm of robots with low awareness of their surroundings that must arrange into a desired spatial configuration.The robots are homogeneous (identical and without hierarchy), anonymous (do not have identities), reactive (memoryless), cannot communicate, do not have global position information, do not know the goal of the swarm, and operate asynchronously (i.e., by local clocks) in an unbounded space.The only knowledge  available to the robots is: (1) a common heading direction, such as north, and (2) the relative location of their neighbors within a maximum range.In previous work (Coppola et al. 2019), we have developed a local behavior with which such limited robots can always eventually arrange into a global desired pattern.Moreover, we provided a set of conditions to verify whether a given behavior would always eventually lead to the emergence of the desired pattern.
One issue that remained to be solved was that even simple patterns were found to take up to hundreds of actions (cumulatively by all robots) to emerge, and this number appeared to grow with the complexity of the pattern and the size of the swarm.Although this is to be expected, in light of the limited knowledge on the part of the robots, it is an issue that cannot be ignored if the system is to be used on real robots with limited battery life and other real-world time constraints.Solving this via classical approaches, wherein the swarm had to be ultimately simulated in order to find an optimum policy, proved unscalable (Coppola and de Croon 2018).This was because of all the issues listed in the introduction of this paper.In this section, we show how we can circumvent the scalability problem by tackling the optimization using the PageRank algorithm.

Description of approach to pattern formation
Such that this manuscript may be self-contained, this section summarizes the pattern formation methodology from Coppola et al. (2019), wherein a more detailed explanation can be found.For the sake of brevity, in this work we will assume that the swarm operates in a grid world and in discrete time.However, as demonstrated in Coppola et al. (2019), the behavior can also be used in continuous time and space with robots operating with local clocks.

Setting
Consider N robots that exist in an unbounded discrete grid world and operate in discrete time.
In the case studied in this paper, each robot R i can sense the location of its neighbors in the eight grid points that surround it, as depicted in Fig. 8a.This is the local state s i of the robot, which is all the information that it has.The local state space S consists of all combinations of neighbors that it could sense, such that |S| = 2 8 .At time step k = 0, we assume the swarm begins in a connected topology forming an arbitrary pattern P 0 .At each time step, one random robot in the swarm takes an action, whereby it is able to move to any of the 8 grid points surrounding it, as depicted in Fig. 8b.This is the action space of the robots, denoted A. If a robot takes an action at one time step, then it will not take an action at the next time step.This models the fact that a real robot would need some time to settle after each action and reassess its situation, leaving a time window for its neighbors to move.
The goal of the swarm is to rearrange from its initial arbitrary pattern P 0 into a desired pattern P des .This is achieved using the following principle.The local states that the robots are in when P des is formed are extracted, and form a set of local desired states S des ⊆ S, as depicted by the examples in Fig. 9.These local desired states are extracted as the observations of the robots once the pattern is formed, similarly to how puzzle pieces form a puzzle.If robot R i finds itself in any state s i ∈ S des , then it is instructed to not move, because, from its perspective, the goal has been achieved.Given a P des and the corresponding S des , it can be automatically (and sometimes even intuitively) verified whether the local desired states will uniquely form P des , or whether they can also can give rise to spurious global patterns.If spurious global patterns are not possible, then, until P des is formed, at least one robot will be in a state s / ∈ S des and will seek to amend the situation.The swarm will then keep reshuffling unless P des forms (Coppola et al. 2019).The robot is capable of moving omnidirectionally to any of the 8 grid points around its current position.These 8 actions form the actions space A.
Fig. 8 Depictions of a local state and the actions that a robot in the swarm can take Fig. 9 Set of desired states S des for the exemplary patterns treated in this paper, featuring patterns of increasing complexity and/or size.The set S des can be intuitively extracted for each pattern as the "puzzle pieces" that compose it.Note that the T with 12 robots has 11 states in S des .This is because the top state repeats.Also note that the line always only has three states in S des , because the center state can repeat indefinitely

Baseline behavior of the robots
When a robot R i is in a state s i / ∈ S des , it should execute an action from A. From the state space and action space, we can formulate a stochastic policy = S × A. However, not all actions should be allowed.The actions that: a) cause collisions and b) cause local separation of the swarm are eliminated from , because they are not "safe."Moreover, as explained in Sect.5.1.1,all states s ∈ S des do not take any actions.From this, we extract a final stochastic policy 0 , where 0 ⊆ .When a robot is in a state s, it will use the policy 0 to randomly select one possible action from the available state-action pairings.In Coppola et al. (2019), the final action was chosen from the available options based on a uniform probability distribution.
It is important to note that, from this pruning process, there also emerge additional local states that cannot take any actions.A robot in such a state will not be able to move or else it will either collide with other robots or possibly cause separation of the swarm.We refer to such states as blocked states.The set of blocked states is denoted S blocked .The states in S des and S blocked are functionally equivalent.In either case, a robot will not take any action.Together, they form the umbrella set S static = S des ∪ S blocked .
Furthermore, conceptually in contrast to static states, there are states where a robot will be capable of moving away from and/or around its neighborhood without issues.We call these states simplicial.Simplicial states are characterized by only having one clique, where we define a clique as a connected set of neighbors. 7The set of simplicial states is denoted S simplicial .Figure 10 shows examples of blocked states (Fig. 10a, b), a simplicial state (Fig. 10c), and a non-simplicial state (Fig. 10d).

Modeling the local experiences of the robots
Let us now construct the graph G S = (V , E) that models the local experiences of the robots.The nodes of G S are the local states that the robots can be in, such that V = S.The edges of G S are all possible transitions between local states.To help visualize G 1 S , G 2 S , and G 3 S , Fig. 11 shows a node and its successors for each graph.Figure 11a shows a node and its successors for G 1 S following an action based on a sample policy 0 .Figure 11b, c shows the same node in graphs G 2 S and G 3 S , respectively.As it can be seen, these three graphs bear a strong resemblance to the graphs G a S and G p S needed to define PageRank.We shall return to this in Sect.5.2, where the microscopic PageRank model for this task is defined.

Verifying that the pattern will always eventually emerge
In Coppola et al. (2019), it was shown that, by analyzing certain properties of G 1 S , G 2 S , and G 3 S , it can be verified that the pattern P des will eventually form starting from any initial pattern P 0 .This way, we can assess whether a policy is such that the final pattern will always eventually emerge, or whether it may cause the swarm to reshuffle endlessly without ever settling into the pattern.(We refer to this situation as a livelock.) The conditions are repeated here because, once we optimize the policy using PageRank, they shall be used as constraints in order to ensure that the final policy always eventually achieves the pattern.Specifically, the following conditions need to be met: shows that each state in S features a path to each state in S des .2. For all states s ∈ S static ∩ S ¬simplicial − s surrounded , none of the cliques of each state can be formed only by robots that are in a state s ∈ S des ∩ S simplicial .Here, s surrounded is the state that is surrounded by neighbors along all directions.

G 2r
S shows that all static states with two neighbors can directly transition to an active state.

G 1
S shows that any robot in state s ∈ S active ∩ S simplicial can travel around all its local neighbors, as exemplified in Fig. 10c (with the exception of when a loop is formed or when it enters a state s ∈ S static ).In G 3 S , any state s ∈ S static only has outward edges toward states s ∈ S active (with the exception of a state that is fully surrounded along two or more perpendicular directions).
The detailed motivations behind these conditions, published in Coppola et al. (2019), are not repeated here due to page constraints.In summary, they ensure that all robots will keep moving around with sufficient freedom for the swarm to reshuffle until the pattern is achieved.Condition 1 checks that it is possible to reach the final local desired states independently of the initial local states.Conditions 2 and 3 check that there is always at least one robot in the swarm that has the potential to move with sufficient freedom.Conditions 4 and 5 check that the free robot(s) is (or are) capable of sufficient exploration.
These conditions are local in nature.They analyze the local states of a robot based on its limited sensing range and the actions that the robot could take as a result.The advantage of this is that checking whether the conditions are met is independent of the size of the swarm, avoiding the combinatorial explosion that would otherwise ensue.Note that the conditions are sufficient, but not necessary.Fulfilling them means that the pattern will be achieved, but not fulfilling them does not mean the pattern will not be achieved. 8The first four patterns in Fig. 9 pass the proof conditions.The line, instead, does not fulfill all conditions.Moreover, it can be subject to spurious patterns (for example, a slanted H) as a result of S static .A more thorough discussion on the patterns that can be generated using this approach can be found in the original paper.
This verification procedure, combined with the fact that the behavior deals with very limited robots (anonymous, homogeneous, memoryless, with limited range sensing, and without needing any communication, global knowledge, or seed robots) moving in space, sets the work by Coppola et al. (2019) apart from other works such as the ones of Klavins (2007), Yamins and Nagpal (2008), or Rubenstein et al. (2014).Furthermore, we remind the reader that here we are dealing with robots that should move in space such that the swarm arranges into a desired spatial configuration.This is different in nature from altering the internal states of the robots into a particular repeating pattern, such as in the work of Yamins and Nagpal (2008).

PageRank model
To define the PageRank framework for this task, we need to define graphs G a S and G p S .This can be readily done based on the graphs G 1 S , G 2 S and G 3 S introduced in Sect.5.1.3.For G a S , the graph holds the local transitions that a robot will experience based on actions it takes from a stochastic policy.This is exactly the same as G 1 S , therefore: For G p S , the graph holds all the local transitions that happen whenever the environment causes a state transition (i.e., other robots take an action).This can be the result of actions by neighbors, as described by G 2 S , or actions of other robots which become neighbors, as described by G 3 S .Therefore, it follows that: The matrices H, E, and D are extracted from G a S and G p S using Eqs.4, 5, and 6, respectively.Concerning α, we follow the same definition as for the consensus task (see Eq. 9), whereby we define it based on the number of neighbors that a robot senses in a given state.This models how, with more neighbors, the robot is more likely to be subject to its environment.We once again have 0 ≤ p action (s) ≤ 1, where p action (s) is the cumulative probability of taking an action when in state s.If p action (s) < 1, then there is also a probability that the robot will not take an action and that it will remain idle.In the example shown in Fig. 11a, for instance, we would have p action (s i ) = p l + p r ≤ 1.Here, p l is the probability of taking an action to the left, and p r is the probability of taking an action to the right, according to the policy that is being used.

Optimization strategy
For this task, we wish to take care that the optimization procedure does not violate the conditions that tell us that the pattern will be formed, listed in Sect.5.1.4.For this reason, we have divided the optimization process in two phases.In the first phase, we only perform the removal of state-action pairs from 0 while keeping the conditions from Sect.5.1.4as constraints.From this phase, we will extract a policy 1 ⊆ 0 .After this is done, we remove the constraints and freely alter the probability of taking actions in 1 , except that we always keep a nonzero probability for all state-action pairs.We would like to make it clear to the reader that Phase 1 is not required, and it is possible to directly go to Phase 2. (We provide an example of that in Sect.5.7.)However, Phase 1 allows us to quickly reduce the solution space prior to Phase 2, while still ensuring that the pattern of interest can be formed.In both cases, we use the PageRank-based fitness function proposed in Eq. 7.
-Phase 1: state-action pair removal from 0 In Phase 1, state-action pairs are eliminated from the stochastic policy 0 with the goal of maximizing F. The input to this phase is the baseline policy that is extracted following the methods described in Sect.5.1.The output of this first phase is a stochastic policy 1 ⊆ 0 , whereby state-action pairs in 0 that do not help to achieve the pattern efficiently are automatically removed, while maximizing F as per Eq. 7. The optimization of Phase 1 is subject to the following two constraints: 1.The conditions from Sect.5.1.4must be respected This constraint checks that the pattern will always eventually be formed from any initial configuration.The conditions can be checked based from the local graphs G 1 S , G 2 S , and G 3 S , and therefore, the time required to check them scales with the local state-space size, and not the size of the swarm.

The final pattern must remain the unique emergent pattern As state-action pairs are
removed from 0 , it may be that additional states behave like states in S static and will not move.However, an important axiom needed to guarantee that P des will always form is that, for a swarm of N robots, N instances of the local states in S static , with repetition, must uniquely rearrange into P des .If this is not the case, another pattern could emerge where all robots are in a state S static and do not move.It must therefore be verified that 0 is not affected in such a way that this can happen.The method to verify this is described at the end of this subsection.
-Phase 2: probability optimization In Phase 2, the probability of executing individual the state-action pairs in 1 is altered so as to maximize the fitness F. This phase parallels the final optimization step of Coppola and de Croon ( 2018), with the key difference being that we now evaluate the performance of swarm using the PageRank-based fitness function in Eq. 7, rather than by simulating the swarm.The output of this second phase is a stochastic policy 2 .
Procedure to check the first constraint of Phase 1 Because we are already at the optimization stage, we consider a starting point whereby the original S static already guarantees that P des is unique.We then only need to check that adding new states to S static does not affect this property, which we can do at the local level.Consider a state s ∈ S active which has become a candidate to be moved to S static .For s, we locally check whether it could be fully surrounded by robots with a state within S static .If this is not possible, because s is such that at least one of its neighbors would be in an active state, then we can add s to S static .This is because we know that, if this state s were static, there would always be one active robot somewhere The genomes size is equal to the size of the original policy 0 extracted as described in Sect.5.1.2.Following Phase 1, which could remove state-action pairs, the size of the policy could be reduced.These smaller policies can then be optimized in Phase 2 next to it anyway, so P des still remains the only unique pattern that can be formed by static states.Furthermore, when this active neighbor moves, the local state of the robot will also change and it will also no longer be static.Note that, because of the importance of states in the S simplicial , these states are not allowed to transition into the set S static .

Phase 1: genetic algorithm setup and results
For Phase 1, we set up a GA with a population of ten binary genomes.Each gene in a genome represented a state-action pair in 0 , with a 1 indicating that the state-action pair was kept and a 0 indicating that it was eliminated.Each generation consisted of: elite members (30%), new offspring (40%), and mutated members (30%).Offspring genomes were the result of an AND operation between two parent genomes.Offspring were only kept if they complied with the constraints, else the parents were forced to look for new mates.On each generation round, mutation was applied to a random portion of the population, for which the NOT operator was randomly applied to 10% of each selected member's genome (thus changing 1s to 0s and vice versa).Similarly as to the offspring, a mutation was kept if and only if it returned a genome for which the constraints were met, else it was discarded and a new mutation was attempted until a successful one was found.This way it was guaranteed that the population always consisted of genomes that respected the constraints.Each valid genome was evaluated assigned a fitness F to as per Eq. 7. The genomes of the initial population were generated by performing the mutation procedure on a fully positive genome (all 1s).The genomes size for the four patterns was equal to | 0 | for each pattern, as indicated in Table 3.With this setup, five evolutionary runs were performed for the first four patterns of Fig. 9.9 The results of the evolutionary runs are shown in Fig. 12. Within 2000 generations, the fitness of all patterns approached convergence for all evolutionary runs, characterized by a steep increase in the first generations.
The performance of all the evolved policies was evaluated in simulation.The simulator was a direct implementation of the swarm on a grid as described in Sect.5.1.1.Each policy was evaluated 100 times with the swarm starting from random initial configurations, with the only requirement that they would begin in a connected topology.The global metric of performance was the amount of actions taken cumulatively by all robots before the desired final pattern emerged.This was compared to the original unoptimized policy 0 of each pattern.The results are shown as histograms in Fig. 13.It can be seen that all policies are able to achieve the pattern and significantly outperform the original unoptimized policy by approximately one order of magnitude, showing a successful application of PageRank as an optimization parameter.The difference of the evolved performance from the original was confirmed by a bootstrap test with 10,000 repetitions, using the sample mean and the ratio of the variances as the reference statistics.All distributions were declared different over the 95% confidence interval.We now continue to Phase 2 in order to further improve the performance of the swarm.

Phase 2: genetic algorithm setup and results
Phase 1 can achieve a stochastic policy 1 ⊆ 0 which can be seen to provide a global performance that is one order of magnitude faster than 0 .However, 1 still dictates that, when in a given state, a robot chooses from the available actions using a uniform distribution.This limitation is surpassed in Phase 2. The objective of Phase 2 is to optimize the probability of executing the state-action pairs in the policy, as was the case for the GA of the consensus task.The GA setup of Phase 2 is similar to the one of Phase 1, with two main differences: state-action in 1 .Naturally, for each state, the cumulative probability of taking one of the actions in 1 , denoted p action (s), cannot be larger than 1.If this happens, then the probabilities of the relevant state-action pairs in the genome are appropriately normalized.
Each new generation was produced by elite members (30%), offspring (40%), and mutated members (30%).Offspring were generated from averaging two parents' genomes.Mutation was applied to random genomes, for which 10% of their genes were replaced by random values of 0 < p ≤ 1 from a uniform distribution.The members of the initial population were produced by applying the mutation procedure above to a genome with uniform probability distributions, as extracted from Phase 1.The genome size is detailed in Table 3.Five evolutionary runs were conducted for each pattern for 1000 generations.The policy to be optimized was the best performing policy from Phase 1, for each pattern, based on the results of Phase 1 shown in Fig. 13.The results of the evolution of Phase 2 are shown in Fig. 14.The global performance of the final policies 2 was evaluated 100 times as in Phase 1. Once again, we measure the global performance by the cumulative number of actions that the robots in the swarm take before the desired pattern emerges.The results are shown in Fig. 15, where they are compared to the best policy obtained at the end Phase 1 for each pattern.For the triangle with four robots and the hexagon, we also compare these results to the performance of the optimized behavior from Coppola and de Croon (2018), which used simulation in order to directly optimize for the cumulative number of actions taken by the swarm.Sample trajectories of simulations are shown in Fig. 16.The improvements achieved for the triangle with four robots and the hexagon appear marginal.For two policies evolved for triangle with four robots, a bootstrap test (based on the difference in sample means) could not reject the hypothesis that the distribution of the final performance is equal to the performance at the end of Phase 1.This was the case for also one of the policies evolved for the hexagon.However, this is a justified result, because for both the triangle with four robots and the hexagon, the performance after Phase 1 already approached the performance of a policy which was optimized with the fitness being evaluated through simulations, which we take to be a near-limit performance.The evolutions of Phase 2 achieved policies that performed equivalently and, in some cases, seemingly even slightly better than the ones that were achieved in Coppola and de Croon (2018).In Fig. 15, it can be seen that a higher fraction of runs take fewer than ten actions and fewer than 100 actions for the triangle with four robots and the hexagon, respectively.These results show the power of PageRank's framework in being able to describe the swarm up to the point that the achieved performance matches the performance of results that are achieved with a global evaluation procedure.
For the triangle with nine robots and the T with 12 robots, Phase 2 instead shows significant improvements over Phase 1.These are most noticeable for the T with 12 robots, which reaches an improvement in performance of one order of magnitude compared to the policies from Phase 1. Overall, this means that the policies following Phase 2 outperform the unoptimized policy by two orders of magnitude.We remind the reader that both of these patterns could  2018).The latter is available only for the first two cases, the triangle with four robots and the hexagon, as extracting it for larger patterns leads to scalability issues) never have been tackled if, instead of using PageRank, we had used simulation in order to assess the performance of the swarm.Simulations would have been infeasible in light of scalability issues, as was indeed the case in Coppola and de Croon (2018).Instead, these results have been achieved with the use of an entirely local evaluation.

Analysis
The results of Phase 1 and Phase 2 both show that the evolutions successfully lead to policies with high performance boosts, which even match the performance of policies that were optimized using a fitness function based on a global simulation of the swarm.In this section, we shall now aim to gain more insight into how the PageRank-based fitness function was correlated with the global performance of the swarm.As we are now dealing with a large policy with a large number of parameters (see Table 3), the entire fitness landscape cannot be studied.Instead, we evaluated the global performance at various stages of the Phase 1 and Phase 2 evolutions to show how the performance improvements are correlated with the PageRank-based fitness.The performance is measured as the average number of actions required cumulatively by the swarm in order to achieve the pattern.This is based on 100 simulations for the highest fitness at each generation, sampled every 20 generations for the triangle with four robots and the hexagon, and every 50 generations for the triangle with nine robots.The dotted line shows the mean performance of the policy achieved by a global optimization in Coppola and de Croon (2018), only available for the first two patterns For Phase 1, the results of this evaluation are shown in Fig. 17.It can be seen that the performance of the swarm improves quickly in the beginning.This matches the fast rise in fitness observed in Fig. 12.Moreover, for the triangle with four robots and the hexagon, we see that the performance even begins to approach the one of the final global optimization from Coppola and de Croon (2018).Overall, Phase 1 shows a successful correlation between the PageRank-based fitness function and the global performance of the swarm.10For Phase 2, the results of the same analysis for the triangle with four robots, the hexagon, and the triangle with nine robots are shown in Fig. 18.Once again, it is possible to observe a general overall improvement in the performance.In all cases, over the first generations, the performance of the swarm improves.
It is in Phase 2 that, however, we also notice certain limitations.It appears that, beyond a certain point, some evolutionary runs begin performing worse as the evolution continues, which is not reflected in the fitness function.The effect is most clearly noticeable on: -the hexagon, starting from around the 100th generation, for all runs.
-the triangle with nine robots, where one of the evolutionary runs features a drastic spike in performance at approximately the 400th generation (after which it seems to recover).
These effects are likely the result of reaching a limitation of PageRank's microscopic model once the constraints from Phase 1 are removed.Eventually, there comes a point where the evolution over-fits to the microscopic model at the expense of the global performance.There are also a number of other effects at play: 1.The change in fitness is low compared to Phase 1 evolutions; hence, the effects of noise are more clearly noticeable.2. The impact of changing probabilities in a stochastic policy is less direct than completely removing a state-action pair from the policy.Therefore, the changes have less impact on the results.
(a) 4 robots triangle (b) Hexagon (c) 9 robots triangle Fig. 18 Evaluation of performance improvement during Phase 2 for all evolutionary runs.The performance is measured as the average number of actions required cumulatively by the swarm in order to achieve the pattern.This is based on 100 simulations for the highest fitness at each generation, sampled every 20 generations for the triangle with four robots and the hexagon, and every 50 generations for the triangle with nine robots.The dotted line shows the mean performance of the policy achieved by a global optimization in Coppola and de Croon (2018), only available for the first two patterns

Further investigations
We conclude this section with a further investigation, whereby we explore whether placing more importance on certain desired states can be beneficial or destructive to the pattern formation task.This is relevant to patterns that feature a repeating desired state.Consider, for example, the line along northeast as shown in Fig. 9.The line only requires three states in the set S des (bottom edge, middle piece, and top edge), yet can be generated with any number of robots by nature of a repeating middle state.We thus investigated whether, for long lines, placing more importance on the repeating middle state would improve the performance of the optimized policy.This was done with an altered version of the fitness function which calculated a weighted average of the desired states, rather than a normal average, as in Eq. 10.
where w(s) indicates the importance of a state s ∈ S des .We performed three different optimization runs with weight ratios 1:1:1, 1:8:1, and 1:18:1, relating to the weight of the bottom edge, middle piece, and top edge, respectively.For these optimizations, we directly used the optimization setup of Phase 2, with the fitness function of Eq. 10.The results of the evolution are shown in Fig. 19a.When the performance of the obtained policies was evaluated for a swarm of 20 robots, it was found that the policy that was optimized with a weight ratio of 1:1:1 featured the best performance, followed by 1:8:1, and finally by 1:18:1.This is shown in Fig. 19b, c.11These results lead to the insight that, at least for this type of task, it appears equally important for robots to reach all desired states simultaneously, and preference toward one may be detrimental.Even when a state repeats, the performance remains more optimal when all desired states are weighted equally in the fitness function

Sample task 3: aggregation
In this task, robots move within a closed arena with the goal of forming aggregates.Each robot can sense the location of neighbors within a range.This task is similar in nature to that of Correll and Martinoli (2011).Using PageRank, we optimize a stochastic policy that tells a robot whether it should perform a random walk or remain idle based on the number of neighbors that it can sense.The final policy is optimized without taking into account the lower-level implementation of the random walk, the number of robots in the swarm, the size of the environment, or the initial condition.Applying the framework to optimize a higherlevel policy such as this one provides further insight into the applicability and flexibility of the approach.

Task description
Consider N robots in a closed arena.The robots operate in continuous space (rather than a grid, as in previous tasks), and in continuous time, using local clocks.Each robot can sense the number of neighbors, m, within a limited range ρ max .We assume that each robot is capable of sensing a maximum of M neighbors, after which the sensor saturates.Therefore, the local state space S for this task is given by all possible numbers of neighbors that can be sensed by the robot, namely S = {0, 1, . . ., M}.The sensor updates its reading with a time interval t c .
Based on the number of neighbors sensed, a robot can choose to engage in one of the following two sub-behaviors: 1. Perform/resume a random walk behavior.We denote this action as a walk .2. Stop and remain idle.We denote this action as a idle .
These form the action space A = {a walk , a idle }.The policy , to be optimized using the PageRank algorithm, is the stochastic policy mapping the state space S to the actions space A. The set S des = {m min , m min + 1, . . ., M} is the set of desired states, where m min is the minimum amount of neighbors that we wish for a robot to observe.

PageRank model
To set up the PageRank framework, we begin by defining the active graph G a S , which describes the local state changes as a result of the actions of a robot.
The graph G a S is defined as a fully connected graph between all states, as exemplified in Fig. 20a for the case where M = 5.If the robot remains idle, then it will always keep its number of neighbors (remember that all changes in the environment, including neighbors leaving, are described in the passive graph, G p S ).We thus set the weights for all transitions to the same state to be equal to (s i , a idle ).All other edges are given weights according to the following: where p(m) is the probability of transitioning to a state with m neighbors.In our work, we set: p(m) = 1/(m + 1) 2 , albeit we have found that the same final policy is evolved also for other definitions, such as p(m) = 1/(m + 1).Therefore, we can see that it is sufficient to coarsely model the trend that the probability of finding larger groups decreases with the size of the group.The swarm designer does not need to concern oneself with the exact specification of how likely it is to encounter larger group sizes, which eases the design task.Graph G p S holds the transitions that a robot can experience regardless of its own choice to be idle or not.In this case, the robot can be in any state s ∈ S, at which point it could lose a neighbor or gain a neighbor at any point.The graph G p S takes the form shown in Fig. 20b.As done for the previous tasks, α is set as a function of the total probability of selecting a new action and the number of neighbors, as in Eq. 9. We also note here, however, that our evolutions returned the same final behavior also for simpler static definitions of α, such as α = 0.5, which would simply model that at all times the robot is equally subject to its decisions as it is to its environment.This is likely because the behavior that is being optimized is too high level for such details to make a real difference.(b) A final evolved policy, showing the probability of moving with respect to the number of neighbors that are sensed.The robots almost always move if the number of neighbors found is less than the desired amount, else they always remain idle.
Fig. 21 Details of the evolution of the policy for the aggregation task, showing the increase in fitness a final evolved policy

Genetic algorithm setup
We used GA setup as used for the consensus task in order to tune the probabilities of the policy for this task.We considered the case where m min = 2, meaning that the robots have the goal of forming aggregates of three or more.Unlike other cases, we did not set the policy values for the states S des to zero.Note that this is not necessary and actually makes the problem harder to solve, since more freedom is given to the optimizer with more parameters to be optimized.Setting the policy values for the states S des would have automatically forced a relative basic solution (i.e., "move when not in a group") upon the behavior.Thus, we instead took this as an opportunity to see whether PageRank would be able to tune its weights to also inform the robots to remain in the states of S des , without us providing this knowledge a priori.
The results of five evolutionary runs, showing the best genome for each generation, are shown in Fig. 21a.The evolved behavior was the same in all cases.The behavior we extracted from one evolutionary run is shown in Fig. 21b.Here, we can clearly see that the final evolved behavior is simple, yet very sensible.The extracted policy approaches a deterministic policy whereby if the number of neighbors is less than m min = 2, then the robot always chooses to move and explore the environment in the hope of finding more neighbors.If the number of neighbors is sufficient, then the robot remains idle.

Simulation environment and evaluation results
The performance of the evolved behavior was tested using a simulation environment called Swarmulator.Swarmulator is a C++-based simulator designed to efficiently simulate and prototype the behavior of swarms in continuous time and space.Each robot (together with its dynamics) is simulated and controlled by an independent detached thread, which limits simulation artifacts which may otherwise stem, for instance, from the robots being simulated always in the same order.In our simulations, the robots had the same dynamics as accelerated particles, allowing them to move omnidirectionally.The details of the simulated behavior were the following: -When initiating a random walk, a robot could randomly select a direction θ from a uniform distribution [0, 2π] and pursue it with a reference speed v.The decision to Setting d w = ρ max simulates the fact that the same sensor is used to sense both neighbors and the walls.-At all times, in direct addition to the velocity commands from the random walk behavior and wall avoidance, the robots were governed by attraction and repulsion forces.This handled collision avoidance.In our implementation, the attraction and repulsion behavior was tuned such that the equilibrium distance between robots was 1.5 m.
The system above was tested 50 times for a swarm of 20 robots in a 30 m × 30 m square arena.In all cases, the swarm successfully forms aggregates of three or more robots within the arena.Two sample runs are shown in Fig. 22.

Analysis
As for the variant of the consensus task from Sects.4.4 and 4.5, we took advantage of the small size of this policy in order to map and investigate the fitness landscape.The investigation was focused on testing the impact of the probability of moving within the arena when the robots did not have any neighbors and when the robot had one neighbor.Since the robots had the goal of being with two or more neighbors, these were the two main parameters of interest.
The simulation environment from Sect.6.4 was used to evaluate the performance of the swarm against different policies, in order to construct the fitness landscape with respect to our global parameter of interest.The results were obtained by averaging 50 simulations for each setting, and repeated for swarms of 20 robots and 50 robots in a 30 m × 30 m arena.(c) PageRank-based fitness landscape, where 1/F is the inverse of the fitness evaluated through Eq. 7.
Fig. 23 Global mapping of the solution space for the aggregation policy for different probabilities of motion with zero and one neighbors.p walk (m) is the probability of pursuing a random walk when surrounded by m neighbors The maximum simulated time allowed for the task was set to 1000 s.The results are shown in Fig. 23a, c, respectively.This fitness landscape, which is the global metric that we wanted to optimize for in reality, is compared to the fitness landscape shaped by the PageRank-based fitness function with which the behavior was actually evolved.The fitness landscape, shown in Fig. 23c, has a clear correlation with the global performance landscapes shown in the remainder of Fig. 23.The result of this analysis shows that the behavior that was evolved approaches an optimum that is also found at the global level.

Discussion
In Sects.5, 4, and 6, the PageRank framework has been applied to optimize the stochastic policy of robots in a swarm for three separate tasks.With the knowledge gained, this section summarizes the advantages of the proposed approach (Sect.7.1), its current limitations, and possible solutions (Sect.7.2) and proposes research directions that could lead to interesting extensions (Sect.7.3).

Advantages and properties of the proposed approach
The primary motivation for the development of the approach proposed in this paper was to find a method for which the time required to evaluate a behavior does not scale with the size of the swarm.Taking as example the tasks tested in this paper, if the fitness of the controller were to have been tested in simulation, then the time to be evaluated (wall-clock time) would have been dependent on a number of parameters, such as: -The number of robots in the swarm, which affects both the processing time required to simulate it and the time before which the global goal is achieved.-The size of the environment, which would alter the rate at which events occur.This would have been relevant for a task such as the aggregation task.
Furthermore, the final evolved solution would have been fitted to the following:

123
-The initial conditions from which the swarm is simulated.To avoid over fitting to a particular initial condition, each controller would have had to be tested from several random initial conditions.-The number of robots used in the simulations.To avoid over fitting to swarms of a specific size, each controller would have had to be tested on swarms of different sizes.-The global environment.If evolving within a specific environment, the final solution may implicitly be fitted to it.
Certain methods may help to mitigate the issues above when simulation is used, such as evolving for fewer robots and then testing on larger swarms (Trianni et al. 2006), or capping the simulation time to a maximum value (Scheper and de Croon 2016).Nevertheless, such solutions present new challenges of their own.The first, for instance, does not guarantee that the controller will be scalable.The latter, on the other hand, may limit the ability of the evolutionary algorithm to properly search the solution space.If the time limit is too low, then few genomes, if any, will succeed in the goal.Moreover, this means that the fitness is no longer assessed by whether, and how efficiently, the goal is achieved, but must be assessed based on how much of the goal has been achieved.This may require a different fitness function, a task which can likely cause additional difficulties and may even shape the final outcome (Nelson et al. 2009).
Using PageRank, the evaluation of the fitness function is based only on parameters extracted from a microscopic model.This has the following beneficial effects, which can be found back in the results of this paper: -The wall-clock time does not scale with the size of the swarm This can be appreciated directly from the results in this manuscript, such as for the pattern formation task, where we have managed to optimize the policy for patterns that we could not tackle using simulation to evaluate the fitness (Coppola and de Croon 2018).We have plotted the wallclock time required to evaluate the PageRank-based fitness for the pattern formation task in Fig. 24a.Each data point is averaged over 100 evaluations of PageRank for random instances of a stochastic policy.It can be seen that, once the maximum neighborhood size of 8 is fulfilled, the wall-clock time stops scaling and it remains constant even if the number of robots in the swarm increases.Instead of scaling with global parameters, the wall-clock time scales with the number of local states S and the possible transitions between them, as modeled by G a S and G p S .To test the practical impact of this, we evaluated the wall-clock time of the PageRank-based fitness function when varying the size of S (which also bring about more state transitions) for the separate tasks tested in this paper, averaged over 100 random policy iterations.The results are shown in Fig. 24b.It can be seen that the wall-clock time grows with the size of S. In comparison, the fitness wall-clock time for simulations directly grows with the size of the swarm.This is shown for the pattern formation task from Fig. 25.Note that this is despite using the simple grid simulation environment as was used for the simulations in Sects.5.4, 5.5, and 5.6.The performance of higher fidelity simulators would be worse.
-The evolved policy scales well with the number of robots in the swarm The policy is evaluated from a graph for which the number of robots in the swarm is not modeled, only the probability that certain events in the neighborhood of a specific robot may take place.For example, for the consensus task, the same exact policy could be applied to swarms of different sizes (we tested with swarms of ten and 20 robots), from random initial configurations, and consistently provided efficient results orders of magnitude better than the baseline case.-The evolved policy is not tuned to any particular initial condition of the swarm The proposed PageRank-based fitness function (Eq.7) evaluates the likelihood of achieving certain local states.However, as PageRank is evaluated for the entire graph simultaneously using the iterative procedure described in Sect.3, the score does not assume any initial condition by the robot, and, by extension, the swarm.This makes it robust to the initial conditions.In all results shown in this paper, there was no initial condition for which the evolved solution made it impossible to achieve the final global goal or for which the performance did not improve.-The evolved policy is not tuned to any particular lower-level controller In all instances, a policy was evolved without any specification of lower-level parameters.This can be best appreciated for the aggregation task.Here, we evolved a high-level policy without specifying the type of random walk that was going to be implemented, but only specifying the relative probability of encountering neighbors when a robot is moving.Had such a behavior been tuned in simulation, then it would have inherently adapted to the particular implementation of search behavior (e.g., wall-following, Lévy Flight, etc.), whereas this is not the case for our results, providing a more general controller.The relative probability of encountering neighbors subsumes a larger set of specific lowerlevel behaviors.Similarly, for the consensus task and the pattern formation task, the behavior is evolved without any particular specification of communication time, decision time, or motion controller.A simulation-based optimization might have adapted the probabilities to implicitly account for these specifications.Instead, the only concern for the PageRank-based optimization is with which probability it should transition between states.-The evolved policy is not tuned to any particular global environment PageRank's framework only models the possible transitions that may happen in the environment, but not the entire environment itself.Therefore, as long as the possible local transitions that may take place in a global environment are sufficiently well represented, then the evolved policy remains robust to different environments.In some instances, the global environment may also be excluded altogether.This was the case for the consensus task and the pattern formation task.

Current limitations and potential solutions
We have identified a set of limitations that pertain to the current implementation of the proposed approach.These are discussed below.When relevant, possible solutions or extensions are proposed.
-Discrete state space and discrete action space The current version of the approach relies on a discrete set of local states and a discrete set of actions.This may be limiting for applications with continuous states.The limitation comes from the fact that we are using a tabular policy which maps discrete states to discrete actions, which then allows us to construct a graph with discrete nodes and edges.Although this is a limitation of the current form, future work can explore the possibilities of extending the approach to deal with continuous state spaces.A research direction that would be interesting to follow is the possibility to use basis functions to interpolate over continuous state spaces.This type of solution has been proposed in the past for other approaches that also natively rely on a tabular policy, such as Q-learning, in order to accommodate continuous state spaces and/or action spaces (Sutton et al. 1998;Busoniu et al. 2010).
-"Missing out" on the global optimum Although the benefits of swarming are often credited as flexibility, robustness, and scalability (Hamann 2018), some applications may benefit from using policies which are tuned to a specific settings.For instance, if it is known a priori that a swarm of robots will always exclusively operate within a specific environment and be composed of a fixed number of robots, then it may be desired to find the global optimum for this particular setup.As an example, consider an aggregation task with a swarm of three robots.An optimal policy might feature that it is beneficial for two robots to wait for the third to come into their neighborhood.The number of robots would thus be implicitly encoded into the final policy.Instead, the locally evolved policy would not consider this option, as the number of robots would not be known nor modeled explicitly throughout the evolution.Although this means that the solution is likely flexible, or robust, it mean it is less globally optimal for the specific case.Moreover, as the approach is local in nature, it is not possible to directly evaluate global parameters such as the ideal number of robots for a given task.
-The need to define desired local states A complexity and potential limitation of using our approach is the need to define the set S des .This is the set of locally desired states that form the global goal.For the three tasks shown in this paper, S des was extracted with ease using intuition.Nevertheless, as task complexity grows, S des may become more difficult to define.A potential solution to this problem could be to use an automatic method which is capable of inferring the local desired states by observing a global behavior.This is discussed further in Sect.7.3.-Assumption of successful state transitions The microscopic model used to evaluate PageRank assumes that state transitions are successful in executing the policy.In reality, this may not always be the case in light of imperfect actuation or noisy sensors, which may cause unexpected state transitions that are not modeled.This limitation can be addressed at the robot level via proper filtering of sensory data as well as accurate lowerlevel control.From the policy optimization side, it would be interesting to see whether this issue can be tackled via an online refinement of the model, allowing the policy to adapt itself to the imperfections of the real world during deployment.We continue this discussion in Sect.7.3.

Further potential extensions
-Extension to online learning The current implementation uses an offline learning process, whereby the graphs G a S and G p S are devised based on a model of the robot and the task.There is also the possibility to extend the approach to online learning.The graphs G a S and G p S are based on the expected local states and transitions that the robot can experience.All these local states and all these transitions can be measured by the local sensors present on the robot.Therefore, this means that it is also possible for a robot to construct, expand, and/or tune the graphs G a S and G p S online directly based on its experiences during operation.In turn, it could then re-optimize its controller to suit these new graphs.For example, this could be done in order to automatically tune to the failure probability of certain actions using an online refinement of the transition rates.It can also be used to tune against "noisy" actuators which fail (with a given measurable probability) to follow through on an action, and either do not complete the action or create a different state transition than planned.Alternatively, it could adapt to systematic events in its environment, placing a higher probability on those transitions in the graph G p S .The parameter vector α v could also be estimated or refined online.Thanks to the generally low computational efforts required to run the optimization (notice the wall-clock times in Fig. 24), the optimization could also be performed on-board of a robot without requiring excessive computational power.
-Alternative fitness functions The fitness function proposed in this work (Eq.7) focused on maximizing the probability that a robot in the swarm would be in a desired state.The hypothesis that this would lead to global level improvements was found to be successful for each of the tested scenarios.Further investigations led us to the discovery that there may also be correlations to other parameters stemming from PageRank.For instance, using the results of the analysis from the pattern formation task from Sect.5.6, we have found that the variance of the centrality also appears to show a correlation with the global performance of the swarm.In future work, it would be interesting to see whether, and how, alternative fitness functions could be constructed in order to provide even larger global level benefits.-Extension to heterogeneous swarms This paper has tested the algorithm on homogeneous swarms of robots with one global goal.However, we wish to note that swarm homogeneity is not a direct limitation of the proposed algorithm.In fact, the "passive" graph G p S , which models the environment, already does not assume that the neighboring robots operate with the same policy.(Rather, it assumes that all possible changes in the environment are equally probable.)For a swarm of functionally heterogeneous robots, we anticipate the main difficulty to be in defining concurrent sets of desired states for each type of robot, which all combine toward a greater global goal.
-Including transitions with a probability of failure An assumption that was taken for the tasks in this work was that, if a robot decides to perform an action, then it will always succeed.In reality, however, this may not always be the case, as certain state-action pairs may have a lower success rate than anticipated in the real world.However, this failure probability can be directly factored in within the transition probabilities of the graphs G a S and G p S .On this note, it could also be possible for a robot to do this online while performing the task.
-Automatic extraction of desired states As stated in Sect.7.2, a limitation of the approach is the need to explicitly define the locally desired states.It can be postulated that this may not always be intuitively done by a swarm designer, particularly as task complexity increases.However, it can be argued that most (if not all) tasks require that some local states are more "desired" than others, and that these must be locally observable by the robot (even if not apparent, or understood, by the designer)-the challenge is to find them.In future work, this challenge could be solved by a multilayered learning approach, whereby a higher-level learner is tasked with extracting the desired local states for a given task.

Conclusion
The framework proposed in this paper is based on the idea to microscopically model the local experiences of a robot in a swarm by means of graphs.It then becomes possible to evaluate the PageRank centrality of the local states and use this as a measure of the global behavior of the swarm by evaluating the relative PageRank of local states of interest.This can be used within an optimization procedure in order to optimize the global behavior of swarms while only requiring a local analysis.This technique provides several advantages which natively include flexibility, robustness, and scalability, which are three key properties of swarm robotics.We have applied it to a pattern formation task, a consensus task, and an aggregation task.The time needed to perform the optimization did not scale with the size of the swarm, which allowed us to tackle swarm sizes that could not be tackled if simulation had been our means of assessment.Moreover, the results obtained were natively independent of global parameters such as the initial starting condition, the environment, or the number of robots in the swarm.The new insights presented in this paper offer a new strategy to evaluate the final behavior of groups from the perspective of the individual.We expect that the approach can generalize to several other tasks, whereby a robot in the swarm and its environment is modeled in analogy to how a user surfs the World Wide Web.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/),which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Fig. 1
Fig. 1 Representation of subgraphs of G a S and G p S for a consensus task with M = 2 options, C = {A, B}

Fig. 2 Fig. 3
Fig. 2 Evolution of policy for consensus agreement task with two choices (M = 2) and three choices (M = 3)

( a )Fig. 4
Fig.4Performance of consensus agreement task where M = 3, meaning that the robots must settle between three choices.The global performance is measured by the number of times that, cumulatively, the robots change their decision before a consensus is achieved

( a )
Fig. 5Evolution and performance of the policy for the binary variant of the consensus task.The baseline policy is given in Table1a, and the evolved policies, for all cases, are similar to the one given in Table1b

( a )
Fig. 7Comparison of mean performance of the evolved policies for both variants of the consensus task for the case where M = 2

( a )
Example of a local state s i ∈ S of a robot R i .The robot is shown in black and its neighbors within sensing range are shown in gray.It is assumed that a robot can always sense neighbors in the 8 grid points that surround it.

( b )
Possible actions that a robot can take.

Fig. 11
Fig. 11 Examples of a node and its successor nodes from graphs G 1 S , G 2 S , and G 3 S

Fig. 12
Fig. 12 Phase 1 evolutions for all patterns tested, showing the best fitness from each of the five evolutionary runs

Fig. 13
Fig. 13Normalized histograms showing the cumulative number of actions required before the global goal is achieved for the evolved policies 1 ⊆ 0 , extracted from the five evolutionary runs of Phase 1 (colored lines), compared to a baseline unoptimized performance of 0 for each pattern (black line) (Color figure online)

Fig. 14
Fig. 14 Phase 2 evolutions for all patterns tested, showing the best fitness from each of the five evolutionary runs

Fig. 15
Fig. 15Normalized histograms showing the cumulative number of actions required before the global goal is achieved for the evolved policies from Phase 2, in comparison with the best performing policy of 1 , and the globally optimized performance fromCoppola and de Croon (2018).The latter is available only for the first two cases, the triangle with four robots and the hexagon, as extracting it for larger patterns leads to scalability issues)

Fig. 16 Fig. 17
Fig. 16 Exemplary simulations showing the evolution of the swarm with policy 2 for each pattern studied, given random initial conditions.The red dots indicate starting positions, and the green dots indicate final positions (Color figure online)

Fig. 19
Fig.19Study on the impact of changing the relative importance of states in S des .Even when a state repeats, the performance remains more optimal when all desired states are weighted equally in the fitness function

Fig. 20
Fig. 20 Depictions of G aS and G p S models for the aggregation task

Fig. 22
Fig. 22Depiction of two simulations of the aggregation task using the evolved policy together with the simulation environment described in Sect.6.4.On the left figure, the robots assemble into one cluster.On the right figure, the robots assemble into multiple clusters.Multiple clusters may happen due to the low sensing range of the robots in comparison with the size of the arena(Correll and Martinoli 2011)

( a )
Global performance of the solution space for a swarm of 20 robots in a 30 m× 30 m arena.tc is the mean time taken before the task completed.(b)Global performance of the solution space for a swarm of 50 robots in a 30 m× 30 m arena.tc is the mean time taken before the task is completed.

( a )
Change in wall-clock time with the number of robots for the pattern formation task.Once the swarm reaches 9 or more robots (such that the local neighborhood of a robot could be entirely filled with 8 other robots) the wall-clock time does not scale.(b) Change in wall-clock time as the size of the local state space increases.

Fig. 24
Fig. 24Trends in wall-clock time of PageRank-based fitness evaluation, denoted t e v.When using the PageRank centrality measure, the wall-clock time scales with the size of the local state space.tev denotes the average wall-clock time over 100 random policy evaluations.These have been computed using a MATLAB script on a Dell latitude 7490 laptop with Intel®Core™i7-6600U CPU @ 2.60GHz × 4, Intel®HD Graphics 520 (Skylake GT2), 8GB RAM, equipped with SSD, and running Ubuntu 16.04

Fig. 25
Fig. 25Trends in wall-clock time to evaluate the performance of a policy in a simulation environment for the pattern formation task, evaluated for 50 randomly generated policies for which the wall-clock time was averaged over ten iterations.Note that the plot features a log scale on both axes.The simulator used was a simple grid simulation implemented in MATLAB.The machine details are the same as used for the results in Fig.24

Table 1
Baseline policy and evolved policy for the binary variant of the consensus task for the case of M = 2

Table 2
Evolved policies with a bias toward options A or B Fig. 10Examples of: a a state s ∈ S blocked , due to it being surrounded; b a state s ∈ S blocked , because any motion will cause the swarm to locally disconnect; c a state s ∈ S active ∩ S simplicial , because it can travel around all its neighbors; d a state s ∈ S active but s / ∈ S simplicial , because it can move but it cannot travel around all its neighbors or else it might disconnect the swarm For this task, we break down G S in three subgraphs.-G 1indicates all state transitions that a robot could go through by an action of its own, based on 0 .-G 2 S indicates all state transitions that a robot could go through by an action of its neighbors, which could also move out of view.-G 3 S indicates all state transitions that a robot could go through if another robot, previously out of view, were to move into view and become a new neighbor.where neighbors stay in view, and does not hold the ones where neighbors fall out of view.
S be a subgraph of G 2 S .G 2r S only holds the state transitions in G 2 S

Table 3
Genome size during Phase 1 for the four patterns tested