Turing learning: a metric-free approach to inferring behavior and its application to swarms

We propose Turing Learning, a novel system identification method for inferring behavior. Turing Learning simultaneously optimizes models and classifiers. The classifiers are provided with data samples from both an agent and models under observation, and are rewarded for discriminating between them. Conversely, the models are rewarded for 'tricking' the classifiers into categorizing them as the agent. Unlike other methods for system identification, Turing Learning does not require predefined metrics to quantify the difference between the agent and models. We present two case studies with swarms of simulated robots that show that Turing Learning outperforms a metric-based system identification method in terms of model accuracy. The classifiers perform well collectively and could be used to detect abnormal behavior in the swarm. Moreover, we show that Turing Learning also successfully infers the behavior of physical robot swarms. The results show that collective behaviors can be directly inferred from motion trajectories of a single agent in the swarm, which may have significant implications for the study of animal collectives.


2010,
. One application of system identification is the reverse engineering of agent behavior (biological organisms or artificial agents). For example, evolutionary computation has proven to be a powerful method to automate the generation of models, especially for behaviors that are hard to formulate Lipson, 2005, 2007].
A limitation of current system identification methods is that they rely on predefined metrics, such as the square error, to measure the difference between the output of the models and that of the system under investigation. Model optimization then proceeds by minimizing the measured differences. However, for complex systems, defining a metric can be non-trivial and case-dependent. It may require prior information about the systems. Moreover, an unsuitable metric may not distinguish well between good and bad models, or even bias the identification process. This paper overcomes these problems by introducing a system identification method that does not rely on predefined metrics.
A promising application of such a metric-free method is the identification of collective behaviors, which are emergent behaviors that arise from the interactions of numerous simple agents [Camazine et al., 2001]. Inferring collective behaviors is particularly challenging, as the agents not only interact with the environment but also with each other. Typically their motion appears stochastic and is hard to predict [Helbing and Johansson, 2011]. For instance, given a swarm of simulated fish, one would have to evaluate how close its behavior is to that of a real fish swarm, or how close the individual behavior of a simulated fish is to that of a real fish. Characterizing the behavior at the level of the swarm (that is, an emergent behavior) is difficult [Harvey et al., 2015]. It may require domain-specific knowledge and not discriminate among alternative individual rules that exhibit similar collective dynamics [Weitz et al., 2012]. Characterizing the behavior at the level of agents is also difficult, as even the same individual fish in the swarm is likely to exhibit a fundamentally different trajectory every time it is being looked at.
In this paper, we propose Turing Learning, a novel system identification method that allows a machine to infer the behavior of an agent in an autonomous manner. The method optimizes two populations simultaneously: one of models, and the other of classifiers. Each classifier is provided with a data sample, which originates from an observation of either the agent or a model. The classifier outputs a Boolean value indicating where the sample is believed to come from. The classifier receives a reward if and only if it makes the correct judgment. The reward thus depends solely on the classifier's ability to discriminate between the agent and models. Conversely, the reward for the models depends solely on their ability to 'trick' the classifiers into categorizing them as the agent.
Turing Learning does not rely on predefined metrics for measuring the similarity of behavior between models and agent; rather, the metrics (classifiers) are produced automatically in the learning process. The method is inspired by the Turing test [Turing, 1950, Saygin et al., 2000, Harnad, 2000, which machines can pass if behaving indistinguishably from humans. Similarly, the models could pass the tests by the classifiers if behaving indistinguishably from the agent. We hence call our method Turing Learning.
In the following, we examine the ability of Turing Learning to infer the behavior of swarms of agents. The agents are the robots of existing swarm robotic systems. This allows us to compare the obtained models to the ground truth. We present two case studies on canonical problems in swarm robotics: selforganized aggregation [Gauci et al., 2014b], and object clustering [Gauci et al., 2014c]. The corresponding behaviors are reactive; in other words, each agent maps its inputs (sensor readings) directly onto the outputs (actions). The problem of inferring this mapping is challenging, as the inputs are unknown. Instead, all relevant information has to be extracted from the agents' data samples (motion trajectories). A replica is mixed into a group of these agents. The replica executes behavioral rules defined by the model, and also produces data samples (motion trajectories). We show that by observing the motion trajectories (but not knowing the sensory values that produced these), Turing Learning automatically infers the behavioral rules underpinning the collective behaviors.
We originally presented the basic idea of Turing Learning, along with preliminary simulations, in [Li et al., 2013[Li et al., , 2014. This paper extends our prior work as follows: • It presents an algorithmic description of Turing Learning; • It shows that Turing Learning outperforms a metric-based system identification method in terms of model accuracy; • It demonstrates, to the best of our knowledge for the first time, that system identification can infer the behavior of swarms of physical robots; • It examines in detail the usefulness of the classifiers; • It examines through simulation how parameters of the agents' morphology (their field of view) and brain (controller) can be simultaneously inferred.
This paper is organized as follows. Section 2 discusses related work. Section 3 describes Turing Learning and the general methodology of the two case studies. Section 4 investigates the ability of Turing Learning to infer behaviors of swarms of simulated robots. Section 5 presents a real-world validation of Turing Learning with a swarm of physical robots. Section 6 concludes the paper.

Related Work
Turing Learning is a system identification method that simultaneously optimizes a population of models and a population of classifiers. The objective for the models is to be indistinguishable from the system under investigation. The objective for the classifiers is to distinguish between the models and the system. The idea of Turing Learning was first proposed by [Li et al., 2013], who presented a coevolutionary approach for inferring the behavioral rules of a single agent. The agent moved in a simulated, one-dimensional environment.
Classifiers were rewarded for distinguishing between the models and agent. In addition, they were able to control the stimulus that influenced the behavior of the agent. This allowed the classifiers to interact with the agent during the learning process. Turing Learning was subsequently investigated with swarms of simulated robots [Li et al., 2014]. Goodfellow et al. [2014] proposed generative adversarial nets (GANs). GANs, while independently invented, are essentially based on the same idea as Turing Learning. The authors use GANs to train models for generating counterfeit images that resemble real images, for example, from the Toronto Face Database (for further impressive examples, see [Radford et al., 2016]). They simultaneously optimize a generative model (producing counterfeit images) and a discriminative model that estimates the probability of an image to be real. The optimization is done using a stochastic gradient descent method.
In a work reported in [Herbert-Read et al., 2015], humans were asked to discriminate between the collective motion of real and simulated fish. The authors report that the humans could do so even though the data from the model is consistent with the real data according to predefined metrics. Their results "highlight a limitation of fitting detailed models to real-world data". They argue that "observational tests [...] could be used to cross-validate model" (see also [Harel, 2005]). This is in line with Turing Learning. Our method, however, also automatically generates the models as well as the classifiers, and thus does not require human observers.
While Turing Learning can in principle be used with any optimization algorithm, our implementation relies on coevolutionary algorithms. Metric-based coevolutionary algorithms have already proven effective for system identification [Bongard and Lipson, 2004a,b, 2005, 2007, Koos et al., 2009, Mirmomeni and Punch, 2011, Le Ly and Lipson, 2014. A range of work has been performed on simulated agents. In [Bongard and Lipson, 2004b], the authors proposed the estimation-exploration algorithm, a nonlinear system identification method to coevolve inputs and models in a way that minimizes the number of inputs to be tested on the system. In each generation, the input (test) that leads, in simulation, to the highest disagreement between the models' predicted output was carried out on the real system. The quality of models was evaluated through quantitatively comparing the output of the real system and the models' prediction. The method was applied to evolve morphological parameters of a simulated quadrupedal robot after it had undergone 'physical' damage. In a later work [Bongard and Lipson, 2004a], the authors reported that "in many cases the simulated robot would exhibit wildly different behaviors even when it very closely approximated the damaged 'physical' robot. This result is not surprising due to the fact that the robot is a highly coupled, non-linear system: thus similar initial conditions [...] are expected to rapidly diverge in behavior over time". The authors addressed this problem by using a more refined comparison metric reported in [Bongard and Lipson, 2004a]. In [Koos et al., 2009], an algorithm which also coevolves models and inputs (tests) was presented to model a simulated quadrotor and improve the control quality. The tests were selected based on multiobjective performances (e.g., disagreement ability of models as in [Bongard and Lipson, 2004b] and control quality of a given task). Models were then refined through comparing their predicted trajectories (corresponding to the selected tests) with those of the real system. In these works, predefined metrics were critical for evaluating the performance of models. Moreover, the algorithms are not applicable to the scenarios we consider here, as there are no exogenous inputs in the swarm systems under observation (the same would be the case for a swarm of biological agents).
Some studies also investigated the implementation of evolution directly in physical environments, on either a single robot [Floreano and Mondada, 1996, Zykov et al., 2004, Bongard et al., 2006, Koos et al., 2013, Cully et al., 2015 or multiple robots [Watson et al., 2002, O'Dowd et al., 2014, Heinerman et al., 2015. In [Bongard et al., 2006], a four-legged robot was built to study how it can infer its own morphology through a process of continuous self-modeling. The robot ran a coevolutionary algorithm on-board. One population evolved models for the robot's morphology, while the other evolved actions (inputs) to be conducted on the robot for gauging the quality of these models. Note that this approach required knowledge of the robot's sensor data. O'Dowd et al.
[2014] presented a distributed approach to coevolve on-board simulators and controllers for a swarm of ten robots to perform foraging behavior. Each robot had its own simulator which modeled the environment. The evolution of each robot's simulator was driven by comparing the real-world foraging efficiency of its nearby neighbors each executing the best controller generated by their own simulators. Each robot had a population of controllers, which evolved according to the robot's on-board simulator. The best controller was chosen for performing real-world foraging. This physical/embodied evolution helped reduce the reality gap between the simulated and physical environments [Jakobi et al., 1995]. In all these approaches, the model optimization is based on predefined metrics (explicit or implicit).
The use of replicas can be found in ethological studies in which researchers use robots that interact with animals to study their behaviors [Vaughan et al., 2000, Halloy et al., 2007, Faria et al., 2010, Halloy et al., 2013. Robots can be created and systematically controlled in such a way that they are accepted as conspecifics or heterospecifics by the animals in the group [Krause et al., 2011]. For example, in [Faria et al., 2010], a replica fish, which resembled sticklebacks in appearance, was created to investigate two types of interaction: recruitment and leadership. In [Halloy et al., 2007], autonomous robots which executed a derived model were mixed into a group of cockroaches to modulate their decision-making of selecting a shelter. The robots behaved in a similar way to the cockroaches. Although the robots' appearance was different to that of the cockroaches, the robots released a specific odor that the cockroaches could perceive them as conspecifics. In these works, the models were manually derived and the robots were only used for model validation. We believe that this robot-animal interaction framework could be enhanced through Turing Learning, which autonomously infers the collective behavior.

Methodology
In this section, we present the Turing Learning method and show how it can be applied to two case studies in swarm robotics.

Turing Learning
Turing Learning is a method that generates models of behavior. The agent under "investigation" could be natural or artificial. In either case, Turing Learning needs data samples originating from the behavior. For example, if the behavior of interest was to shoal like fish, data samples could be trajectory data from fish. If the behavior was to paint in a particular style (e.g., Cubism), data samples could be photos of paintings with this style. In addition, data samples would need to be produced by the models. The models could take the form of computer programs, or be physical, such as robots.
Simultaneously to model generation, Turing Learning generates classifiers. Given a data sample, the classifiers judge where it comes from. Does it originate from a real fish or a model? Does it originate from a Cubist painter or a model? This setup is akin of a Turing test; hence the name Turing Learning.
The models and classifiers are competing. The models are rewarded for misleading the classifiers, whereas the classifiers are rewarded for making correct judgements. Turing Learning thus optimizes models for producing indistinguishable behavior. This is in contrast to other system identification methods, which optimize models for producing as similar as possible behavior. The Turing test inspired setup allows for model generation irrespective of whether suitable similarity metrics are known.
In principle, the models can take any form. The model must however be expressive enough to produce data samples that-from an observer's perspective-are indistinguishable from those of the agent. If the classifiers have access to only a subset of the agent's outputs (e.g., its motion trajectories), any agent characteristic not influencing this output can not be learned.
The classifiers can be any algorithm that takes a sequence of data as input and produces a binary output. In principle, more complex outputs could be considered, such as ones providing confidence levels.
Algorithm 1 provides a description of Turing Learning. We assume a population of M models and a population of N classifiers. After an initialization stage, Turing Learning proceeds in an iterative manner until a termination criterion is met.
In each iteration cycle, data samples are obtained from observations of both the agent and models. The classifiers may influence the sampling process. This would enable a classifier to choose the conditions under which the behavior is observed [Li et al., 2013]. In this case, independent data samples should be generated for each classifier. In the case studies considered here, the classifiers do not influence the sampling process. Therefore, the same set of data samples is provided to all classifiers of an iteration cycle. For simplicity, we assume that obtain data samples (model j) 9: for each sample, obtain and store output of classifier i end while 16: end procedure each of the N classifiers is provided with K data samples for the agent and with one data sample for every model.
A model's reward is determined by its ability of misleading classifiers to judge it as the agent. Let m ij = 1 if classifier i (wrongly) classified the data sample of model j, and m ij = 0 otherwise. The reward of model j is then given by: A classifier's reward is determined by how well it judges data samples from both the agent and models. By default, the reward of classifier i is given by: The specificity of a classifier denotes its true negative rate: the percentage of agent data samples that were correctly identified as such (i.e., the tests were negative). Formally, where, a ik = 1 if classifier i (correctly) classified the k th data sample for the agent, and a ik = 0 otherwise. The sensitivity of a classifier denotes its true positive rate: the percentage of model data samples that were correctly identified as such (i.e., the tests were Using the reward values, r m and r c , the model and classifier populations are improved. In principle, any population-based optimization method can be used.

Case Studies
We examine the ability of Turing Learning to infer the behavioral rules reported for two canonical problems in swarm robotics [Blum andGroß, 2015, Bayındır, 2016]: self-organized aggregation [Gauci et al., 2014b], and object clustering [Gauci et al., 2014c].

Agents
The agents move in a two-dimensional, continuous space. They are differentialwheeled robots. The speed of each wheel can be independently set to [−1, 1], where −1 and 1 correspond to the wheel rotating backwards and forwards, respectively, with maximum speed. Fig. 1 shows the agent platform, the epuck [Mondada et al., 2009], which is used in the experiments.
Each agent is equipped with a line-of-sight sensor that detects the type of item in front of it. We assume that there are n types (e.g., background, other agent, object [Gauci et al., 2014b,c]). The state of the sensor is denoted by The swarm behaviors investigated in this paper use a reactive control architecture [Brooks, 1991]. Each agent maps the input (I) onto the outputs, that is, a pair of predefined speeds for the left and right wheels, (v I , v rI ), v I , v rI ∈ [−1, 1]. Given n sensor states, the mapping can be represented using initial configuration after 60 s after 180 s after 300 s Figure 2: Snapshots of the aggregation behavior of 50 agents in simulation.
2n system parameters, which we denote as: Using p, any reactive behavior for above agent can be expressed. In the following, we consider two example behaviors in detail. We investigate the ability of Turing Learning to infer these behaviors as well as 1000 randomlygenerated reactive behaviors.
Aggregation. In this behavior, the sensor is binary, that is, n = 2. It gives a reading of I = 1 if there is an agent in the line of sight, and I = 0 otherwise. The environment is free of obstacles. The objective for the agents is to aggregate into a single compact cluster as fast as possible. Further details, including a validation with 40 physical e-puck robots, are reported in [Gauci et al., 2014b].
The aggregation controller was found by performing a grid search over the space of possible controllers [Gauci et al., 2014b]. The controller exhibiting the highest performance was: When I = 0, an agent moves backwards along a clockwise circular trajectory (v 0 = −0.7 and v r0 = −1.0). When I = 1, an agent rotates clockwise on the spot with maximum angular speed (v 1 = 1.0 and v r1 = −1.0). Note that rather counterintuitively, an agent never moves forward, regardless of I. With this controller, an agent provably aggregates with another agent or a quasi-static cluster of agents [Gauci et al., 2014b]. Fig. 2 shows snapshots from a simulation trial with 50 agents.
Object Clustering. This behavior uses n = 3 sensor states: I = 0 if the sensor is pointing at the background (e.g., the wall of the environment, if the latter is bounded), I = 1 if the sensor is pointing at an object, and I = 2 if it is pointing at another agent. The objective of the agents is to arrange the objects into a single compact cluster as fast as possible. Details of this behavior, including a validation using 5 physical e-puck robots and 20 cylindrical objects, are presented in [Gauci et al., 2014c]. The controller's parameters, found using an evolutionary algorithm [Gauci et al., 2014c], are: p = (0.5, 1.0, 1.0, 0.5, 0.1, 0.5) .
When I = 0 and I = 2, the agent moves forward along an anti-clockwise circular trajectory, but with different linear and angular speeds. When I = 1, it moves forward along a clockwise circular trajectory. Fig. 3 shows snapshots from a simulation trial with 5 agents and 10 objects.

Models
The models are executed on replicas of the agents described in Section 3.2.1. In other words, the replica has the same differential drive and line-of-sight sensor as the agents. 1 We assume that the agents are unable to store information during run time. Thus, Turing Learning has to infer the input-output mapping, represented by the parameters in Eq (5). In simulation, the search space for the algorithm is unbounded; hence, the replica can move with arbitrary speed. The explicit representation makes it possible to objectively measure the quality of the obtained models in the post-evaluation analysis. Note that the classifiers, which determine the rewards of the models, have no knowledge about the representation or the inputs the models and agents receive.
By default, we consider the situation that some replicas are mixed into a group of agents. The case of studying groups of agents and groups of replicas in isolation is, however, also investigated (see Section 4.9).

Classifiers
The classifiers need to discriminate between the agents and models. We use the term individual to refer to either the agent or a replica executing a model.
A data sample comes from the motion trajectory of an individual observed for the duration of a trial. We assume that it is possible to track both the individual's position and orientation. The sample comprises the linear speed (s) and angular speed (ω). 2 The classifier is represented as an Elman neural network [Elman, 1990]. The network has i = 2 inputs (s and ω), h = 5 hidden neurons and one output neuron. Each neuron of the hidden and output layers has a bias. The network thus has a total of (i + 1)h + h 2 + (h + 1) = 46 parameters, which all assume values in R. The activation function used in the hidden and the output neurons is the logistic sigmoid function, which has the range (0, 1) and is defined as: The final value of the output neuron is used to make the judgment: model, if its value is less than 0.5, and agent otherwise. The network's memory (hidden neurons) is reset after each judgment.

Optimization Algorithm
The optimization of models and classifiers is realized using an evolutionary algorithm. We use a (µ + λ) evolution strategy with self-adaptive mutation strengths [Eiben and Smith, 2003] to optimize either population. The optimization algorithm can be thought of consisting of two sub-algorithms: one for the model population, and another for the classifier population. The sub-algorithms do not interact with each other, except for the reward calculation described in Section 3.1. The implementation of the evolutionary algorithm is detailed in [Li et al., 2013].
For the remainder of this paper, we adopt terminology used in evolutionary computing, and refer to the reward of solutions as their fitness and to iteration cycles as generations.

Termination Criterion
The algorithm stops after running for a fixed number of iterations.

Simulation Experiments
In this section, we present the simulation experiments for the two case studies, including simulation platform, simulation setups and the results obtained.

Simulation Platform
We use the open-source Enki library [Magnenat et al., 2011], which models the kinematics and dynamics of rigid objects, and handles collisions. Enki has a built-in 2-D model of the e-puck. The robot is represented as a disk of diameter 7.0 cm and mass 150 g. The inter-wheel distance is 5.1 cm. The speed of each wheel can be set independently. Enki induces noise on each wheel speed by multiplying the set value by a number in the range (0.95, 1.05) chosen randomly with uniform distribution. The maximum speed of the e-puck is 12.8 cm/s, forwards or backwards. The line-of-sight sensor is simulated by casting a ray from the e-puck's front and checking the first item with which it intersects (if any). The range of this sensor is unlimited in simulation.
In the object clustering case study, we model objects as disks of diameter 10 cm with mass 35 g and a coefficient of static friction with the ground of 0.58, which makes it movable by a single e-puck.
The robot's control cycle is updated every 0.1 s, and the physics is updated every 0.01 s.

Simulation Setups
In all simulations, we used an unbounded environment. For the aggregation case study, we used groups of 11 individuals-10 agents and 1 replica that executes a model. The initial positions of individuals were generated randomly in a square region of sides 331.66 cm, following a uniform distribution (average area per individual = 10000 cm 2 ). For the object clustering case study, we used groups of 5 individuals-4 agents and 1 replica that executes a model-and 10 cylindrical objects. The initial positions of individuals and objects were generated randomly in a square region of sides 100 cm, following a uniform distribution (average area per object = 1000 cm 2 ). In both case studies, individual starting orientations were chosen randomly in [−π, π) with uniform distribution.
We performed 30 runs of Turing Learning for each case study. Each run lasted 1000 generations. The model and classifier populations each consisted of 100 solutions (µ = 50, λ = 50). In each trial, classifiers observed individuals for 10 s at 0.1 s intervals (100 data points).

Analysis of Inferred Models
In order to objectively measure the quality of the models obtained through Turing Learning, we define two metrics. Given a candidate model (candidate controller) x and the agent (original controller) p, where x ∈ R 2n and p ∈ [−1, 1] 2n , we define the absolute error (AE) in a particular parameter i ∈ {1, 2, . . . , 2n} as: We define the mean absolute error (MAE) over all parameters as:    4 shows a box plot 3 of the parameters of the inferred models with the highest fitness value in the final generation. The fitness of the models depends on the solutions (classifiers) from the competing population, and is hence referred to as the subjective fitness. It can be seen that Turing Learning identified the parameters for both behaviors with good accuracy (dotted black lines represent the ground truth, that is, the parameters of the observed swarming agents). In the case of aggregation, the means (standard deviations) of the AEs in the parameters were (from left to right in Fig. 4(a)): 0.01 (0.01), 0.01 (0.01), 0.07 (0.07), and 0.06 (0.04). In the case of object clustering, these values were: 0.03 (0.03), 0.04 (0.03), 0.02 (0.02), 0.03 (0.03), 0.08 (0.13), and 0.08 (0.09).
We also investigated the evolutionary dynamics. Fig. 5 shows how the model parameters converged over generations. In the aggregation case study (see Fig. 5(a)), the parameters corresponding to I = 0 were learned first. After around 50 generations, both v 0 and v r0 closely approximated their true values (−0.7 and −1.0). For I = 1, it took about 200 generations for both v 1 and v r1 to converge. A likely reason for this effect is that an agent spends a larger proportion of its time seeing nothing (I = 0) than other agents (I = 1)-simulations revealed these percentages to be 91.2% and 8.8%, respectively (mean values across 100 trials).
In the object clustering case study (see Fig. 5(b)), the parameters corresponding to I = 0 and I = 1 were learned faster than the parameters corre-  sponding to I = 2. After about 200 generations, v 0 , v r0 , v 1 and v r1 started to converge; however it took about 400 generations for v 2 and v r2 to approximate their true values. Note that an agent spends the highest proportion of its time seeing nothing (I = 0), followed by objects (I = 1) and other agents (I = 2)-simulations revealed these proportions to be 53.2%, 34.2% and 12.6%, respectively (mean values across 100 trials).
Although the inferred models approximate the agents well in terms of parameters, it is not uncommon in swarm systems that small changes in individual behavior can lead to vastly different emergent behaviors, especially when using large numbers of agents [Levi and Kernbach, 2010]. For this reason, we evaluated the quality of the emergent behaviors that the models give rise to. In the case of aggregation, one measure of the emergent behavior is the dispersion of the swarm after some elapsed time as defined in [Gauci et al., 2014b] 4 . For each of the 30 models with the highest subjective fitness in the final generation, we performed 30 trials with 50 replicas executing the model. For comparison, we also performed 30 trials using the agent (see Eq. (6)). The set of initial configurations was the same for the replicas and agents. Fig. 6(a) shows the dispersion of agents and replicas after 400 s. All models led to aggregation. We performed a statistical test 5 on the final dispersion of the individuals between the agents and replicas for each model. There was no statistically significant difference in 26 out of 30 cases (30 out of 30 cases with Bonferroni correction).
In the case of object clustering, we use the dispersion of the objects after  [Graham and Sloane, 1990]. See Section 4.3 for details. 400 s as a measure of the emergent behavior. We performed 30 trials with 25 individuals and 50 objects for the agent and each model. The results are shown in Fig. 6(b). In a statistical test on the final dispersion of objects by the agent or any of the models (replicas), there was no statistically significant difference in 24 out of 30 cases (26 out of 30 cases with Bonferroni correction).

Comparison with a Metric-Based System Identification Method
In order to compare Turing Learning against a metric-based approach, we used an evolutionary algorithm with a single population of models. The algorithm was identical to the model optimization sub-algorithm in Turing Learning except for the fitness calculation. In each generation, the reciprocal of the total square error between the model's and the agents' linear and angular speed sequences in a trial was used as the model's fitness. The square error of a model is defined as follows: where s  Each evolution run lasted 1000 generations. The simulation setup and number of fitness evaluations for the models were kept the same as in Turing Learning. Fig. 7a shows the parameter distribution of the evolved models with highest fitness in the last generation over 30 runs. The distributions of the evolved parameters corresponding to I = 0 and I = 1 are similar. This phenomenon can be explained as follows. In the identification problem that we consider, the method has no knowledge of the input, that is, whether the agent perceives another agent (I = 1) or not (I = 0). The metric-based algorithm seems to construct controllers that do not respond differently to either input, but work as good as it gets on average, that is, for the particular distribution of inputs, 0 and 1. For the left wheel speed both parameters are approximately −0.54. This is almost identical to the weighted mean (−0.7 * 0.912 + 1.0 * 0.088 = −0.5504), which takes into account that parameter v 0 = −0.7 is observed around 91.2% of the time, whereas parameter v 1 = 1 is observed around 8.8% of the time (see also Section 4.3). The parameters related to I = 1 evolved well as the agent's parameters are identical regardless of the input (v r0 = v r1 = −1.0). Similar results have been found when inferring the object clustering behavior (see Fig. 7b).
By comparing Figs. 4 and 7, one can see that Turing Learning outperforms the metric-based evolutionary algorithm in terms of model accuracy in both case studies. As it was argued before, due to the unpredictable interactions in swarms the traditional metric-based method is not suited for inferring the behavior. Instead, Turing Learning automatically synthesizes its own metric based on the data observed.

Analysis of Generated Classifiers
The primary outcome of the Turing Learning method (and of any system identification method) is the model, which has been discussed in the previous section. However, the generated classifiers can also be considered as a useful byproduct. For instance they could be used to detect abnormal agents in a swarm. We will now analyze the performance of the classifiers. For the remainder of this paper, we consider only the aggregation case study.
To assess the performance of the classifiers, we measure the percentage of correct judgments over agents and a wide range of models. The models are uniformly distributed across the entire parameter space of the agents: [−1, 1] 4 . To keep the analysis of classifiers within a reasonable computation time, we discretize this space using 11 settings per parameter, to obtain: X = {−1.0, −0.8, ..., 0.8, 1.0} 4 . This discretized space is a grid consisting of |X | = 11 4 = 14641 points (i.e., models). The classifier's performance is computed as follows. The model is executed by a replica mixed into a group of 10 agents (as in the Turing Learning runs). 10 trials are performed using a set of initial configurations common to all classifiers. The motion data is fed to each classifier, which makes 10 judgments per individual. If the classifier consistently judges the individual as a model (i.e., not an agent) in 10 out of 10 trials, it outputs a "model" decision. Otherwise, it outputs "agent". This conservative approach was used to minimize the risk of false positive detection of abnormal behavior. The accuracy combines the percentage of correct judgments about models (50% weight) with the percentage of correct judgments about agents (50% weight), analogous to the reward definition in Eq. (2). The accuracy of the classifier increases in the first 5 generations, then drops and fluctuates within range 62%-80%.

Using a Single Classifier
An alternative strategy is to select the classifier that achieves the highest fitness when evaluated on the whole historical tracking data (not just those of the current generation). The decision accuracy of this classifier is also shown in Fig. 8 (best classifier (archive)). The trend is similar to that of best classifier (subjective). The accuracy increases in the first 10 generations, and then starts decaying, dropping to around 65% by the 1000 th generation. However, in the earlier generations, the accuracy of the best classifier (archive) is higher than that of the best classifier (subjective). For a comparison, we also plot the highest decision accuracy that a single classifier achieves for each generation (best classifier (objective)). Interestingly, the accuracy of the best classifier (objective), which is shown in Fig. 8 (black curve), increases almost monotonically, reaching a level above 95%. Note that to select the best classifier (objective), one needs to perform additional trials (146410 in this case).
At first sight, it seems counterintuitive that both the classifier (subjective) and the classifier (archive) have a low decision accuracy. This phenomenon, however, can be explained when considering the model population. We have shown in the previous section (see especially Fig. 5(a)) that the models converge rapidly at the beginning of the coevolutions. As a result, when classifiers are evaluated in later generations, the trials are likely to include models very similar to each other. Classifiers that become overspecialized to this small set of models (the ones dominating the later generations) have a higher chance of being selected during the evolutionary process and in the post-evaluation. These classifiers may however have a low performance when evaluated across the entire model space.

Using a Classifier Group
The results of the previous section have shown that using a single classifier is not a good solution; although there may be a good classifier in each generation, it may take significant effort (trials with agents) to find it.
To address this problem, we propose the use of a classifier group, that is, a number of classifiers working in tandem to judge a given candidate. We choose the best 10 classifiers 6 . This is either the set of classifiers with the highest subjective fitness in the current generation or the set of classifiers that achieve the highest fitness when evaluated on the whole historical data. If one or more classifiers make a decision about the candidate as a model (i.e., not an agent), the system outputs a "model" decision. Otherwise, it outputs "agent".
The results of using a classifier group are shown in Fig. 8 (green and magenta, respectively). Both types of classifier groups exhibit significantly improved de- cision accuracy across all generations. After 1000 generations, each system has a high accuracy of above 95%, on average, and performs thus similarly well to best classifier (objective).

Observing Only a Subset of Agents
So far, we have used motion data about all agents in the swarm when evaluating classifiers. However, this may not always be feasible in practice. For instance, given a video recording of a large and/or dense swarm, extracting motion data about all agents may be infeasible or lead to highly inaccurate results. A more practical solution might be to only track a subset of agents (e.g., by equipping them with markers).
We now compare the case where all agents are observed with the other extreme, where only one agent is observed. We study how these two cases compare as the swarm size increases. We conducted 30 runs of Turing Learning with each number of agents n ∈ {1, 5, 10, 50, 100}. There was always one replica in the group. When observing only one agent, this was chosen randomly in each trial. Note that the total number of trials in each run for the case of observing all agents and one agent is identical. We measured the total square error of the model with the highest subjective fitness in the last (1000 th ) generation of each run. The results are shown in Fig. 9.
There is no statistically significant difference for any n. On the other hand, as the swarm size increases, performance improves. For example, there is a statistically significant difference between n = 10 and n = 100, both when observing all agents and one. These results suggest that the key factor in Turing Learning is not the number of observed agents, so much as the richness of information that comes from increasing inter-agent interactions, and is reflected in each agent's motion. This means that in practice, observing a single agent is sufficient to infer accurate models, as long as the swarm size is sufficiently large.
A similar phenomenon was observed in a different scenario, where the goal was to distinguish between different 'modes' of a swarm (i.e., global behaviors) through observing only a few individuals [Eldridge and Maciejewski, 2014].

Inferring Control and Morphology
In the previous sections, we assumed that we fully knew the agents' morphology (i.e., structure), and only their behavior (controller) was to be identified. We now present a variation where one aspect of the morphology is also unknown. The replica, in addition to the four controller parameters, takes a parameter θ ∈ [0, 2π] rad, which determines the horizontal field of view of its sensor, as shown in Fig. 10 (the sensor is still binary). Note that in the previous sections the agents' line-of-sight sensors can be considered as sensors with a field of view of 0 rad.
The models now have five parameters. As before, we let Turing Learning run in an unbounded search space (i.e., now, R 5 ). However, as θ is necessarily bounded, before a model was executed on a replica, the parameter corresponding to θ was mapped to the range (0, 2π) using an appropriately-scaled logistic sigmoid function. The controller parameters were directly passed to the replica. In this setup, the classifiers observed the individuals for 100 s in each trial (preliminary results indicated that this setup requires a longer observation time). Fig. 11(a) shows the parameters of the subjectively best models in the last (1000 th ) generations of 30 runs. The means (standard deviations) of the AEs in each model parameter were: 0.02 (0.01), 0.02 (0.02), 0.05 (0.07), 0.06 (0.06), and 0.01 (0.01). All parameters including θ were still learned with high accuracy.
The case where the true value of θ is 0 rad is an edge case, because given an arbitrarily small > 0, the logistic sigmoid function maps an unbounded domain of values onto (0, ). This makes it simpler for Turing Learning to infer this parameter. For this reason, we also considered another scenario where the agents' angle of view is π/3 rad rather than 0 rad. The controller parameters for -2 achieving aggregation in this case are different from those in Eq. (6). They were found by re-running a grid search with the modified sensor. Fig. 11(b) shows the results from 30 runs with this setup. The means (standard deviations) of the AEs in each parameter were: 0.04 (0.04), 0.03 (0.03), 0.05 (0.06), 0.05 (0.05), and 0.20 (0.19). The controller parameters were still learned with good accuracy. The accuracy in the angle of view is noticeably lower, but still reasonable.

Inferring Other Behaviors
The aggregation controller that agents used in our case study was originally synthesized by searching over the space [−1, 1] 4 , using a metric to assess the swarm's global performance [Gauci et al., 2014b]. Other points in this space lead to different global behaviors that can be 'meaningful' to a human observer (e.g., circle formation [Gauci et al., 2014a]).
We now investigate whether Turing Learning can infer arbitrary controllers in this space, irrespective of the global behaviors they lead to. We generated 1000 controllers randomly in [−1, 1] 4 , with uniform distribution. For each controller we performed one run, and selected the subjectively best model in the last (1000 th ) generation. Fig. 12(a) shows a histogram of the MAE of the inferred models. The distribution has a single mode close to zero, and decays rapidly for increasing values. Over 89% of the 1000 cases have an error below 0.05. This suggests that the accuracy of Turing Learning is not highly sensitive to the particular behavior under investigation (i.e., most behaviors are learned equally well). Fig. 12(b) shows the AEs of each model parameter. The means (standard deviations) of the AEs in each parameter were: 0.01 (0.05), 0.02 (0.07), 0.07 (0.6), and 0.05 (0.20). We performed a statistical test on the AEs between the model parame- ters corresponding to I = 0 (v 0 and v r0 ) and I = 1 (v 1 and v r1 ). The AEs of the inferred v 0 and v r0 were significantly lower than those of v 1 and v r1 . This was likely due to the reason reported in Section 4.3; that is, an agent is likely to spend more time seeing nothing (I = 0) than other agents (I = 1) in each trial.

Separating Replicas and Agents
In our two case studies, the replica was mixed into the group of agents. However, in some collective behaviors, abnormal agent(s) influence the swarm [Bjerknes and Winfield, 2013]. For the same reason, the insertion of a replica that exhibits different behavior or is not recognized as conspecific may disrupt the behavior of the swarm and hence the models obtained may be biased. In this case, an alternative method would be to isolate the influence of the replica(s). We performed an additional simulation study where the agents and replicas were never mixed. Instead, each trial focused on either a group of agents, or of replicas. All replicas in a trial executed the same model. The group size was identical in both cases. The tracking data of the models and agents from each sample were then fed into the classifiers for making judgments.
The distribution of the inferred model parameters is shown in Fig. 13. The results show that Turing Learning can still identify the model parameters well. There is no significant difference between either approach in the case studies considered in this paper. The method of separating replicas and agents is rec- model parameter value of model parameters Figure 13: The model parameters inferred by a variant of Turing Learning that observes swarms of aggregating agents and swarms of replicas in isolation, thereby avoiding potential bias. Each box corresponds to the models with the highest subjective fitness in the 1000 th generation of 30 simulation runs.
ommended if potential biases are suspected.

Physical Experiments
In this section, we present a real-world validation of Turing Learning. We explain how it can be used to identify the behavior of a swarm of real agents. The agents and replica are represented by physical robots. We use the same type of robot (e-puck) as in simulation. The agents execute the aggregation behavior described in Section 3.2.1. The replicas execute the candidate models. We use two replicas to speed up the identification process, as will be explained in Section 5.3.

Physical Platform
The physical setup, shown in Fig. 14, consists of an arena with robots (representing agents or replicas), a personal computer (PC) and an overhead camera. The PC runs the Turing Learning algorithm. It communicates with the replicas, providing them models to be executed, but does not exert any control over the agents. The overhead camera supplies the PC with a video stream of the swarm. The PC performs video processing to obtain motion data about individual robots. We will now describe the physical platform in more detail.

Agents
Overhead Camera  Figure 15: Schematic top-view of an e-puck, indicating the locations of its motors, wheels, camera and infrared sensors. Note that the marker is pointing towards the robot's back.

Robot Arena
The robot arena was rectangular with sides 200 cm × 225 cm, and bounded by walls 50 cm high. The floor had a light gray color, and the walls were painted white.

Robot Platform and Sensor Implementations
A schematic top view of the e-puck is shown in Fig. 15. We implemented the line-of-sight sensor using the e-puck's directional camera, located at its front. For this purpose, we wrapped the robots in black 'skirts' (see Fig. 1) to make them distinguishable against the light-colored arena. While in principle the sensor could be implemented using one pixel, we used a column of pixels from a sub-sampled image to compensate for misalignment in the camera's vertical orientation. The gray values from these pixels were used to distinguish robots (I = 1) against the arena (I = 0). For more details about this sensor realization, see [Gauci et al., 2014b]. We also used the e-puck's infrared sensors, in two cases. Firstly, before each trial, the robots dispersed themselves within the arena. In this case, they used the infrared sensors to avoid both robots and walls, making the dispersion process more efficient. Secondly, we observed that using only the line-of-sight sensor can lead to robots becoming stuck against the walls of the arena, hindering the identification process. We therefore used the infrared sensors for wall avoidance, but in such a way as to not affect inter-robot interactions 7 . Details of these two collision avoidance behaviors are provided in the online supplementary material [Li et al., 2016].

Motion Capture
To facilitate motion data extraction, we fitted robots with markers on their tops, consisting of a colored isosceles triangle on a circular white background (see Fig. 1). The triangle's color allowed for distinction between robots; we used blue triangles for all agents, and orange and purple triangles for the two replicas. The triangle's shape eased extraction of robots' orientations.
The robots' motion was captured using a camera mounted around 270 cm above the arena floor. The camera's frame rate was set to 10 fps. The video stream was fed to the PC, which performed video processing to extract motion data about individual robots (position and orientation). The video processing software was written using OpenCV [Bradski and Kaehler, 2008].

Turing Learning with Physical Robots
Our objective is to infer the agent's aggregation behavior. We do not wish to infer the agent's dispersion behavior, which is periodically executed to distribute already-aggregated agents. To separate these two behaviors, the robots (agents and replicas) and the system were implicitly synchronized. This was realized by making each robot execute a fixed behavioral loop of constant duration. The PC also executed a fixed behavioral loop, but the timing was determined by the signals received from the replicas. Therefore, the PC was synchronized with the swarm. The PC communicated with the replicas via Bluetooth. At the start of a run, or after a human intervention (see Section 5.3), robots were initially synchronized using an infrared signal from a remote control. Fig. 16 shows a flow diagram of the programs run by the PC and the replicas, respectively. Dotted arrows indicate communication between the units.
The program running on the PC has the following states: • P1: Wait for "Stop" Signal. The program is paused until "Stop" signals are received from both replicas. These signals indicate that a trial has finished.
• P2: Send Model Parameters. The PC sends new model parameters to the buffer of each replica.
• P3: Wait for "Start" Signal. The program is paused until "Start" signals are received from both replicas, indicating that a trial is starting.
• P4: Track Robots. The PC waits 1 s and then tracks the robots using the overhead camera for 5 s. The tracking data contains the positions and orientations of the agents and replicas.
• P5: Update Turing Learning Algorithm. The PC uses the motion data from the trial observed in P4 to update the rewards (fitness values) of the corresponding two models and all classifiers. Once all models in the current iteration cycle (generation) have been evaluated, the PC also generates new model and classifier populations. The reward calculation and optimization algorithm are described in Sections 3.1 and 3.2.4, respectively. The PC then goes back to P1.
The program running on the replicas has the following states: • R1 : Send "Stop" Signal. After a trial stops, the replica informs the PC by sending a "Stop" signal. The replica waits 1 s before proceeding with R2, so that all robots remain synchronized. Waiting 1 s in other states serves the same purpose.
• R2 : Disperse. The replica disperses in the environment, while avoiding collisions with other robots and the walls. This behavior lasts 8 s.
• R3 : Receive Model Parameters. The replica reads new model parameters from its buffer (sent earlier by the PC). It waits 1 s before proceeding with R4.
• R4 : Send "Start" Signal. The replica sends a start signal to the PC to inform it that a trial is about to start. The replica waits 1 s before proceeding with R5.
• R5 : Execute Model. The replica moves within the swarm according to its model. This behavior lasts 7 s (the tracking data corresponds to the middle 5 s, see P4 ). The replica then goes back to R1.
The program running on the agents has the same structure as the replica program. However, in the states analogous to R1, R3, and R4, they simply wait 1 s rather than communicate with the PC. In the state corresponding to R2, they also execute the Disperse behavior. In the state corresponding to R5, they execute the agent's aggregation controller, rather than a model. Each iteration (loop) of the program for the PC, replicas and agents lasts 18 s.

Experimental Setup
As in simulation, we used a population size of 100 for classifiers (µ = 50, λ = 50). However, the model population size was reduced from 100 to 20 (µ = 10, λ = 10), to shorten the experimental time. We used 10 robots: 8 representing agents executing the original aggregation controller (Eq. (6)), and 2 representing replicas that executed models. This meant that in each trial, 2 models from the population could be simultaneously evaluated; consequently, each generation consisted of 20/2 = 10 trials.
The Turing Learning algorithm was implemented without any modification to the code used in simulation (except for model population size and observation time in each trial). We still let the model parameters evolve unboundedly (i.e., in R 4 ). However, as the speed of the physical robots is naturally bounded, we applied the hyperbolic tangent function (tanh x) on each model parameter, before sending a model to a replica. This bounded the parameters to (−1, 1) 4 , with −1 and 1 representing the maximum backwards and forwards wheel speeds, respectively.
The Turing Learning runs proceeded autonomously. In the following cases, however, human intervention was made: • The robots had been running continuously for 25 generations. All batteries were replaced.
• Hardware failure occurred on a robot, for example because of a lost battery connection or because the robot became stuck on the floor. Appropriate action was taken for the affected robot to restore its functionality.
• A replica lost its Bluetooth connection with the PC. The connection with both replicas was restarted.
• A robot indicated a low battery status through its LED after running for only a short time. That robot's battery was changed.
After an intervention, the ongoing generation was restarted, to limit the impact on the identification process.

Results
We conducted 10 runs of Turing Learning using the physical system. Each run lasted 100 generations, corresponding to 5 hours (excluding human intervention time). Video recordings of all runs can be found in the online supplementary materials [Li et al., 2016].

Analysis of Inferred Models
We will first investigate the quality of the models obtained. To select the 'best' model from each run, we post-evaluated all models of the final generation 5 times using all classifiers of that generation. The parameters of these models are shown in Fig. 17. The means (standard deviations) of the AEs in each parameter were: 0.08 (0.06), 0.01 (0.01), 0.05 (0.08), and 0.02 (0.04).
To investigate the effects of real-world conditions on the identification process, we performed 10 simulated runs of Turing Learning with the same setup as in the physical runs. Fig. 18 shows the evolutionary dynamics of the parameters of the inferred models (with the highest subjective fitness) in the physical and simulated runs. The dynamics show good correspondence. However, the convergence in the physical runs is somewhat less smooth than that in the simulated ones (e.g., see spikes in v 0 and v 1 ). In each generation of every run (physical and simulated), we computed the MAE of each model. We compared the error of the model with the highest subjective fitness with the average and lowest errors. The results are shown in Fig. 19. For both the physical and simulated model parameter value of model parameters Figure 17: The model parameters Turing Learning inferred from swarms of physical robots performing aggregation. The models are those with the highest subjective fitness in the 100 th generation of 10 runs. Dotted black lines indicate the ground truth, that is, the values of the parameters that the system is expected to learn.  runs, the subjectively best model (green) has an error in between the lowest error (blue) and the average error (red) in the majority of generations.
As we argued before (Section 4.3), in swarm systems, good agreement between local behaviors (e.g., controller parameters) may not guarantee similar global behaviors. For this reason, we performed 20 trials using 40 physical e- The set of initial configurations of the robots was common to both controllers.
As it was not necessary to extract the orientation of the robots, a red circular marker was attached to each robot so that its position can be extracted with higher accuracy in the offline analysis [Gauci et al., 2014b]. Fig. 20(a) shows the proportion of robots in the largest cluster 8 over time with the agent and model controllers. Fig. 20(b) shows the dispersion (as defined in Section 4.3) of the robots over time with the two controllers. The aggregation dynamics of the agents and and models show good correspondence. Fig. 21 shows a sequence of snapshots from a trial with 40 e-pucks executing the inferred model controller.
A video accompanying this paper shows the Turing Learning identification process of the models (in a particular run) both in simulation and on the physical system. Additionally, videos of all 20 post-evaluation trials with 40 e-pucks, are available in [Li et al., 2016].

Analysis of Generated Classifiers
When post-evaluating the classifiers generated in the physical runs of Turing Learning, we limited the number of candidate models to 100, in order to reduce the physical experimentation time. Each candidate model was randomly chosen with uniform distribution from [−1, 1] 4 . Fig. 22 shows the average decision accuracy of the best classifiers and classifier groups over the 10 runs. Similar to the results in simulation, the classifier groups obtained in the physical runs still have a high decision accuracy. The classifier group (archive) has a higher accuracy than that of the classifier group (subjective) and best classifier (objective). However, in contrast to simulation, the decision accuracy of the best classifier (subjective) and best classifier (archive) does not drop within 100 generations. This could be due to the noise present in the physical runs, which may have prevented the classifiers from getting over-specialized in the comparatively short time provided (100 generations).

Conclusion
This paper presented a new system identification method-Turing Learningthat can autonomously infer agent behavior from observations. To our knowledge, Turing Learning is the first system identification method that does not rely on any predefined metric to quantitatively gauge the difference between agents and learned models. This eliminates the need to choose a suitable metric and the bias that such metric may have on the obtained solutions. Through competitive and successive generation of models and classifiers, the system successfully learned two swarm behaviors: self-organized aggregation and object clustering. Both the model parameters, which were automatically inferred, and emergent global behaviors closely matched those of the original swarm system. In addition, simulation results showed that Turing Learning outperforms a metric-based system identification method (which differed by using the least square metric instead of classifiers) in terms of the obtained model accuracy. We also constructed a robust classifier group that, given an individual's motion data, can tell whether the individual is an original agent or not. Such a classifier group could be effective in detecting abnormal behavior, for example, when faults occur in some members of the swarm. These classifiers are automatically produced without the need to define a priori what constitutes abnormal behavior.
The Turing Learning method was further validated using a physical system. We applied it to automatically infer the aggregation behavior of an observed swarm of e-puck robots. The behavior was learned successfully, and the results obtained in the physical experiments showed good correspondence to those obtained in simulation. This shows the robustness of our method with respect to noise and uncertainties in the real world. To the best of our knowledge, this is also the first time that a system identification method was used to infer the behavior of a swarm of physical robots.
A scalability study showed that the interactions in a swarm can be characterized by the effects on a subset of agents. In other words, when learning swarm behaviors especially with large number of agents, instead of considering the motion of all the agents in the group, we could focus on a subset of agents. This becomes critical when the available data about agents in the swarm is limited. Our approach was proven to work even if using only the motion data of a single agent and replica, as the data from this agent implicitly contained enough information about the interactions in the swarm.
In the two case studies presented, the model was explicitly represented by a set of parameters. The inferred parameters could thus be compared against the ground truth, enabling us to objectively gauge the quality of inferred models in the two case studies as well as for 1000 randomly sampled behaviors. Although the search space for the models is relatively small, identifying the parameters is challenging as the input values are unknown (consequently, a metric-based evolutionary algorithm did not succeed in approximating the parameter values). Parameter estimation plays an important part in the modeling of biological systems, as the model structure is often assumed to be known [Gautrais et al., 2012]. In principle, Turing Learning could also infer the structure of the agent's control system. The results of learning the agent's angle of view showed that our method may even learn the morphology of the swarming agents.
In the future, we intend to use Turing Learning to infer behaviors exhibited in natural swarms, such as in shoals of fish or herds of land mammals.