## 1 Introduction

Flocking or swarming in groups of social animals (birds, fish, ants, bees, etc.) that results in a particular global formation is an emergent collective behavior that continues to fascinate researchers [1, 7]. One would like to know if such a formation serves a higher purpose, and, if so, what that purpose is.

One well-studied flight-formation behavior is V-formation. Most of the work in this area has concentrated on devising simple dynamical rules that, when followed by each bird, eventually stabilize the flock to the desired V-formation [11, 12, 26]. This approach, however, does not shed very much light on the overall purpose of this emergent behavior.

In previous work [35, 36], we hypothesized that flying in V-formation is nothing but an optimal policy for a flocking-based Markov Decision Process (MDP) $$\mathcal {M}$$. States of $$\mathcal {M}$$, at discrete time t, are of the form $$({\varvec{x}}_i(t),{\varvec{v}}_i(t))$$, $$1\,{\leqslant }\,i \,{\leqslant }\,N$$, where $${\varvec{x}}_i(t)$$ and $${\varvec{v}}_i(t)$$ are N-vectors (for an N-bird flock) of 2-dimensional positions and velocities, respectively. $$\mathcal {M}$$’s transition relation, shown here for bird i, is simply and generically given by

\begin{aligned} {\varvec{x}}_i(t + 1)= & {} {\varvec{x}}_i(t) + {\varvec{v}}_i(t+1),\\ {\varvec{v}}_i(t + 1)= & {} {\varvec{v}}_i(t) + {\varvec{a}}_i(t), \end{aligned}

where $${\varvec{a}}_i(t)$$ is an action, a 2-dimensional acceleration in this case, that bird i can take at time t. $$\mathcal {M}$$’s cost function reflects the energy-conservation, velocity-alignment and clear-view benefits enjoyed by a state of $$\mathcal {M}$$ (see Sect. 2).

In this paper, we not only confirm this hypothesis, but we also devise a very general adaptive, receding-horizon synthesis algorithm (ARES) that, given an MDP and one of its initial states, generates an optimal plan (action sequence) taking that state to a state whose cost is below a desired threshold. In fact, ARES implicitly defines an optimal, online-policy, synthesis algorithm that could be used in practice if plan generation can be performed in real-time.

ARES makes repeated use of Particle Swarm Optimization (PSO) [23] to effectively generate a plan. This is in principle unnecessary, as one could generate an optimal plan by calling PSO only once, with a maximum plan-length horizon. Such an approach, however, is in most cases impractical, as every unfolding of the MDP adds a number of new dimensions to the search space. Consequently, to obtain an adequate coverage of this space, one needs a very large number of particles, a number that is either going to exhaust available memory or require a prohibitive amount of time to find an optimal plan.

A simple solution to this problem would be to use a short horizon, typically of size two or three. This is indeed the current practice in Model Predictive Control (MPC) [13]. This approach, however, has at least three major drawbacks. First, and most importantly, it does not guarantee convergence and optimality, as one may oscillate or become stuck in a local optimum. Second, in some of the steps, the window size is unnecessarily large, thereby negatively impacting performance. Third, in other steps, the window size may not be large enough to guide the optimizer out of a local minimum (see Fig. 1 (left)). One would therefore like to find the proper window size adaptively, but the question is how one can do it.

Inspired by Importance Splitting (IS), a sequential Monte-Carlo technique for estimating the probability of rare events, we introduce the notion of a level-based horizon (see Fig. 1 (right)). Level $$\ell _0$$ is the cost of the initial state, and level $$\ell _m$$ is the desired threshold. By using a state function, asymptotically converging to the desired threshold, we can determine a sequence of levels, ensuring convergence of ARES towards the desired optimal state(s) having a cost below $$\ell _m\,{=}\,\varphi$$.

The levels serve two purposes. First, they implicitly define a Lyapunov function, which guarantees convergence. If desired, this function can be explicitly generated for all states, up to some topological equivalence. Second, the levels help PSO overcome local minima (see Fig. 1 (left)). If reaching a next level requires PSO to temporarily pass over a state-cost ridge, ARES incrementally increases the size of the horizon, up to a maximum length.

Another idea imported from IS is to maintain n clones of the initial state at a time, and run PSO on each of them (see Fig. 3). This allows us to call PSO for each clone and desired horizon, with a very small number of particles per clone. Clones that do not reach the next level are discarded, and the successful ones are resampled. The number of particles is increased if no clone reaches a next level, for all horizons chosen. Once this happens, we reset the horizon to one, and repeat the process. In this way, we adaptively focus our resources on escaping from local minima. At the last level, we choose the optimal particle (a V-formation in case of flocking) and traverse its predecessors to find a plan.

We assess the rate of success in generating optimal plans in the form of an $$(\varepsilon ,\delta )$$-approximation scheme, for a desired error margin $$\varepsilon$$, and confidence ratio $$1{-}\delta$$. Moreover, we can use the state-action pairs generated during the assessment (and possibly some additional new plans) to construct an explicit (tabled) optimal policy, modulo some topological equivalence. Given enough memory, one can use this policy in real time, as it only requires a table look-up.

To experimentally validate our approach, we have applied ARES to the problem of V-formation in bird flocking (with a deterministic MDP). The cost function to be optimized is defined as a weighted sum of the (flock-wide) clear-view, velocity-alignment, and upwash-benefit metrics. Clear view and velocity alignment are more or less obvious goals. Upwash optimizes energy savings. By flapping its wings, a bird generates a trailing upwash region off its wing tips; by using this upwash, a bird flying in this region (left or right) can save energy. Note that by requiring that at most one bird does not feel its effect, upwash can be used to define an analog version of a connected graph.

We ran ARES on 8,000 initial states chosen uniformly at random, such that the birds are packed closely enough to feel upwash, but not so closely that they collide. We succeeded in generating a V-formation 95% of the time, with an error margin of 0.05 and a confidence ratio of 0.99. The error margin and confidence ratio improve dramatically if we consider all generated states and the fact that each state within a plan is independent of the states in all other plans.

The rest of this paper is organized as follows. Section 2 reviews our work on bird flocking and V-formation, and defines the manner in which we measure the cost of a flock (formation). Section 3 revisits the swarm optimization algorithm used in this paper, and Sect. 4 examines the main characteristics of importance splitting. Section 5 states the definition of the problem we are trying to solve. Section 6 introduces ARES, our adaptive receding-horizon synthesis algorithm for optimal plans, and discusses how we can extend this algorithm to explicitly generate policies. Section 7 measures the efficiency of ARES in terms of an $$(\varepsilon ,\delta )$$-approximation scheme. Section 8 compares our algorithm to related work, and Sect. 9 draws our conclusions and discusses future work.

## 2 V-Formation MDP

We represent a flock of birds as a dynamically evolving system. Every bird in our model [16] moves in 2-dimensional space performing acceleration actions determined by a global controller. Let $${\varvec{x}}_i(t), {\varvec{v}}_i(t)$$ and $${\varvec{a}}_i(t)$$ be 2-dimensional vectors of positions, velocities, and accelerations, respectively, of bird i at time t, where $$i\,{\in }\,\{1,\ldots ,b\}$$, for a fixed b. The discrete-time behavior of bird i is then

\begin{aligned} {\varvec{x}}_i(t + 1)&= {\varvec{x}}_i(t) + {\varvec{v}}_i(t + 1),\nonumber \\ {\varvec{v}}_i(t + 1)&= {\varvec{v}}_i (t)+ {\varvec{a}}_i(t). \end{aligned}
(1)

The controller detects the positions and velocities of all birds through sensors, and uses this information to compute an optimal acceleration for the entire flock. A bird uses its own component of the solution to update its velocity and position.
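As a minimal sketch of the update in Eq. (1), the following Python fragment advances a small flock by one time step. The array layout (one row per bird, columns for the two spatial components) and the function name `step` are our own illustrative assumptions, not part of the model in [16].

```python
import numpy as np

def step(x, v, a):
    """One step of Eq. (1): update the velocity first, then move with the NEW velocity."""
    v_next = v + a
    x_next = x + v_next
    return x_next, v_next

# Three birds in the plane; rows are birds, columns are (x, y) components.
x = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
v = np.array([[0.5, 0.5], [0.5, 0.25], [0.25, 0.5]])
a = np.array([[0.0, 0.0], [0.0, 0.25], [0.25, 0.0]])  # hypothetical controller output

x1, v1 = step(x, v, a)
```

Note that the position update uses $${\varvec{v}}_i(t+1)$$ rather than $${\varvec{v}}_i(t)$$, so an acceleration chosen at time t already affects the position at time t + 1.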

We extend this discrete-time dynamical model to a (deterministic) MDP by adding a cost (fitness) function based on the following metrics inspired by [35]:

• Clear View ($${ CV}$$). A bird’s visual field is a cone with angle $$\theta$$ that can be blocked by the wings of other birds. We define the clear-view metric by accumulating the percentage of a bird’s visual field that is blocked by other birds. Figure 2 (left) illustrates the calculation of the clear-view metric. The optimal value in a V-formation is $${ CV}^*{=}\,0$$, as all birds have a clear view.

• Velocity Matching ($${ VM}$$). The accumulated differences between the velocity of each bird and all other birds, summed up over all birds in the flock defines $${ VM}$$. Figure 2 (middle) depicts the values of $${ VM}$$ in a velocity-unmatched flock. The optimal value in a V-formation is $${ VM}^*{=}\,0$$, as all birds will have the same velocity (thus maintaining the V-formation).

• Upwash Benefit ($${ UB}$$). The trailing upwash is generated near the wingtips of a bird, while downwash is generated near the center of a bird. We accumulate all birds’ upwash benefits using a Gaussian-like model of the upwash and downwash region, as shown in Fig. 2 (right) for the right wing. The maximum upwash a bird can obtain has an upper bound of 1. For bird i with $${ UB}_i$$, we use $$1\,{-}\,{ UB}_i$$ as its upwash-benefit metric, because the optimization algorithm performs minimization of the fitness metrics. The optimal value in a V-formation is $${ UB}^*\,{=}\,1$$, as the leader does not receive any upwash.

Finding smooth and continuous formulations of the fitness metrics is a key element of solving optimization problems. The PSO algorithm has a very low probability of finding an optimal solution if the fitness metric is not well-designed.

Let $$\varvec{c}(t)\,{=}\,\{\varvec{c}_i(t)\}_{i=1}^b\,{=}\,\{{\varvec{x}}_i(t), {\varvec{v}}_i(t)\}_{i=1}^b\,{\in }\,\mathbb {R}$$ be a flock configuration at time-step t. Given the above metrics, the overall fitness (cost) metric J is a sum-of-squares combination of $${ VM}$$, $${ CV}$$, and $${ UB}$$ defined as follows:

\begin{aligned} J(\varvec{c}(t),{\varvec{a}}^h(t),{h}) = ({ CV}(\varvec{c}_{{\varvec{a}}}^{h}(t))-{ CV}^*)^2&+ ({ VM}(\varvec{c}_{{\varvec{a}}}^{h}(t))-{ VM}^*)^2 \nonumber \\ {}&+({ UB}(\varvec{c}_{{\varvec{a}}}^{h}(t))-{ UB}^*)^2, \end{aligned}
(2)

where h is the receding prediction horizon (RPH), $${\varvec{a}}^h(t)\,{\in }\,\mathbb {R}$$ is a sequence of accelerations of length h, and $$\varvec{c}_{{\varvec{a}}}^{h}(t)$$ is the configuration reached after applying $${\varvec{a}}^h(t)$$ to $$\varvec{c}(t)$$. Formally, we have

\begin{aligned} \varvec{c}_{{\varvec{a}}}^{h}(t)= \{{\varvec{x}}_{{\varvec{a}}}^{h}(t), {\varvec{v}}_{{\varvec{a}}}^{h}(t)\} = \{{\varvec{x}}(t)+\sum _{\tau =1}^{{h}(t)}{\varvec{v}}(t+\tau ), {\varvec{v}}(t)+\sum _{\tau =1}^{{h}(t)} {\varvec{a}}^\tau (t) \}, \end{aligned}
(3)

where $${\varvec{a}}^\tau (t)$$ is the $$\tau$$th acceleration of $${\varvec{a}}^h(t)$$. A novelty of this paper is that, as described in Sect. 6, we allow RPH h(t) to be adaptive in nature.
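To make Eqs. (2) and (3) concrete, the sketch below unfolds a configuration under an acceleration sequence and combines arbitrary metrics into a sum-of-squares cost. The function names and the particular velocity-matching formula (a sum of squared pairwise velocity differences) are assumptions chosen for illustration; the exact definitions of $${ CV}$$, $${ VM}$$, and $${ UB}$$ are those of [35] and are not reproduced here.

```python
import numpy as np

def unfold(x, v, accels):
    """Eq. (3): apply a sequence of accelerations (shape h x b x 2) via Eq. (1)."""
    for a in accels:
        v = v + a
        x = x + v
    return x, v

def velocity_matching(v):
    """Illustrative VM stand-in: sum of squared pairwise velocity differences."""
    diffs = v[:, None, :] - v[None, :, :]
    return 0.5 * float(np.sum(diffs ** 2))  # 0.5 because each pair is counted twice

def fitness(x, v, accels, metrics, optima):
    """Eq. (2): squared deviation of each metric from its optimum, summed."""
    xh, vh = unfold(x, v, accels)
    return sum((metric(xh, vh) - opt) ** 2 for metric, opt in zip(metrics, optima))

# Two birds whose velocities become equal after a plan of length h = 2.
x0 = np.zeros((2, 2))
v0 = np.array([[1.0, 0.0], [0.0, 1.0]])
plan = np.array([[[0.0, 0.5], [0.5, 0.0]],
                 [[0.0, 0.5], [0.5, 0.0]]])
xh, vh = unfold(x0, v0, plan)
J_vm = fitness(x0, v0, plan, [lambda x, v: velocity_matching(v)], [0.0])
```

In this toy case only the velocity-matching term is used, so the cost drops to zero as soon as both birds share the same velocity; with all three metrics, each would be passed as a further entry of `metrics` together with its optimal value.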

The fitness function J has an optimal value of 0 in a perfect V-formation. The main goal of ARES is to compute a sequence of acceleration actions that leads the flock from a random initial configuration to a controlled V-formation with optimal fitness, that is, a formation that conserves energy during flight while giving every bird a clear visual field and visibility of its lateral neighbors. Similar to the centralized version of the approach given in [35], ARES performs a single flock-wide minimization of J at each time-step t to obtain an optimal plan of length h of acceleration actions:

\begin{aligned}&\mathbf {opt-}{\varvec{a}}^{h}(t)=\{\mathbf {opt-}{\varvec{a}}_i^{h}(t)\}_{i=1}^{b}=\mathop {\mathrm {arg\,min}}_{{\varvec{a}}^h(t)}J(\varvec{c}(t),{\varvec{a}}^h(t),{h}). \end{aligned}
(4)

The optimization is subject to the following constraints on the maximum velocities and accelerations: $$||{\varvec{v}}_i(t)||\,{\leqslant }\,{\varvec{v}}_{max}, ||{\varvec{a}}^h_i(t)||\,{\leqslant }\,\rho ||{\varvec{v}}_i(t)||\,\forall \,i\,{\in }\,\{1,\ldots ,b\}$$, where $${\varvec{v}}_{max}$$ is a constant and $$\rho \,{\in }\,(0,1)$$. The above constraints prevent us from using mixed-integer programming; we might, however, compare our solution to other continuous optimization techniques in the future. The initial positions and velocities of each bird are selected at random within certain ranges, and limited such that the distance between any two birds is greater than a (collision) constant $$d_{min}$$, and small enough for all birds, except for at most one, to feel the $${ UB}$$. In the following sections, we demonstrate how to generate optimal plans taking the initial state to a stable state with optimal fitness.
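The constraints above can be checked directly for a candidate action. The sketch below only tests feasibility; it does not say how the optimizer enforces the bounds, and the function names are hypothetical.

```python
import numpy as np

def feasible(v, a, v_max, rho):
    """Check ||v_i|| <= v_max and ||a_i|| <= rho * ||v_i|| for every bird i."""
    v_norm = np.linalg.norm(v, axis=1)
    a_norm = np.linalg.norm(a, axis=1)
    return bool(np.all(v_norm <= v_max) and np.all(a_norm <= rho * v_norm))

def collision_free(x, d_min):
    """Check that every pair of birds is more than d_min apart."""
    diffs = x[:, None, :] - x[None, :, :]
    dist = np.linalg.norm(diffs, axis=2)
    upper = np.triu_indices(len(x), k=1)  # each unordered pair once
    return bool(np.all(dist[upper] > d_min))

v = np.array([[0.5, 0.0], [0.0, 0.5]])
ok = feasible(v, np.array([[0.1, 0.0], [0.0, 0.2]]), v_max=1.0, rho=0.5)
too_large = feasible(v, np.array([[0.1, 0.0], [0.0, 0.3]]), v_max=1.0, rho=0.5)
apart = collision_free(np.array([[0.0, 0.0], [1.0, 0.0]]), d_min=0.5)
```

Because the acceleration bound scales with the current speed, a slow bird can change its velocity only slightly in one step, which is one reason why several steps may be needed to leave a poor configuration.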

## 3 Particle Swarm Optimization

Particle Swarm Optimization (PSO) is a randomized approximation algorithm for computing the value of a parameter minimizing a possibly nonlinear cost (fitness) function. Interestingly, PSO itself is inspired by bird flocking [23]. Hence, PSO assumes that it works with a flock of birds.

Note, however, that in our running example, these birds are “acceleration birds” (or particles), and not the actual birds in the flock. Each bird has the same goal, finding food (reward), but none of them knows the location of the food. However, every bird knows the distance (horizon) to the food location. PSO works by moving each bird preferentially toward the bird closest to food.

ARES uses the Matlab toolbox function $$\texttt {particleswarm}$$, which performs the classical version of PSO. This PSO creates a swarm of particles, of size say p, uniformly at random within a given bound on their positions and velocities. Note that in our example, each particle itself represents a flock of bird-acceleration sequences $$\{{\varvec{a}}_i^{{h}}\}_{i=1}^b$$, where h is the current length of the receding horizon. PSO further chooses a neighborhood of a random size for each particle j, $$j\,{\in }\,\{1,\ldots ,p\}$$, and computes the fitness of each particle. Based on the fitness values, PSO stores two vectors for j: its so-far personal-best position $$\mathbf {x}_{P}^j(t)$$, and its fittest neighbor’s position $$\mathbf {x}_{G}^j(t)$$. The positions and velocities of each particle j in the particle swarm $$1\,{\leqslant }\,j\,{\leqslant }\,p$$ are updated according to the following rule:

\begin{aligned} \mathbf {v}^j(t+1) = \omega \cdot \mathbf {v}^j(t)&+ y_1\cdot \mathbf {u_1}(t+1)\otimes (\mathbf {x}_{P}^j(t)-\mathbf {x}^j(t)) \nonumber \\&+ y_2\cdot \mathbf {u_2}(t+1)\otimes (\mathbf {x}_{G}^j(t)-\mathbf {x}^j(t)), \end{aligned}
(5)

where $$\omega$$ is the inertia weight, which determines the trade-off between global and local exploration of the swarm (the value of $$\omega$$ is proportional to the exploration range); $$y_1$$ and $$y_2$$ are the self-adjustment and social-adjustment weights, respectively; $$\mathbf {u_1},\mathbf {u_2}\,{\sim }\,\mathrm{Uniform}(0,1)$$ are randomization factors; and $$\otimes$$ is the element-wise (Hadamard) product, that is, for any random vector $$\mathbf {z}$$, $$(\mathbf {z}_1,\ldots ,\mathbf {z}_b)\otimes (\mathbf {x}_1^j,\ldots ,\mathbf {x}_b^j)=(\mathbf {z}_1\mathbf {x}_1^j,\ldots ,\mathbf {z}_b\mathbf {x}_b^j)$$.

If the fitness value for $$\mathbf {x}^j(t+1)\,{=}\,\mathbf {x}^j(t)\,{+}\,\mathbf {v}^j(t+1)$$ is lower than the one for $$\mathbf {x}_{P}^j(t)$$, then $$\mathbf {x}^j(t+1)$$ is assigned to $$\mathbf {x}_{P}^j(t+1)$$. The particle with the best fitness over the whole swarm becomes a global best for the next iteration. The procedure is repeated until the number of iterations reaches its maximum, the time elapses, or the minimum criterion is satisfied. In this way, we obtain the best acceleration for our bird-flock example.
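The update rule in Eq. (5) can be illustrated with a small, self-contained PSO loop. This is a simplified sketch and not the Matlab $$\texttt {particleswarm}$$ implementation: it uses the whole swarm as every particle's neighborhood, fixed weights, and a fixed iteration count, and all parameter values are illustrative assumptions.

```python
import numpy as np

def pso_minimize(f, dim, p=20, iters=100, omega=0.7, y1=1.5, y2=1.5,
                 bound=5.0, seed=0):
    """Minimize f over R^dim with a basic global-best particle swarm."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-bound, bound, size=(p, dim))  # particle positions
    v = rng.uniform(-bound, bound, size=(p, dim))  # particle velocities
    best_x = x.copy()                              # personal bests x_P
    best_f = np.array([f(xi) for xi in x])
    for _ in range(iters):
        g = best_x[np.argmin(best_f)]              # swarm-wide best, used as x_G
        u1 = rng.uniform(size=(p, dim))
        u2 = rng.uniform(size=(p, dim))
        # Eq. (5): inertia + pull toward personal best + pull toward swarm best.
        v = omega * v + y1 * u1 * (best_x - x) + y2 * u2 * (g - x)
        x = x + v
        fx = np.array([f(xi) for xi in x])
        improved = fx < best_f                     # update personal bests only on improvement
        best_x[improved] = x[improved]
        best_f[improved] = fx[improved]
    i = int(np.argmin(best_f))
    return best_x[i], float(best_f[i])

# Toy objective: the sphere function, whose minimum is 0 at the origin.
best_pos, best_val = pso_minimize(lambda z: float(np.sum(z ** 2)), dim=2)
```

In the flocking setting, a particle position would be a flattened acceleration sequence of dimension $$2\,b\,h$$, which shows why the search space grows with every additional unfolding of the horizon.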

## 4 Importance Splitting

Importance Splitting (IS) is a sequential Monte-Carlo approximation technique for estimating the probability of rare events in a Markov process [21]. The algorithm uses a sequence $$S_0, S_1, S_2, \ldots , S_m$$ of sets of states (of increasing “importance”) such that $$S_0$$ is the set of initial states and $$S_m$$ is the set of states defining the rare event. The probability p, computed as $$\mathbf {P}(S_m\,|\,S_0)$$ of reaching $$S_m$$ from the initial set of states $$S_0$$, is assumed to be extremely low (thus, a rare event), and one desires to estimate this probability [15]. Random sampling approaches, such as the additive-error approximation algorithm described in Sect. 7, are bound to fail (are intractable) in this case, as they would require an enormous number of samples to estimate p with low-variance.

Importance splitting is a way of decomposing the estimation of p. In IS, the sequence $$S_0,S_1,\ldots$$ of sets of states is defined so that the conditional probabilities $$p_i\,{=}\,\mathbf {P}(S_i\,|\,S_{i-1})$$ of going from one level, $$S_{i-1}$$, to the next one, $$S_i$$, are considerably larger than p, and essentially equal to one another. The resulting probability of the rare event is then calculated as the product $$p\,{=}\,\prod _{i=1}^{m}p_i$$ of the intermediate probabilities. The levels can be defined adaptively [22].

To estimate $$p_i$$, IS uses a swarm of particles of size N, with a given initial distribution over the states of the stochastic process. During stage i of the algorithm, each particle starts at level $$S_{i-1}$$ and traverses the states of the stochastic process, checking if it reaches $$S_i$$. If, at the end of the stage, the particle fails to reach $$S_i$$, the particle is discarded. Suppose that $$K_i$$ particles survive. In this case, $$p_i\,{=}\,K_i{/}N$$. Before starting the next stage, the surviving particles are resampled, such that IS once again has N particles. Whereas IS is used for estimating probability of a rare event in a Markov process, we use it here for synthesizing a plan for a controllable Markov process, by combining it with ideas from controller synthesis (receding-horizon control) and nonlinear optimization (PSO).
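The stage-wise estimation described above can be sketched on a toy Markov process. The example below is our own illustration and is not taken from [21]: a random walk starts at state 1, moves up with probability 0.3 and down otherwise, and the rare event is reaching a given level before hitting 0. Its exact probability is known in closed form (gambler's ruin), which allows a sanity check.

```python
import random

def splitting_estimate(levels, n, up=0.3, seed=0):
    """Estimate P(walk from 1 reaches `levels` before 0) by importance splitting."""
    rng = random.Random(seed)
    particles = [1] * n                  # all particles start in S_0 = {1}
    estimate = 1.0
    for target in range(2, levels + 1):  # stage i aims at S_i = {state >= target}
        survivors = []
        for state in particles:
            while 0 < state < target:    # walk until the level is reached or 0 is hit
                state += 1 if rng.random() < up else -1
            if state >= target:
                survivors.append(state)
        if not survivors:
            return 0.0
        estimate *= len(survivors) / n   # p_i = K_i / N
        particles = [rng.choice(survivors) for _ in range(n)]  # resample back to N
    return estimate

def exact_probability(levels, up=0.3):
    """Gambler's-ruin probability of reaching `levels` before 0, starting from 1."""
    r = (1 - up) / up
    return (1 - r) / (1 - r ** levels)

estimate = splitting_estimate(levels=8, n=2000)
exact = exact_probability(levels=8)
```

Each stage here has a success probability between roughly 0.3 and 0.43, so a few thousand particles per stage suffice, whereas naive sampling would need on the order of thousands of runs just to observe the event once.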

## 5 Problem Definition

### Definition 1

A Markov decision process (MDP) $$\mathcal {M}$$ is a sequential decision problem that consists of a set of states S (with an initial state $$s_0$$), a set of actions A, a transition model T, and a cost function J. An MDP is deterministic if for each state and action, $$T\,{:}\,S\,{\times }\,{A}\,{\rightarrow }\,{S}$$ specifies a unique state.

### Definition 2

The optimal plan synthesis problem for an MDP $$\mathcal {M}$$, an arbitrary initial state $$s_0$$ of $$\mathcal {M}$$, and a threshold $$\varphi$$ is to synthesize a sequence of actions $${\varvec{a}}^{i}$$ of length $$1\,{\leqslant }\,i\,{\leqslant }\,m$$ taking $$s_0$$ to a state $$s^{*}$$ such that cost $$J(s^{*})\,{\leqslant }\,\varphi$$.

Section 6 presents our adaptive receding-horizon synthesis algorithm (ARES) for the optimal plan synthesis problem. In our flocking example (Sect. 2), ARES is used to synthesize a sequence of acceleration-actions bringing an arbitrary bird flock $$s_0$$ to an optimal state of V-formation $$s^{*}$$. We assume that we can easily extend such an optimal plan to maintain the cost of successor states below $$\varphi$$ ad infinitum (optimal stability).
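Definitions 1 and 2 translate into a small data structure. The following sketch, with hypothetical names and a one-dimensional toy MDP in place of the flock, represents a deterministic MDP and checks whether a given plan solves the optimal plan synthesis problem for a threshold $$\varphi$$ and maximum length m.

```python
from dataclasses import dataclass
from typing import Callable, Generic, Sequence, TypeVar

S = TypeVar("S")  # state type
A = TypeVar("A")  # action type

@dataclass
class DeterministicMDP(Generic[S, A]):
    initial: S                       # initial state s_0
    transition: Callable[[S, A], S]  # T : S x A -> S
    cost: Callable[[S], float]       # cost function J

def run_plan(mdp: DeterministicMDP, plan: Sequence) -> object:
    """Apply the actions of a plan in order, starting from the initial state."""
    state = mdp.initial
    for action in plan:
        state = mdp.transition(state, action)
    return state

def solves(mdp: DeterministicMDP, plan: Sequence, phi: float, m: int) -> bool:
    """Definition 2: the plan has length in [1, m] and ends in a state with cost <= phi."""
    return 1 <= len(plan) <= m and mdp.cost(run_plan(mdp, plan)) <= phi

# Toy MDP: a point on the real line, actions are displacements, cost is s^2.
toy = DeterministicMDP(initial=3.0,
                       transition=lambda s, a: s + a,
                       cost=lambda s: s * s)
good = solves(toy, [-1.0, -1.0, -1.0], phi=1e-3, m=20)
bad = solves(toy, [-1.0], phi=1e-3, m=20)
```

For the flocking MDP, the state would be the configuration $$\varvec{c}(t)$$, an action the flock-wide acceleration, and the cost the fitness J of Sect. 2.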

## 6 The ARES Algorithm for Plan Synthesis

As mentioned in Sect. 1, one could in principle solve the optimization problem defined in Sect. 5 by calling the PSO only once, with a horizon h in $$\mathcal {M}$$ equaling the maximum length m allowed for a plan. This approach, however, tends to explode the search space, and is therefore in most cases intractable. Indeed, preliminary experiments with this technique applied to our running example could not generate any convergent plan.

A more tractable approach is to make repeated calls to PSO with a small horizon length h. The question is how small h can be. The current practice in model-predictive control (MPC) is to use a fixed h, $$1\,{\leqslant }\,h\,{\leqslant }\,3$$ (see the outer loop of Fig. 3, where resampling and conditional branches are disregarded). Unfortunately, this forces the selection of locally-optimal plans (of size at most three) in each call, and there is no guarantee of convergence when joining them together. In fact, in our running example, we were able to find plans leading to a V-formation for only $$45\%$$ of 10,000 random initial flocks.

Inspired by IS (see Figs. 1 (right) and 3), we introduce the notion of a level-based horizon, where level $$\ell _0$$ equals the cost of the initial state, and level $$\ell _m$$ equals the threshold $$\varphi$$. Intuitively, by using an asymptotic cost-convergence function ranging from $$\ell _0$$ to $$\ell _{m}$$, and dividing its graph in m equal segments, we can determine on the vertical axis a sequence of levels ensuring convergence.

The asymptotic function ARES implements is essentially $$\ell _{i}\,{=}\,\ell _0\,(m-i){/}\,m$$, but specifically tuned for each particle. Formally, if particle k has previously reached a level equal to $$J_k(s_{i-1})$$, then its next target level is within the distance $$\varDelta _k\,{=}\,J_k(s_{i-1}){/}(m\,{-}\,i\,{+}\,1)$$. In Fig. 3, after the particles pass the thresholds assigned to them, the values of the cost function in the current state $$s_i$$ are sorted in ascending order $$\{\widehat{J}_{k}\}_{k=1}^n$$. The lowest cost $$\widehat{J}_1$$ must lie at least $$\varDelta _1$$ below the previous level $$\ell _{i-1}$$ for the algorithm to proceed to the next level $$\ell _i\,{:=}\,\widehat{J}_1$$.
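The relation between the per-particle threshold $$\varDelta _k$$ and the nominal level function can be checked numerically. The sketch below (function names are ours) assumes an idealized particle that lowers its cost by exactly $$\varDelta _k$$ at every level; under that assumption the visited costs coincide with $$\ell _{i}\,{=}\,\ell _0\,(m-i){/}\,m$$.

```python
def next_threshold(cost_prev, i, m):
    """Delta_k = J_k(s_{i-1}) / (m - i + 1) for a particle about to attempt level i."""
    return cost_prev / (m - i + 1)

def ideal_levels(l0, m):
    """Costs visited when each level lowers the cost by exactly Delta_k."""
    levels = [l0]
    for i in range(1, m + 1):
        levels.append(levels[-1] - next_threshold(levels[-1], i, m))
    return levels

levels = ideal_levels(10.0, 5)
nominal = [10.0 * (5 - i) / 5 for i in range(6)]
```

A particle that overshoots its threshold at some level starts the next level from a lower cost, so its subsequent thresholds shrink accordingly; this is the sense in which the function is tuned per particle.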

The levels serve two purposes. First, they implicitly define a Lyapunov function, which guarantees convergence. If desired, this function can be explicitly generated for all states, up to some topological equivalence. Second, the levels $$\ell _{i}$$ help PSO overcome local minima (see Fig. 1 (left)). If reaching a next level requires PSO to temporarily pass over a state-cost ridge, then ARES incrementally increases the size of the horizon h, up to a maximum size $$h_{max}$$. For particle k, passing the thresholds $$\varDelta _k$$ means that it reaches a new level, and the definition of $$\varDelta _k$$ ensures a smooth degradation of its threshold.

Another idea imported from IS and shown in Fig. 3, is to maintain n clones $$\{\mathcal {M}_k\}_{k=1}^n$$ of the MDP $$\mathcal {M}$$ (and its initial state) at any time t, and run PSO, for a horizon h, on each h-unfolding $$\mathcal {M}^h_k$$ of them. This results in an action sequence $${\varvec{a}}^{h}_k$$ of length h (see Algorithm 1). This approach allows us to call PSO for each clone and desired horizon, with a very small number of particles p per clone.

To check which particles have overcome their associated thresholds, we sort the particles according to their current cost, and split them into two sets: the successful set, with indexes in $$\mathcal {I}$$, whose costs are lower than the median among all clones; and the unsuccessful set, with indexes in $$\{1,{\ldots },n\}\,{\setminus }\mathcal {I}$$, which is discarded. The discarded clones are then replaced by sampling uniformly at random from the successful set $$\mathcal {I}$$ (see Algorithm 2).
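The median-based split and replenishment just described can be sketched as follows. This is an illustration under our own assumptions (clones represented as opaque objects with a cost list, and a fallback when no cost is strictly below the median), not a transcription of Algorithm 2.

```python
import random
import statistics

def resample(clones, costs, rng):
    """Keep clones whose cost is below the median; refill the rest from the kept ones."""
    median = statistics.median(costs)
    keep = [k for k, c in enumerate(costs) if c < median]
    if not keep:  # assumption: if all costs are equal, nothing is discarded
        return list(clones), list(costs)
    new_clones, new_costs = [], []
    for k in range(len(clones)):
        # Successful clones stay in place; discarded ones copy a random successful clone.
        src = k if k in keep else rng.choice(keep)
        new_clones.append(clones[src])
        new_costs.append(costs[src])
    return new_clones, new_costs

rng = random.Random(0)
new_clones, new_costs = resample(["a", "b", "c", "d"], [4.0, 1.0, 3.0, 2.0], rng)
```

After resampling, the population size is again n, and every clone carries a cost below the previous median, which concentrates the subsequent PSO calls on the most promising states.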

The number of particles is increased ($$p\,{:=}\,p\,{+}\,p_{inc}$$) if no clone reaches a next level, for all horizons chosen. Once this happens, we reset the horizon to one, and repeat the process. In this way, we adaptively focus our resources on escaping from local minima. From the last level, we choose the state $$s^{*}$$ with the minimal cost, and traverse all of its predecessor states to find an optimal plan composed of actions $$\{{\varvec{a}}^i\}_{1\leqslant i\leqslant m}$$ that led MDP $$\mathcal {M}$$ to the optimal state $$s^*$$. In our running example, we select a flock in V-formation, and traverse all its predecessor flocks. The overall procedure of ARES is shown in Algorithm 3.

### Proposition 1

(Optimality and Minimality). (1) Let $$\mathcal {M}$$ be an MDP. For any initial state $$s_0$$ of $$\mathcal {M}$$, ARES is able to solve the optimal-plan synthesis problem for $$\mathcal {M}$$ and $$s_0$$. (2) An optimal choice of m in function $$\varDelta _k$$, for some particle k, ensures that ARES also generates the shortest optimal plan.

### Proof

(Sketch). (1) The dynamic-threshold function $$\varDelta _k$$ ensures that the initial cost in $$s_0$$ is continuously decreased until it falls below $$\varphi$$. Moreover, for an appropriate number of clones, by adaptively determining the horizon and the number of particles needed to overcome $$\varDelta _k$$, ARES always converges, with probability 1, to an optimal state, given enough time and memory. (2) This follows from convergence property (1), and from the fact that ARES always gives preference to the shortest horizon while trying to overcome $$\varDelta _k$$.

The optimality referred to in the title of the paper is in the sense of (1). One, however, can do even better than (1), in the sense of (2), by empirically determining parameter m in the dynamic-threshold function $$\varDelta _k$$. Also note that ARES is an approximation algorithm. As a consequence, it might return nonminimal plans. Even in these circumstances, however, the plans will still lead to an optimal state. This is a V-formation in our flocking example.

## 7 Experimental Results

To assess the performance of our approach, we developed a simple simulation environment in Matlab. All experiments were run on an Intel Core i7-5820K CPU with 3.30 GHz and with 32 GB RAM available.

We performed numerous experiments with a varying number of birds. Unless stated otherwise, results refer to 8,000 experiments with 7 birds with the following parameters: $${p}_{start}\,{=}\,10$$, $${p}_{inc}\,{=}\,5$$, $${p}_{max}\,{=}\,40$$, $$\ell _{max}\,{=}\,20$$, $${h}_{max}\,{=}\,5$$, $$\varphi \,{=}\,10^{-3}$$, and $$n\,{=}\,20$$. The initial configurations were generated independently uniformly at random subject to the following constraints:

1. Position constraints: $$\forall \,i\,{\in }\,\{1,{\ldots },7\}.\,{\varvec{x}}_i(0)\in [0,3]\times [0,3]$$.

2. Velocity constraints: $$\forall \,i\,{\in }\,\{1,{\ldots },7\}.\,{\varvec{v}}_i(0)\in [0.25,0.75]\times [0.25,0.75]$$.

Table 1 gives an overview of the results with respect to the 8,000 experiments we performed with 7 birds for a maximum of 20 levels. The average fitness across all experiments is at 0.0282 with a standard deviation of 0.1654. We achieved a success rate of $$94.66\%$$ with fitness threshold $$\varphi =10^{-3}$$. The average fitness is higher than the threshold due to the comparatively high fitness of unsuccessful experiments. When increasing the bound for the maximal plan length m to 30 we achieved a $$98.4\%$$ success rate in 1,000 experiments at the expense of a slightly longer average execution time.

The left plot in Fig. 5 depicts the resulting distribution of execution times for 8,000 runs of our algorithm, where it is clear that, excluding only a few outliers from the histogram, an arbitrary configuration of birds (Fig. 4 (left)) reaches V-formation (Fig. 4 (right)) in around 1 min. The execution time rises with the number of birds as shown in Table 2.

In Fig. 5, we illustrate for how many experiments the algorithm had to increase RPH h (Fig. 5 (middle)) and the number of particles used by PSO p (Fig. 5 (right)) to improve time and space exploration, respectively.

After achieving such a high success rate of ARES for an arbitrary initial configuration, we would like to demonstrate that the number of experiments performed is sufficient for high confidence in our results. This requires us to determine the appropriate number N of random variables $$Z_1, \ldots , Z_N$$ necessary for the Monte-Carlo approximation scheme we apply to assess the efficiency of our approach. For this purpose, we use the additive approximation algorithm as discussed in [16]. If the sample mean $$\mu _Z\,{=}\,(Z_1\,{+}\,{\ldots }\,{+}\,Z_N)/N$$ is expected to be large, then one can exploit Bernstein’s inequality and fix N to $$\Upsilon \,{\propto }\,\ln (1/\delta )/\varepsilon ^2$$. This results in an additive or absolute-error $$(\varepsilon ,\delta )$$-approximation scheme:

\begin{aligned} \mathbf{P}[\mu _Z\,{-}\,\varepsilon \le \widetilde{\mu }_Z\le \mu _Z\,{+}\,\varepsilon ]\ge {}1-\delta , \end{aligned}

where $$\widetilde{\mu }_Z$$ approximates $$\mu _Z$$ with absolute error $$\varepsilon$$ and probability $$1-\delta$$.

In particular, we are interested in Z being a Bernoulli random variable:

\begin{aligned} Z=\left\{ \begin{array}{ll} 1, &{} \text {if}\,\,J(\varvec{c}(t),{\varvec{a}}(t),{h}(t))\leqslant \varphi ,\\ 0, &{} \text {otherwise}. \end{array}\right. \end{aligned}

Therefore, we can use the Chernoff-Hoeffding instantiation of Bernstein’s inequality, and further fix the proportionality constant to $$\Upsilon \,{=}\,4\,\ln (2/\delta )/\varepsilon ^2$$, as in [19]. Hence, for the 8,000 experiments we performed, we achieve a success rate of 95% with absolute error of $$\varepsilon = 0.05$$ and confidence ratio 0.99.

Moreover, considering that the average length of a plan is 13, and that each state in a plan is independent from all other plans, we can roughly consider that our above estimation generated 80,000 independent states. For the same confidence ratio of 0.99 we then obtain an approximation error $$\varepsilon \,{=}\,0.016$$, and for a confidence ratio of 0.999, we obtain an approximation error $$\varepsilon \,{=}\,0.019$$.
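The figures above follow from solving $$\Upsilon \,{=}\,4\,\ln (2/\delta )/\varepsilon ^2$$ either for the number of samples or for the error margin. The short sketch below, with function names of our choosing, reproduces this arithmetic.

```python
import math

def samples_needed(eps, delta):
    """Number of samples N = 4 ln(2/delta) / eps^2, rounded up."""
    return math.ceil(4 * math.log(2 / delta) / eps ** 2)

def error_margin(n, delta):
    """Absolute error eps obtained from n samples at confidence 1 - delta."""
    return math.sqrt(4 * math.log(2 / delta) / n)

eps_runs = error_margin(8_000, 0.01)          # per-run estimate, about 0.051
eps_states_99 = error_margin(80_000, 0.01)    # about 0.016
eps_states_999 = error_margin(80_000, 0.001)  # about 0.019
```

With 8,000 samples and $$\delta \,{=}\,0.01$$ the margin evaluates to about 0.051, which is the value reported above as 0.05 after rounding; the two values for 80,000 states match the reported 0.016 and 0.019.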

## 8 Related Work

Organized flight in flocks of birds can be categorized into cluster flocking and line formation [18]. In cluster flocking the individual birds in a large flock seem to be uncoordinated in general. However, the flock moves, turns, and wheels as if it were one organism. In 1987 Reynolds [27] defined his three famous rules describing separation, alignment, and cohesion for individual birds in order to have them flock together. This work has been a great inspiration for research in the area of collective behavior and self-organization.

In contrast, line formation flight requires the individual birds to fly in a very specific formation. Line formation has two main benefits for long-distance migrating birds. First, by exploiting the uplift generated by birds flying in front, trailing birds are able to conserve energy [9, 24, 34]. Second, in a staggered formation, all birds have a clear view in front as well as a view of their neighbors [1]. While there has been considerable effort devoted to keeping a certain formation for multiple entities when traveling together [10, 14, 30], little work deals with the task of achieving this extremely important formation from a random starting configuration [6]. The convergence of bird flocking into V-formation has also been analyzed with the use of combinatorial techniques [7].

In contrast to previous work, [5] addresses this question without using any behavioral rules, treating it instead as a problem of optimal control. In [35] a cost function was proposed that reflects all major features of V-formation, namely, Clear View (CV), Velocity Matching (VM), and Upwash Benefit (UB). The technique of MPC is used to achieve V-formation starting from an arbitrary initial configuration of n birds. MPC solves the task by minimizing a functional defined as the squared distance from the optimal values of CV, VM, and UB, subject to constraints on input and output. The approach is to choose an optimal velocity adjustment, as a control input, at each time-step applied to the velocity of each bird by predicting model behavior several time-steps ahead.

The controller-synthesis problem has been widely studied [33]. The most popular and natural technique is Dynamic Programming (DP) [4], which improves the approximation of the functional at each iteration, eventually converging to the optimal one given a fixed asymptotic error. DP considers all possible states of the system and may therefore suffer from state-space explosion in the presence of environmental uncertainties; in contrast, approximate algorithms [2, 3, 17, 25, 31, 32] take into account only the paths leading to the desired target. One of the most efficient of these is Particle Swarm Optimization (PSO) [23], which was adopted in [35] for finding the next best step of MPC. Although PSO is a very powerful optimization technique, it has not yet achieved a high success rate on the flocking problem considered here. Sequential Monte-Carlo methods, in particular Importance Splitting (IS) [22], have proved efficient in tackling the control of linear stochastic systems [8]. The approach we propose is, however, the first attempt to combine adaptive IS, PSO, and receding-horizon techniques for the synthesis of optimal plans for controllable systems. We use MPC to synthesize a plan, but use IS to determine the intermediate fitness-based waypoints. We use PSO to solve the multi-step optimization problem generated by MPC, but choose the planning horizon and the number of particles adaptively. These choices are governed by the difficulty of reaching the next level.
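For readers unfamiliar with PSO, the following is a minimal sketch of the standard global-best variant, not the implementation used in this work. The function name `pso_minimize`, the inertia and acceleration coefficients, the box bounds, and the fixed seed are all illustrative assumptions. In a receding-horizon setting, the decision vector would hold the accelerations of all birds over the whole horizon, and `cost` would evaluate the resulting state.

```python
import random

def pso_minimize(cost, dim, n_particles=20, n_iters=200,
                 bounds=(-5.0, 5.0), w=0.7298, c1=1.4962, c2=1.4962, seed=0):
    """Global-best PSO over a box; returns (best_position, best_cost)."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]              # personal best positions
    pbest_cost = [cost(p) for p in pos]
    g = min(range(n_particles), key=pbest_cost.__getitem__)
    gbest, gbest_cost = pbest[g][:], pbest_cost[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # inertia + pull toward personal best + pull toward global best
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # move and clamp to the search box
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            c = cost(pos[i])
            if c < pbest_cost[i]:
                pbest[i], pbest_cost[i] = pos[i][:], c
                if c < gbest_cost:
                    gbest, gbest_cost = pos[i][:], c
    return gbest, gbest_cost
```

A call such as `pso_minimize(lambda x: sum(v * v for v in x), dim=2)` searches for the minimum of a simple quadratic bowl. The number of particles and the dimension of the decision vector (which grows with the horizon) are exactly the two quantities that the adaptive scheme varies.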

## 9 Conclusion and Future Work

In this paper, we have presented ARES, a very general adaptive, receding-horizon synthesis algorithm for MDP-based optimal plans. Additionally, ARES can be readily converted into a model-predictive controller with an adaptive receding horizon and statistical guarantees of convergence. We also conducted a very thorough performance analysis of ARES based on the problem of V-formation in a flock of birds. For flocks of 7 birds, with high confidence ARES is able to generate an optimal plan leading to a V-formation in 95% of the 8,000 random initial configurations we considered, with an average execution time of only 63 s per plan.

The execution time of the ARES algorithm can be improved even further. First, we currently do not parallelize our implementation of the PSO algorithm. Recent work [20, 29, 37] has shown that Graphics Processing Units (GPUs) are very efficient at accelerating PSO computation. Modern GPUs, by providing thousands of cores, are well-suited for implementing PSO, as they enable the execution of a very large number of particles in parallel. Together with the parallelization of the fitness function calculation, this should significantly speed up our simulations and improve the accuracy of the optimization procedure.

Second, we are currently using a static approach to decide how to increase our prediction horizon and the number of particles used in PSO. Specifically, we first increase the prediction horizon from 1 to 5, while keeping the number of particles unchanged at 10; if this fails to find a solution with fitness $$\widehat{J_1}$$ satisfying $$\ell _{i-1}\,{-}\,\widehat{J_1}>\varDelta _1$$, we then increase the number of particles by 5. Based on our results, we speculate that in the initial stages, increasing the prediction horizon is more beneficial (leading rapidly to the appearance of cost-effective formations), whereas in the later stages, increasing the number of particles is more helpful. As future work, we will use machine-learning approaches to decide on the values of the above parameters at runtime, given the current level and state of the MDP, and we will study the impact of different level decompositions. Moreover, in our approach, we calculate the number of clones for resampling based on the current state. An alternative approach would rely on statistics built up over multiple levels, along with the rank in the sorted list, to choose configurations for resampling.
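The static schedule just described can be sketched as a nested loop. This is an illustration under stated assumptions rather than the ARES implementation: `optimize(horizon, particles)` is a hypothetical callback standing in for one PSO run that returns the best fitness found, the horizon sweep is assumed to restart from 1 after each particle increase, and the cap `max_particles` is an assumption, since the text does not state an upper limit.

```python
from typing import Callable, Optional, Tuple

def static_schedule(
    optimize: Callable[[int, int], float],  # hypothetical: one PSO run -> best fitness
    prev_level: float,                      # level reached so far (ell_{i-1})
    delta: float,                           # required improvement (Delta_1)
    max_horizon: int = 5,
    start_particles: int = 10,
    particle_step: int = 5,
    max_particles: int = 40,                # assumed cap, not stated in the text
) -> Optional[Tuple[int, int, float]]:
    """Sweep the horizon 1..max_horizon first; only then add particles."""
    particles = start_particles
    while particles <= max_particles:
        for horizon in range(1, max_horizon + 1):
            best = optimize(horizon, particles)
            # accept as soon as the improvement over the previous level exceeds delta
            if prev_level - best > delta:
                return horizon, particles, best
        particles += particle_step
    return None  # no setting within the caps reached the next level
```

A learned policy of the kind proposed as future work would replace the fixed loop order with a choice of `horizon` and `particles` that depends on the current level and state.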

Finally, we are currently using our approach to generate plans for a flock to go from an initial configuration to a final V-formation. Our eventual goal is to achieve formation flight for a robotic swarm of (bird-like) drones. A real-world example is parcel-delivering drones that follow the same route to their destinations. Letting them fly together for a while could save energy and increase flight time. To achieve this goal, we first need to investigate the wind dynamics of multi-rotor drones. Then, the fitness function needs to be adapted to the new wind dynamics. Lastly, a decentralized version of this method needs to be implemented and tested on the drone firmware, and various attack modes need to be analyzed to establish the resilience of the approach.