Abstract
Given a collection of parameterized multirobot controllers associated with individual behaviors designed for particular tasks, this paper considers the problem of how to sequence and instantiate the behaviors for the purpose of completing a more complex, overarching mission. In addition, uncertainties about the environment or even the mission specifications may require the robots to learn, in a cooperative manner, how best to sequence the behaviors. In this paper, we approach this problem by using reinforcement learning to approximate the solution to the computationally intractable sequencing problem, combined with an online gradient descent approach to selecting the individual behavior parameters, while the transitions among behaviors are triggered automatically when the behaviors have reached a desired performance level relative to a task performance cost. To illustrate the effectiveness of the proposed method, it is implemented on a team of differential-drive robots for solving two different missions, namely, convoy protection and object manipulation.
Introduction
In the multirobot literature, significant contributions have been made towards the design of distributed controllers that achieve particular, targeted objectives or tasks in a coordinated manner, such as meeting at a common location, assembling a particular geometry, covering an area, or patrolling a perimeter, e.g., [1,2,3,4]. However, more complex tasks typically require the robots to be able to execute coordinated motions that go beyond what any one such targeted controller, or behavior, could achieve, as was illustrated in [5] for a complex scenario whereby a team of robots was asked to search for and subsequently secure a building. To add to the complication, oftentimes not everything about the environment (or even the task itself) is fully known a priori. For example, consider a team of robots tasked with moving a box between two points without previous knowledge of the box’s physical properties, such as its mass distribution, geometry, or ground friction characteristics. Despite that, it is expected that the robots should be able to push or lift the box towards its destination through a potentially unknown, dynamic, and cluttered environment, e.g., [6].
In behavior-based robotics, the idea is to let individual, dedicated behaviors be responsible for achieving particular tasks and, by combining these behaviors through some arbitration mechanism, more complex missions can be executed, [7]. Following this general approach, with the arbitration mechanism being a winner-takes-all policy, [8], where a single behavior is active at any given time, one can sequentially combine multirobot, task-centric controllers (or behaviors); see, for example, [9, 10] for such sequences of multirobot behaviors. In addition, to operate successfully in dynamic and unknown environments, one can envision these sequences being subject to machine learning in general, and reinforcement learning in particular, e.g., [11, 12], whereby the system interacts with the environment in a structured and adaptive manner in order to improve the behavior sequencing over time.
The primary focus of this paper is the problem of selecting suitable sequences of coordinated behaviors, together with their parametric instantiations, in order to carry out missions which could not be handled by any individual behavior. In addition, when complete information is unavailable, the robots must cooperatively learn which behaviors to use in any given circumstance through their interactions with the environment. The main contribution of the paper is the introduction of a novel reinforcement-learning-based method which combines Q-learning with online gradient descent to select both the appropriate behavior sequences and the optimal parameters associated with the constituent behaviors. To illustrate the effectiveness of the proposed approach, it is implemented on a team of differential-drive robots for solving two different, canonical multirobot tasks, namely, convoy protection and simplified object manipulation.
The paper is organized as follows. A general class of weighted-consensus coordinated behaviors is described in Sect. 2, followed by a discussion of how to formalize multirobot behavior sequences in Sect. 3. The behavior selection problem is formulated in that section as well, while the proposed method for solving this problem is presented in Sect. 4. The paper is concluded through the application of the method to two canonical multirobot problems in Sect. 5.
Background
In this section, we provide the preliminaries needed to formulate the problem under consideration, including a discussion of related work and of how to produce distributed controllers by flowing against the gradient of a pairwise performance cost.
Related work
The key idea behind behavior-based robotics, e.g., [7], is that complex behaviors can be produced through combinations of other behaviors dedicated to solving particular tasks, such as moving to a goal location or avoiding obstacles. A large number of different types of “arbitration mechanisms”, or rules for how to combine behaviors, have been proposed—ranging from elaborate voting schemes, [8], through schema-based vector summation, [13], to the idea that more critical behaviors subsume or dominate less critical ones, [14].
By following the idea of letting a single behavior be active at any one point in time, the result is a sequence of behaviors. Typically, the production of such a sequence is unproblematic for single-robot systems, but less straightforward for multiple robots, where individual behaviors may require certain types of interactions that may or may not be supported by the current multirobot configuration. Proposed techniques for overcoming such problems through the composition of coordinated controllers include formal methods [15], path planning [9], finite state machines [16], Petri nets [17], and behavior trees [18].
Additionally, once an appropriate sequence of controllers has been chosen by some mechanism, the transitions between individual controllers must be feasible as well in that the information needed to make the transition must be available to the individual agents. Solutions to this problem include the use of motion description languages [19], graph process specifications [20] and control barrier functions [5, 10].
Finally, reinforcement learning offers a paradigm for learning optimal policies in stochastic control problems based on simulation [11, 12]. In this context, a robot seeks to find an optimal policy through interacting with the unknown environment with the goal of optimizing its longterm future reward. Motivated by the broad applications of multiagent systems, for example, mobile sensor networks [21, 22] and power networks [23], there is a growing interest in studying multiagent reinforcement learning; see for example [24,25,26] and the references therein.
The goal in this paper is to see how reinforcement learning can be used to address the problem of selecting and sequencing behaviors for teams of mobile robots. However, before we can start that discussion, some background must be provided about how the individual multirobot behaviors themselves are generated.
Multirobot systems
One common approach when designing multirobot controllers for performing particular tasks is to define local control rules through the use of a performance cost, as discussed in [2]. If this cost structurally respects the information flow in the network, its gradient inherits the same structural properties. As such, a negative gradient flow solves the task at hand (provided that the performance cost has been appropriately selected). Weighted consensus protocols can be generated in this manner. See for example [2] and references therein. One advantage of formulating the multiagent control problem in terms of such taskspecific controllers is that provable performance guarantees can be established in a systematic manner, [27].
To see how this construction works, consider a team of N robots operating in a 2-dimensional domain, where we denote by \(x_{i}\in {\mathbb {R}}^2\) the state of robot i, for \(i=1,\dots ,N\). In addition, the dynamics of the robots are given by single integrators,
$$\begin{aligned} \dot{x}_{i} = u_{i}, \quad i=1,\dots ,N, \end{aligned}$$ (1)
where \(u_{i}\) is the controller of robot i, which may be a function of \(x_i\) and the states of the robots interacting with robot i. The pattern of interactions between the robots is represented by an undirected graph \(G = (V,E)\), where \(V = \{1,\ldots ,N\}\) and \(E \subseteq V\times V\) are the index set and the set of pairwise interactions between the robots, respectively. Moreover, let \(N_i = \{j\in V : (i,j)\in E\}\) be the neighborhood set of robot i.
For the purpose of this paper, and following the general construction from [2], we let the controller \(u_i\) for robot i be composed of two components: one that only depends on the robot’s own state and one that captures its interactions with neighboring robots. In particular, the controller \(u_i: {\mathbb {R}}^{2+2|N_i|} \mapsto {\mathbb {R}}^2\) in (1) is given as
$$\begin{aligned} u_i = \sum _{j\in N_i} w(x_i,x_j,\theta )(x_j-x_i) + v(x_i,\phi ), \end{aligned}$$ (2)
where \(w: {\mathbb {R}}^2\times {\mathbb {R}}^2 \times {\varTheta} \rightarrow {\mathbb {R}}\), often referred to as an edge weight function [28], depends on the states of robot i and its neighbors, and on the parameter \(\theta \in {\varTheta}\). Additionally, \(v: {\mathbb {R}}^2 \times {\varPhi } \rightarrow {\mathbb {R}}^2\) is the state-feedback term for robot i, which depends only on the individual state \(x_i\) and a parameter \(\phi \in { \varPhi }\), representing what robot i would be doing in the absence of any other robots, which we informally refer to as its “preference”. Here, \({\varTheta}\) and \({\varPhi}\) are the feasible sets of the parameters \(\theta\) and \(\phi\), respectively, belonging to some linear vector spaces. A concrete example of such a controller together with the associated parameters will be given in the next section.
As discussed in [2], one can define an appropriate energy function \({\mathcal {E}}: {\mathbb {R}}^{2N} \mapsto {\mathbb {R}}_{\geqslant 0}\) with respect to the graph G, such that the controller in (2) can be described as the negative gradient of \({\mathcal {E}}\), i.e.,
$$\begin{aligned} u_i = -\frac{\partial {\mathcal {E}}(x)}{\partial x_i}, \quad i=1,\dots ,N. \end{aligned}$$ (3)
This particular observation will prove useful for subsequent developments.
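As a concrete illustration of this construction (an illustrative choice on our part, anticipating the formation-type behaviors used later in the paper, rather than a definition taken from [2]), consider the energy function
$$\begin{aligned} {\mathcal {E}}(x) = \frac{1}{4}\sum _{\{i,j\}\in E}\big (\Vert x_i-x_j\Vert ^2-\theta _{ij}^2\big )^2, \end{aligned}$$
where the sum runs over the undirected edges of G and \(\theta _{ij}\) is a desired inter-robot distance. Differentiating with respect to \(x_i\) and negating gives
$$\begin{aligned} -\frac{\partial {\mathcal {E}}(x)}{\partial x_i} = \sum _{j\in N_i}\big (\Vert x_i-x_j\Vert ^2-\theta _{ij}^2\big )(x_j-x_i), \end{aligned}$$
which is exactly of the form (2) with edge-weight function \(w(x_i,x_j,\theta )=\Vert x_i-x_j\Vert ^2-\theta _{ij}^2\) and \(v\equiv 0\), i.e., a formation-keeping gradient flow.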
Coordinated behaviors
Behavior sequences
Given a collection of behaviors, the main problem under consideration in this paper is that of optimally selecting the sequence of such behaviors that best completes a given mission.
Definition 1
A coordinated behavior \({\mathcal {B}}\) is defined by the 5-tuple
$$\begin{aligned} {\mathcal {B}} = (w, {\varTheta}, v, {\varPhi}, G), \end{aligned}$$ (4)
where \({\varTheta}\) and \({\varPhi}\) are feasible sets for the parameters of the controller in (2). Moreover, G is the graph representing the interaction structure between the robots.
Given M distinct behaviors, we compactly represent them as a library of behaviors
$$\begin{aligned} {\mathcal {L}} = \{{\mathcal {B}}_{1},\ldots ,{\mathcal {B}}_{M}\}, \end{aligned}$$
where each behavior \({\mathcal {B}}_{k}\) is defined as in (4), i.e.,
$$\begin{aligned} {\mathcal {B}}_{k} = (w_{k}, {\varTheta}_{k}, v_{k}, {\varPhi}_{k}, G_{k}), \quad k=1,\ldots ,M. \end{aligned}$$
Here, note that the feasible sets \({\varTheta}\) and \({\varPhi}\), and the graph G may be different for different behaviors; that is, in switching between different behaviors the communication graphs of the robots may be time-varying. Moreover, based on Definition 1, it is important to note the difference between a behavior and a controller. The controller (2) executed by the robots for a given behavior is obtained by selecting a proper pair of parameters \((\theta ,\phi )\) from the sets \({\varTheta}\) and \({\varPhi}\). Indeed, consider a behavior \({\mathcal {B}}\) and let \(x_t = \big[x_{1,t}^{{\text {T}}}, \dots , x_{N,t}^{\mathrm{T}}\big]^{\mathrm{T}} \in {\mathbb {R}}^{2N}\) be the ensemble state of the robots at time t. In addition, let \(u_{\mathcal {B}}(x_t,\theta ,\phi )\), where \(u_{\mathcal {B}} = \big[u_{1}^{\mathrm{T}}, \dots , u_{N}^{\mathrm{T}}\big]^{\mathrm{T}} \in {\mathbb {R}}^{2N}\), be the stacked controllers of the robots defined in (2) for a feasible pair of parameters \((\theta ,\phi )\). The ensemble dynamics of the robots associated with \({\mathcal {B}}\) are then given as
$$\begin{aligned} \dot{x}_{t} = u_{{\mathcal {B}}}(x_t,\theta ,\phi ). \end{aligned}$$
To further illustrate the difference between a behavior and its associated controller, we consider the following formation control example.
Example 1
Consider the formation control problem over a network of 4 robots moving in a plane, as illustrated in Fig. 1, where the desired inter-robot distances are given by a vector \(\theta =\{ \theta _{1},\dots ,\theta _{5}\}\), with \(\theta _{i}\in {\mathbb {R}}_+\). Here, robot 1 acts as a leader and moves toward the goal \(\phi \in {\mathbb {R}}^2\). It should be noted that the desired inter-robot distances also imply something about the interaction structure between robots (graph G) in that sufficient edges must be present in the graph for the formation to be possible, e.g., [28].
As the goal of the robots is to maintain the desired formation while moving to the goal location, one possible choice of the edge-weight function of the controller (2) is
$$\begin{aligned} w(x_i,x_j,\theta ) = \Vert x_i-x_j\Vert ^2-\theta _{ij}^2, \end{aligned}$$
where \(\theta _{ij}\) denotes the desired distance associated with the edge between robots i and j,
while the individual robot term is given by \(v_i=0\), \(i=2,3,4\), for the followers and, for the leader, by
$$\begin{aligned} v_1(x_1,\phi ) = \phi - x_1. \end{aligned}$$
In this example, \({\varPhi}\) is simply a subset of \({\mathbb {R}}^{2}\) while \({\varTheta}\) is a set of geometrically feasible distances. Thus, given the formation control behavior \({\mathcal {B}}=(w, {\varTheta}, v, {\varPhi}, G)\), the controllers \(u_{\mathcal {B}}(x,\theta ,\phi )\) can be directly derived from (2).
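For readers who prefer code, the following is a minimal Python sketch of the controller (2) for Example 1 under the single-integrator dynamics (1); the graph, desired distances, goal location, and integration step are illustrative placeholders rather than values from the paper.

```python
import numpy as np

def formation_with_leader(x, edges, dist, phi, leader=0):
    """Evaluate the controller (2) for the formation behavior of Example 1.

    x      : (N, 2) array of robot positions
    edges  : list of undirected edges (i, j) of the graph G
    dist   : dict mapping each edge to its desired length theta_ij
    phi    : (2,) goal position of the leader
    leader : index of the leader robot
    """
    u = np.zeros_like(x)
    for (i, j) in edges:
        # edge-weight function w(x_i, x_j, theta) = ||x_i - x_j||^2 - theta_ij^2
        w = np.sum((x[i] - x[j]) ** 2) - dist[(i, j)] ** 2
        u[i] += w * (x[j] - x[i])
        u[j] += w * (x[i] - x[j])
    u[leader] += phi - x[leader]      # leader's preference term v = phi - x_leader
    return u

# one Euler step of the single-integrator dynamics (1) for 4 robots
x = np.random.rand(4, 2)
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
dist = {e: 1.0 for e in edges}
dist[(0, 2)] = np.sqrt(2.0)           # diagonal of a unit square (hypothetical shape)
x = x + 0.01 * formation_with_leader(x, edges, dist, phi=np.array([2.0, 2.0]))
```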
We conclude this section with some additional comments about the formation control problem described in the previous example, where one can choose a single behavior \({\mathcal {B}}\in {\mathcal {L}}\) together with a pair of parameters \((\theta ,\phi )\) for solving the problem, e.g., [28]. This controller, however, is designed under the assumption that the environment is static and known, i.e., the target \(\phi\) in Example 1 is fixed and known by the robots. Such an assumption is often unrealistic since, in many applications, the robots operate in dynamically evolving and potentially unknown environments; for example, \(\phi\) may be time-varying and unknown. On the other hand, while the formation control problem can be solved using a single behavior, many practical, complex tasks require the robots to support more than one behavior [9, 10]. Our interest, therefore, is in the problem of selecting a sequence of the behaviors in \({\mathcal {L}}\), while allowing the state of the environment to be unknown and possibly time-varying.
Optimal behavior selection problems
In this section, we present the problem of optimal behavior selection over a network of robots through the lens of reinforcement learning. In particular, consider a team of N robots cooperatively operating in an unknown environment, whose goal is to complete a given mission over a time interval \([0,t_{f}]\).
Let \(x_{t}\) and \(e_{t}\) be the states of the robots and of the environment at time \(t\in [0,t_{f}]\), respectively. At any time t, the robots first observe the state of the environment \(e_{t}\), select a behavior \({\mathcal {B}}_{t}\) from the library \({\mathcal {L}}\), compute the pair of parameters \((\theta _{t},\phi _{t})\) associated with \({\mathcal {B}}_{t}\), and execute the resulting controller \(u_{{\mathcal {B}}}(x_{t},\theta _{t},\phi _{t})\). As a result of the robot actions, as well as the possibly dynamic nature of the environment, the environment state updates to a new value \(e'_{t}\) over a short sample time, and the robots receive a reward returned by the environment based on the selected behavior and tuning parameters.
We here assume that these rewards are appropriately designed in that they encode the given mission, which is motivated by the usual consideration in the literature of reinforcement learning [12]. That is, solving the task is equivalent to maximizing the total accumulated rewards received by the robots. In Sect. 5, we provide a few examples of how to design such reward functions for particular applications. It is worth pointing out that designing these reward functions is itself challenging and requires sufficient knowledge about the underlying problem, as observed in [12].
One could try to solve the optimal behavior selection problem using existing reinforcement learning techniques. However, this problem is in general intractable since the state space is infinite, i.e., \(x_{t}\) and \(e_{t}\) are continuous variables. Moreover, due to the physical constraints of the robots, it is infeasible (and certainly impractical) for the robots to switch to a new behavior at every discrete time instant. That is, the robots require a finite amount of time to implement the controller of the selected behavior. Thus, to circumvent these issues we next consider an alternate version of the behavior selection problem.
Inspired by the work in [29], we introduce an interrupt condition \(\xi : {\mathcal {E}} \mapsto \{0,1\}\), where \({\mathcal {E}}\) is the “energy” in the network, which in turn is a measure of how well the individual task (not the complex mission) is being performed, as was the case in (3) when a negative gradient controller was produced. If \({\mathcal {E}}_t\) is the value of \({\mathcal {E}}\) at time t, then the interrupt condition is given by
$$\begin{aligned} \xi ({\mathcal {E}}_t)=1 \quad \text {if}\ {\mathcal {E}}_t \leqslant \varepsilon , \end{aligned}$$
and \(\xi ({\mathcal {E}}_t)=0\) otherwise. Here, \(\varepsilon\) is a small positive threshold. In other words, \(\xi ({\mathcal {E}}_t)\) is a binary trigger taking the value 1 whenever the network energy for a certain behavior at time t is smaller than the threshold. Or, phrased slightly differently, the interrupt condition triggers when the individual task for which the controller was designed has nearly been completed. Thus, it is reasonable to insist that the robots should not switch to a new behavior at time t unless \(\xi ({\mathcal {E}}_t) = 1\) for a given \(\varepsilon\).
Based on this observation, given a desired threshold \(\varepsilon\), let \(\tau _{k}\) be the switching time associated with behavior \({\mathcal {B}}\), defined as
$$\begin{aligned} \tau _{k} = \min \big \{t > \tau _{k-1} : \xi ({\mathcal {E}}_t)=1\big \}. \end{aligned}$$ (11)
Consequently, the mission time interval \([0,t_f]\) is partitioned by the switching times \(\tau _{0},\ldots ,\tau _{K}\) satisfying
$$\begin{aligned} 0 = \tau _{0}< \tau _{1}< \cdots < \tau _{K} = t_{f}, \end{aligned}$$
where each switching time, except \(\tau _0\) and \(\tau _K\), is defined as in (11). Note that the number of switching times, K, depends on the accuracy \(\varepsilon\). In this paper, we do not consider the problem of how to select the appropriate threshold, \(\varepsilon\).
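In code, the role of the interrupt condition can be sketched as follows; this is a schematic only, in which the controller, the energy function, the threshold, and the step size are placeholders supplied by the user.

```python
import numpy as np

def run_until_interrupt(x, controller, energy, eps=1e-2, dt=0.01, t_max=30.0):
    """Integrate the ensemble dynamics under one behavior until xi(E_t) = 1.

    Returns the final ensemble state and the elapsed time, i.e., the length
    of the interval [tau_i, tau_{i+1}) during which the behavior was active.
    """
    t = 0.0
    while t < t_max and energy(x) > eps:   # interrupt condition not yet triggered
        x = x + dt * controller(x)          # Euler step of the ensemble dynamics
        t += dt
    return x, t
```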
We are now ready to describe, at a high level, how the behavior selection mechanism should operate. At each switching time \(\tau _{i}\), the robots choose a behavior \({\mathcal {B}}_{i}\in {\mathcal {L}}\) based on their current states, \(x_{\tau _{i}}\), and the environment state, \(e_{\tau _{i}}\). Next, they decide on a pair of parameters \((\theta _i,\phi _i)\) and implement the underlying controller \(u_{{\mathcal {B}}_i}(x_{t},\theta _i,\phi _i)\) for \(t\in [\tau _{i},\tau _{i+1})\). Based on the selected behaviors and parameters, the robots receive an instantaneous reward \({\mathcal {R}}(x_{\tau _{i}},e_{\tau _{i}},{\mathcal {B}}_{i},\theta _i,\phi _i)\) returned by the environment as a function of the selection.
Let J be the accumulated reward received by the robots at the switching times in \([0,t_{f}]\), i.e.,
$$\begin{aligned} J = \sum _{i=0}^{K} {\mathcal {R}}(x_{\tau _{i}},e_{\tau _{i}},{\mathcal {B}}_{i},\theta _i,\phi _i). \end{aligned}$$
As mentioned previously, the optimal behavior selection problem is cast as the problem of finding a sequence of behaviors \(\{{\mathcal {B}}_{i}\}\) from \({\mathcal {L}}\) at \(\{\tau _{i}\}\), and the associated parameters \(\{(\theta _i,\phi _i)\}\in \{{\varTheta}_i\times { \varPhi }_i\}\), so that the accumulated reward J is maximized. This optimization problem can be formulated as follows:
$$\begin{aligned} \max _{\{{\mathcal {B}}_{i}\},\,\{(\theta _i,\phi _i)\}}\;&J\\ \mathrm {s.t.}\;\;&\dot{x}_{t} = u_{{\mathcal {B}}_{i}}(x_{t},\theta _{i},\phi _{i}), \quad t\in [\tau _{i},\tau _{i+1}),\\&e'_{t} = f_e(x_{t},e_{t}),\\&{\mathcal {B}}_{i}\in {\mathcal {L}},\ (\theta _i,\phi _i)\in {\varTheta }_i\times {\varPhi }_i,\quad i=0,\ldots ,K, \end{aligned}$$ (14)
where \(f_e: {\mathbb {R}}^{2N} \times {\mathbb {R}}^{2} \mapsto {\mathbb {R}}^{2}\) is the unknown dynamics of the environment. Since \(f_e\) is unknown, one cannot use dynamic programming to solve this problem. Thus, in the next section we propose a novel method for solving (14), which is a combination of Q-learning and online gradient descent. Moreover, by introducing the switching times \(\tau _{i}\), computing the optimal sequence of behaviors using Q-learning becomes a tractable problem.
A Q-learning approach to behavior selection
In this section, we propose a novel reinforcement-learning-based method for solving problem (14). The method combines Q-learning and online gradient descent to find an optimal sequence of behaviors \(\{{\mathcal {B}}_{i}^*\}\) and the associated parameters \(\{(\theta ^*_i,\phi ^*_i)\}\), respectively. In particular, we maintain a Q-table, whose (i, j) entry is the state-behavior value estimating the performance of behavior \({\mathcal {B}}_{j}\in {\mathcal {L}}\) given the environment state \(i\in {\mathcal {S}}\), where we, for notational simplicity, have assumed that the states of the environment at the switching times belong to a finite set \({\mathcal {S}}\), i.e., \(e_{\tau _{i}}\in {\mathcal {S}}\) for all i. Thus, one can view the Q-table as a matrix \(Q\in {\mathbb {R}}^{S\times M}\), where S is the size of \({\mathcal {S}}\) and M is the number of behaviors in \({\mathcal {L}}\). The entries of the Q-table are updated using Q-learning, while the controller parameters are updated using a continuous-time online gradient descent method. These updates are formally presented in the following algorithm.
Algorithm 1 Q-learning algorithm for optimal behavior selection and tuning. The notation \(\sim {\mathcal {U}}({\mathcal {O}})\) is used to represent variables uniformly selected from a set \({\mathcal {O}}\).
In the proposed algorithm, at each switching time \(\tau _{i}\), the robots first observe the environment state \(e_{\tau _i} = s\in {\mathcal {S}}\), and then select the behavior \({\mathcal {B}}_{m}\) corresponding to the maximum entry in the s-th row of the Q-table, with ties broken arbitrarily. Next, the robots implement the distributed controller \(u_{{\mathcal {B}}_m}\) and use online gradient descent to find the best parameters \((\theta _{m},\phi _{m})\) associated with \({\mathcal {B}}_m\). In order for such a construction to be meaningful, a cost must be established against which the gradient can be taken. To that end, we use the function \(C_{t}\), which can be thought of as the cost of implementing the controller at time t. This cost can either be chosen in advance (e.g., by equating it to \({\mathcal {E}}\) in (3)) or be returned by the environment.
Based on the selected behavior and the associated controller, the robots receive an instantaneous reward, \(r_{i}\), while the environment transitions to a new state \(s'\in {\mathcal {S}}\). Finally, the robots update the (s, m) entry of the Q-table using the update law associated with the Q-learning method, [11, 12]. It is worth noting that the Q-learning step is done in a centralized manner (either by the robots or a supervisory coordinator) since it depends on the state of the environment. Similarly, depending on the structure of the cost functions \(C_t\), the online gradient descent updates can be implemented either in a distributed or in a centralized manner.
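To make the above description concrete, the following Python sketch outlines one plausible implementation of this selection-and-tuning loop. The interfaces (env.reset, env.step, env.observe, and the behavior methods sample_parameters, interrupt, cost_gradient, controller), as well as the learning rates and the exploration schedule, are hypothetical placeholders and should not be read as the exact algorithm specified in the paper.

```python
import numpy as np

def behavior_selection_q_learning(env, behaviors, n_states, episodes=1000,
                                  alpha=0.5, gamma=0.9, eta=0.1):
    """Sketch of the selection-and-tuning loop: Q-learning over (state, behavior)
    pairs combined with gradient descent on the behavior parameters (theta, phi)."""
    Q = np.zeros((n_states, len(behaviors)))          # the Q-table
    for ep in range(episodes):
        s = env.reset()                               # environment state index in S
        done = False
        epsilon = np.exp(-ep / 100.0)                 # decaying exploration rate
        while not done:
            # epsilon-greedy selection of the next behavior B_m
            if np.random.rand() < epsilon:
                m = np.random.randint(len(behaviors))
            else:
                m = int(np.argmax(Q[s]))
            theta, phi = behaviors[m].sample_parameters()
            # run B_m until its interrupt condition triggers, updating
            # (theta, phi) by gradient descent on the running cost C_t
            while not behaviors[m].interrupt(env.x):
                g_theta, g_phi = behaviors[m].cost_gradient(env.x, theta, phi)
                theta, phi = theta - eta * g_theta, phi - eta * g_phi
                env.step(behaviors[m].controller(env.x, theta, phi))
            s_next, r, done = env.observe()           # new state and reward
            # standard Q-learning update of the (s, m) entry
            Q[s, m] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, m])
            s = s_next
    return Q
```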
Example applications
In this section we describe two applications of the proposed behavior selection technique. In both examples, we consider a team of 5 robots and a library of 5 behaviors given by

1. Static formation:
$$\begin{aligned} u_i = \textstyle \sum \limits _{j \in {\mathcal {N}}_i} \big (\Vert x_i-x_j\Vert ^2-(\theta \,\delta _{ij})^2\big )(x_j - x_i), \end{aligned}$$
where \(\delta _{ij}\) is the desired separation between robots i and j, while \(\theta \in {\mathbb {R}}\) is a shape scaling factor.

2. Formation with leader:
$$\begin{aligned} &u_i= \textstyle \sum \limits _{j \in {\mathcal {N}}_i} \big (\Vert x_i-x_j\Vert ^2-\theta _{ij}^2\big )(x_j - x_i),\\ &u_{\ell }= \textstyle \sum \limits _{j \in {\mathcal {N}}_{\ell }} \big (\Vert x_{\ell }-x_j\Vert ^2-\theta _{\ell j}^2\big )(x_j - x_{\ell }) + (\phi - x_{\ell }), \end{aligned}$$
where \(\delta _{ij}\) and \(\theta\) are defined as in the previous controller, while \(\phi \in {\varPhi} \subseteq {\mathbb {R}}^2\) is the leader’s goal. The subscript \(\ell\) denotes the leader agent’s index.

3. Cyclic pursuit:
$$\begin{aligned} u_i = \textstyle \sum \limits _{j \in {\mathcal {N}}_i} R(\theta )\,(x_j - x_i) + (\phi - x_i), \end{aligned}$$
where \(\theta = 2 r\,\sin \frac{\pi }{N}\), r is the radius of the cycle formed by the robots, and \(R(\theta )\in \mathrm{SO}(2)\) is a rotation matrix. The point \(\phi \in {\varPhi} \subseteq {\mathbb {R}}^2\) is the center of the cycle.

4. Leader-follower:
$$\begin{aligned} &u_i= \textstyle \sum \limits _{j \in {\mathcal {N}}_i} \big (\Vert x_i-x_j\Vert ^2-\theta ^2\big )(x_j - x_i), \\ &u_{\ell }= \textstyle \sum \limits _{j \in {\mathcal {N}}_{\ell }} \big (\Vert x_{\ell }-x_j\Vert ^2-\theta ^2\big )(x_j - x_{\ell }) + (\phi - x_{\ell }), \end{aligned}$$
where \(\theta\) is the separation between the agents and \(\phi \in {\varPhi} \subseteq {\mathbb {R}}^2\) is the leader’s goal.

5. Triangulation coverage:
$$\begin{aligned} u_i = \textstyle \sum \limits _{j \in {\mathcal {N}}_i} \big (\Vert x_i-x_j\Vert ^2-\theta ^2\big )(x_j - x_i), \end{aligned}$$
where \(\theta\) is the separation between the agents in the triangulation.
For all the behaviors considered above, we assume the following parameter spaces: \({\varTheta} = [0.05,1.1]\) and \({\varPhi} = [-1,1]\).
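To give a flavor of how such a library can be encoded, the sketch below represents two of the behaviors as parameterized controller functions and projects candidate parameters back onto \({\varTheta}\) and \({\varPhi}\). The neighbor-list representation of the graph, and the use of the rotation angle directly as the cyclic-pursuit parameter, are simplifying assumptions on our part.

```python
import numpy as np

THETA_RANGE = (0.05, 1.1)     # feasible set Theta
PHI_RANGE = (-1.0, 1.0)       # feasible set Phi (applied per coordinate)

def static_formation(x, neighbors, theta, delta):
    """Behavior 1: weighted-consensus formation with scale factor theta.
    delta is an (N, N) array of desired pairwise separations."""
    u = np.zeros_like(x)
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            w = np.sum((x[i] - x[j]) ** 2) - (theta * delta[i][j]) ** 2
            u[i] += w * (x[j] - x[i])
    return u

def cyclic_pursuit(x, neighbors, angle, phi):
    """Behavior 3: rotate neighbor displacements and contract toward phi.
    Note: the paper parameterizes R(theta) by the chord length
    theta = 2 r sin(pi/N); here the rotation angle is passed directly."""
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, -s], [s, c]])                  # rotation matrix in SO(2)
    u = np.zeros_like(x)
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            u[i] += R @ (x[j] - x[i])
        u[i] += phi - x[i]
    return u

def project_parameters(theta, phi):
    """Project a candidate (theta, phi) back onto Theta x Phi."""
    return float(np.clip(theta, *THETA_RANGE)), np.clip(phi, *PHI_RANGE)

library = {"static_formation": static_formation, "cyclic_pursuit": cyclic_pursuit}
```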
In both examples, we construct the state-action value function by implementing the proposed method (Algorithm 1) on the Robotarium simulator, as described in [30]. What these simulations show is that the proposed Q-learning method not only outperforms random parameter and behavior sequence selections, which is expected, but also closely matches the performance of ad hoc algorithms designed explicitly for the purpose of solving the particular problem. As such, the simulations suggest that online learning approaches, such as reinforcement learning, provide a promising avenue for optimal behavior sequencing, based on data collected in real time, for multirobot systems carrying out complex tasks in unknown and time-varying environments.
Convoy protection
First, we consider a convoy protection problem, where a team of robots must surround a moving target and maintain a robot-to-target distance equal to a constant \({ \Delta }\) at all times. Although this problem can be solved by executing a single behavior (e.g., cyclic pursuit), it allows us to compare the performance of the proposed framework against an ideal solution. The position of the target is denoted by \(z_t\) and is given by the following dynamics:
$$\begin{aligned} \dot{z}_{t} = v_{z} + \sigma , \end{aligned}$$
where \(v_z\) is a constant velocity and \(\sigma\) is a zero-mean Gaussian disturbance. In this case, the state of the environment at time step t is taken to be the separation between the robots’ centroid \({\bar{x}}=\frac{1}{N}\textstyle \sum _{i=1}^N x_i\) and the target, i.e.,
$$\begin{aligned} e_{t} = \Vert {\bar{x}} - z_{t}\Vert , \end{aligned}$$
where \(\Vert \cdot \Vert\) denotes the Euclidean norm. The reward provided by the environment at time t is composed of two terms: the first captures the proximity between the centroid and the target, while the second weights the individual robot-to-target distances. Training is executed over 1000 episodes, with an exponentially decaying \(\epsilon\)-greedy exploration policy.
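The exact reward used in the experiments is not reproduced above; the sketch below gives one plausible form consistent with the description (a centroid-proximity term plus a robot-to-target spacing term), together with a simple discretization of the environment state for the Q-table. The constants and the bin width are hypothetical.

```python
import numpy as np

DELTA = 0.3          # desired robot-to-target distance (hypothetical value)

def convoy_state(x, z):
    """Environment state e_t: separation between the robots' centroid and the target."""
    return np.linalg.norm(x.mean(axis=0) - z)

def convoy_reward(x, z, delta=DELTA):
    """A plausible two-term reward: proximity of the centroid to the target,
    plus a penalty on deviations of each robot-to-target distance from delta."""
    centroid_term = -np.linalg.norm(x.mean(axis=0) - z)
    spacing_term = -np.sum(np.abs(np.linalg.norm(x - z, axis=1) - delta))
    return centroid_term + spacing_term

def discretize(e, bin_width=0.1, n_states=20):
    """Map the continuous separation e_t to a finite state index for the Q-table."""
    return min(int(e / bin_width), n_states - 1)
```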
The plot in Fig. 2 shows the collected rewards over a trial of 50 episodes. The results from the trained model are compared against an ad hoc, ideal solution, where the cyclic-pursuit behavior is recursively executed with the parameters \(\theta\) and \(\phi\) selected so that the resulting cycle has radius \({ \Delta }\) and is centered at the target’s position. Finally, the figure also shows the rewards collected when the behaviors and parameters are selected uniformly at random.
Object manipulation
In the second example, illustrated in Fig. 3, we consider a team of robots tasked with moving an object between two points. Let \(e_t\) represent the position of the object at time t. In order to keep the focus of the experiment on the behavior selection, we assume somewhat simplified manipulation dynamics. In particular, the box maintains its position if the closest robot is farther away than a certain threshold (i.e., the object is not detected by the robots); otherwise, it moves with the robots’ centroid \({\bar{x}}\). These manipulation dynamics, which are unknown to the robots, guarantee that the task cannot be solved with a single, fixed behavior and choice of parameters. In this context, the robots receive a reward that depends on the running time until completion of an episode, weighted by a constant \(\kappa\), and on the distance \(\Vert e_t - {\bar{e}} \Vert\) between the box and its final destination \({\bar{e}}\).
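As with the previous example, the following is a hedged sketch of the simplified manipulation dynamics and of one plausible reward consistent with the description; the detection radius and the weight \(\kappa\) are hypothetical values.

```python
import numpy as np

DETECTION_RADIUS = 0.2   # hypothetical detection threshold
KAPPA = 0.05             # hypothetical weight on the running time

def box_step(e, x):
    """Simplified manipulation dynamics: the box keeps its position unless a
    robot is within the detection radius, in which case it moves with the
    robots' centroid."""
    if np.min(np.linalg.norm(x - e, axis=1)) > DETECTION_RADIUS:
        return e
    return x.mean(axis=0)

def manipulation_reward(e, e_goal, t, kappa=KAPPA):
    """A plausible reward penalizing both the elapsed episode time and the
    distance between the box and its final destination."""
    return -kappa * t - np.linalg.norm(e - e_goal)
```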
In Fig. 3, the robots are executing the learned policy on the object manipulation task. As can be seen, the robots first execute a leader-follower behavior in order to explore the environment and bring the target within detection distance. After the object is collected, the robots arrange themselves into a formation and collectively transport the object to the desired destination (circle) by executing a cyclic pursuit. Finally, in Fig. 4, we display a comparison between the cumulative reward accrued by the robots over 50 executions of the object collection task when following two distinct policies. In the first case, the robots follow the learned strategy, while, in the second case, the behaviors and corresponding parameters are selected from a uniform distribution at time intervals drawn from a Poisson distribution. As can be seen, the learned policy is roughly twice as effective as the random one.
Conclusions
In this paper, we present a reinforcement-learning-based approach for solving the optimal behavior selection problem, where the robots interact with an unknown environment. Given a finite library of behaviors, the proposed technique exploits rewards collected through interaction with the environment to solve a given task that could not be solved by any single behavior. We furthermore provide numerical experiments on a network of robots to illustrate the effectiveness of the proposed method.
References
Antonelli, G. (2013). Interconnected dynamic systems: An overview on distributed control. IEEE Control Systems Magazine, 33(1), 76–88.
Cortés, J., & Egerstedt, M. (2017). Coordinated control of multirobot systems: A survey. SICE Journal of Control, Measurement, and System Integration, 10(6), 495–503.
Oh, K. K., Park, M. C., & Ahn, H. S. (2015). A survey of multiagent formation control. Automatica, 53, 424–440.
Schwager, M., Rus, D., & Slotine, J. J. (2011). Unifying geometric, probabilistic, and potential field approaches to multirobot deployment. International Journal of Robotics Research, 30(3), 371–383.
Li, A., Wang, L., Pierpaoli, P., & Egerstedt, M. (2018). Formally correct composition of coordinated behaviors using control barrier certificates. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3723–3729. Madrid, Spain.
Berman, S., Lindsey, Q., Sakar, M. S., Kumar, V., & Pratt, S. C. (2011). Experimental study and modeling of group retrieval in ants as an approach to collective transport in swarm robotic systems. Proceedings of the IEEE, 99(9), 1470–1481.
Arkin, R. C. (1998). Behavior-based robotics. Cambridge: MIT Press.
Rosenblatt, J. K. (1997). DAMN: A distributed architecture for mobile navigation. Journal of Experimental & Theoretical Artificial Intelligence, 9(2–3), 339–360.
Nagavalli, S., Chakraborty, N., & Sycara, K. (2017). Automated sequencing of swarm behaviors for supervisory control of robotic swarms. In IEEE International Conference on Robotics and Automation (ICRA), pp. 2674–2681. Singapore.
Pierpaoli, P., Li, A., Srinivasan, M., Cai, X., Coogan, S., & Egerstedt, M. (2019). A sequential composition framework for coordinating multirobot behaviors. arXiv preprint. arXiv:1907.07718.
Bertsekas, D. P. (2019). Reinforcement learning and optimal control. Belmont: Athena Scientific.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction (2nd ed.). Cambridge: MIT Press.
Arbib, M. A. (1992). Schema theory. The Encyclopedia of Artificial Intelligence, 2, 1427–1443.
Brooks, R. A. (1991). Intelligence without representation. Artificial Intelligence, 47(1–3), 139–159.
KressGazit, H., Lahijanian, M., & Raman, V. (2018). Synthesis for robots: Guarantees and feedback for robot behavior. Annual Review of Control, Robotics, and Autonomous Systems, 1, 211–236.
Marino, A., Parker, L., Antonelli, G., & Caccavale, F. (2009). Behavioral control for multirobot perimeter patrol: A finite state automata approach. In IEEE International Conference on Robotics and Automation, pp. 831–836. Kobe, Japan.
Klavins, E., & Koditschek, D. E. (2000). A formalism for the composition of concurrent robot behaviors. In IEEE International Conference on Robotics and Automation, pp. 3395–3402. San Francisco, CA, USA.
Colledanchise, M., & Ögren, P. (2014). How behavior trees modularize robustness and safety in hybrid systems. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1482–1488. Chicago, IL, USA.
Martin, P., & Egerstedt, M. B. (2012). Hybrid systems tools for compiling controllers for cyberphysical systems. Discrete Event Dynamic Systems, 22(1), 101–119.
Twu, P., Martin, P., & Egerstedt, M. (2010). Graph process specifications for hybrid networked systems. IFAC Proceedings Volumes, 43(12), 65–70.
Cortes, J., Martinez, S., Karatas, T., & Bullo, F. (2004). Coverage control for mobile sensing networks. IEEE Transactions on Robotics and Automation, 20(2), 243–255.
Ogren, P., Fiorelli, E., & Leonard, N. E. (2004). Cooperative control of mobile sensor networks: Adaptive gradient climbing in a distributed environment. IEEE Transactions on Automatic Control, 49(8), 1292–1302.
Kar, S., Moura, J. M. F., & Poor, H. V. (2013). \(QD\)-learning: A collaborative distributed strategy for multiagent reinforcement learning through consensus + innovations. IEEE Transactions on Signal Processing, 61, 1848–1862.
Doan, T. T., Maguluri, S. T., & Romberg, J. (2019). Finitetime analysis of distributed TD(0) with linear function approximation for multiagent reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, pp. 1626–1635. Long Beach, CA, USA.
Wai, H. T., Yang, Z., Wang, Z., & Hong, M. (2018). Multiagent reinforcement learning via double averaging primaldual optimization. In Annual Conference on Neural Information Processing Systems. Montreal, Canada.
Zhang, K., Yang, Z., & Basar, T. (2018). Networked multiagent reinforcement learning in continuous spaces. In IEEE Conference on Decision and Control (CDC), pp. 2771–2776. Miami Beach, FL, USA.
Zelazo, D., Mesbahi, M., & Belabbas, M. A. (2018). Graph theory in systems and controls. In IEEE Conference on Decision and Control (CDC), pp. 6168–6179. Miami Beach, FL, USA.
Mesbahi, M., & Egerstedt, M. (2010). Graph theoretic methods in multiagent networks (Vol. 33). Princeton: Princeton University Press.
Mehta, T. R., & Egerstedt, M. (2006). An optimal control approach to mode generation in hybrid systems. Nonlinear Analysis: Theory, Methods & Applications, 65(5), 963–983.
Pickem, D., Glotfelter, P., Wang, L., Mote, M., Ames, A., Feron, E., & Egerstedt, M. (2017). The robotarium: A remotely accessible swarm robotics research testbed. In IEEE International Conference on Robotics and Automation (ICRA), pp. 1699–1706. Singapore.
Acknowledgements
This work was supported by the Army Research Lab (No. DCIST CRA W911NF1720181).
Keywords
 Multirobot systems
 Reinforcement learning
 Distributed control