Introduction

Leveraging agent-based unmanned systems offers numerous advantages [1], especially in reducing the risk of human casualties and executing missions cost-effectively. As a typical task, the protection of specific regions or assets has garnered significant attention lately [2]. Lowe et al. [3] study predator cooperation to prevent prey from reaching their food source. Raboin et al. [4] appoint a team of unmanned surface vehicles to guard an asset in an environment featuring hostile boats and civilian traffic. Wang et al. [5] collaboratively guide multiple missiles to prevent a ship from reaching a target. Meng et al. [6] achieve cooperative harbor protection using autonomous underwater vehicles. Yu et al. [7] dispatch drones to intercept hostile drones and safeguard a military base. However, the diverse range of practical applications in region protection research results in varying research priorities. For instance, some studies focus on addressing underactuated dynamics control in unmanned systems [8] while simplifying the task to point protection rather than safeguarding entire areas. Furthermore, previous research has often overlooked critical factors such as the asymmetric capabilities of defenders and intruders, as well as collision damage among allied agents. This limitation diminishes the practicality and real-world applicability of existing approaches. Thus, this paper presents a more practical multi-agent region protection environment (MRPE) featuring fewer defenders, defender damages, and intruder evasion strategies targeting defenders.

The key to successful protection lies in advanced autonomous decision-making capabilities [9,10,11], and there are three main types of decision-making techniques [12]: game-based [13], optimization-based [14], and learning-based [15]. Game-based methods employ game theory to model protection processes and find equilibria such as Nash or Stackelberg equilibria. They are effective at capturing precise solutions [16], but they often require prior knowledge of the opponent strategies [17], thus limiting their applicability in real-world scenarios [18]. Optimization-based methods focus on finding optimal actions for task planning problems [19, 20] and usually adopt evolutionary algorithms or other metaheuristics, such as differential evolution (DE) [21], firefly algorithms [22], and particle swarm optimization [23]. Metaheuristics [24] excel in decision space exploration and have stable convergence properties. However, they encounter challenges in real-time decision-making [25, 26]. Learning-based methods [27], specifically deep reinforcement learning [28], have shown impressive performance in diverse domains, including Atari video games, Go, and StarCraft [29,30,31]. These methods do not rely on dynamic models or prior knowledge and show good application potential in region protection [32, 33]. In recent developments, deep reinforcement learning has yielded model-free, multi-dimensional, constrained, and stochastic optimization algorithms, which can produce instant multi-dimensional optimal policies [34]. However, the system in MRPE switches constantly due to agent damages, resulting in dynamic and multimodal rewards. Consequently, learning-based methods may encounter convergence challenges in MRPE. How to achieve rapid decision-making in a highly nonstationary environment thus becomes the core issue of this research.

Fortunately, Evolutionary Reinforcement Learning (ERL) shows promise in addressing the challenges encountered by optimization- and learning-based methods [35]. ERL integrates evolutionary algorithms, for global exploration and stable convergence, with learning-based approaches that train agents for real-time decision-making. By combining evolutionary algorithms and Deep Deterministic Policy Gradient (DDPG) [36], Khadka and Tumer [37] propose an ERL method for a single agent. This ERL approach demonstrates significant performance improvements over DDPG across all benchmarks. Nugroho et al. [38] apply an ERL approach that combines genetic algorithms and DDPG to powered descent guidance landing problems. In their experiments, agents trained using ERL achieve the highest fitness scores in extensive Monte-Carlo simulations. However, the aforementioned ERL methods are primarily designed for single-agent decision-making tasks [39]. When extended to multi-agent cooperation scenarios, they encounter the challenge of effective collaboration [40]. Firstly, agents are unable to exchange experiences or share information with their counterparts during the training process. To address this limitation, Lowe et al. [3] develop a centralized training and decentralized execution framework, successfully extending DDPG to Multi-Agent DDPG (MADDPG). Owing to its outstanding performance, MADDPG has become one of the most prominent methods in multi-agent reinforcement learning [41, 42]. Unfortunately, there are scarcely any multi-agent ERL methods that embrace the excellent training framework of MADDPG. Therefore, in this work, we utilize MADDPG to address this research gap. Secondly, credit assignment for multiple agents is essential for effective coordination, involving the accurate decomposition of team objectives into individual rewards [43, 44]. Credit assignment is quite challenging, akin to attributing individual contributions in a football match based solely on the score [45]. Furthermore, there is often a conflict of interest between agent teams and their members [46]. For example, a defender may receive a reward for approaching its closest intruder in MRPE; however, if all defenders prioritize the same intruder, the region may be left vulnerable to the other intruders. Thus, we propose an elite selection procedure to mitigate this conflict.

Fig. 1 State transitions within one episode of MRPE can be understood through two typical scenarios. In the initial scene (left scene), blue circular defenders aim to prevent red rectangular intruders from reaching the yellow region. In the right scene, four events in MRPE are depicted

After determining the algorithm framework, the configuration of elements within ERL plays a pivotal role in incentivizing policy optimizations [47], with a particular emphasis on the fitness and reward functions. Training processes commonly confront challenges associated with sparse rewards, wherein agents receive reinforcement feedback only upon achieving specific milestones or completing predefined objectives. This scarcity of rewards can complicate credit assignment, making it difficult for agents to discern the impact of their actions on the overall task. As a consequence, designing precise instant rewards poses an enduring challenge in reinforcement learning. Fortunately, the incorporation of fitness within ERL partially alleviates these challenges: owing to the stable convergence of evolutionary algorithms, the fitness can be calculated directly from sparse MRPE results, allowing for a more precise evaluation of team performance. Regarding the reward functions, existing studies in region protection [48,49,50,51] often utilize distance-based shaped rewards, including positive intruder-region distances and negative defender-intruder distances. Additionally, they typically add a large value to the final reward when an episode is completed. These reward structures incentivize defenders to pursue intruders and prevent them from entering the region. However, in MRPE, such mainstream rewards change abruptly and frequently during operation, leading to significant training challenges. Thus, we redesign the reward functions from three aspects to improve performance, ultimately achieving significantly enhanced success rates in extensive MRPE simulations.

In summary, this work has the following contributions:

  1. Developing a multi-agent region protection method (MRPM) through ERL modifications, significantly improving success rates of autonomous protection in highly nonstationary environments with fewer defenders.

  2. Proposing an elite selection procedure in MRPM to resolve conflicts between a defender team and its members, thereby enhancing coordination in multi-agent systems.

  3. Designing fitness and reward functions for MRPM to effectively drive policy optimizations, and verifying the effectiveness of MRPM by numerical simulations.

The remainder of this paper is organized as follows. “A challenging multi-agent region protection scene” constructs the challenging region protection environment MRPE. “Protection method combining DE and MADDPG” develops the multi-agent protection method MRPM by combining evolutionary optimization and MADDPG. “Numerical simulations” conducts numerical simulations and comparisons to substantiate the effectiveness of the proposed learning method. Conclusions are drawn and future works are presented in “Conclusions and future works”.

A challenging multi-agent region protection scene

Throughout the paper, \(\Vert *\Vert \) is the 2-norm of a vector \(*\), and \(*^{\top } \) represents the transpose of a matrix \(*\). \(\rho _{i}^{t}\), \(v_{i}^{t}\), and \(a _{i}^{t}\) are coordinate, velocity, and acceleration of agent i in x- and y- directions at moment t, respectively. \( *_{i,j}^t\) denotes the relative value of \(*\) between agents i and j at a temporal instant t.

In Fig. 1, we develop a more practical multi-agent region protection environment, namely MRPE, in which two opposing agent types are considered, i.e., defenders \(\{\text {def}\}\) and intruders \(\{\text {int}\}\). The defender team comprises \(N_{\text {def}}\) agents, while the intruder team consists of \(N_{\text {int}}\) agents. MRPE is confined within a square area with a side length of \(b_c\). The primary objective of the defenders is to prevent the intruders from accessing a central circular region c, characterized by the coordinates \(\rho _c\) and a radius of \(d_c\). To ensure generality, agents are randomly placed at the beginning of an MRPE episode: defenders are located near the center, while intruders are positioned at the edge of MRPE, i.e., \( 0< \Vert \rho _{i}^{0} -\rho _c \Vert< 0.3 b_c, i \in {\{\text {def}\}}\), and \(0.8 b_c< \Vert \rho _{j}^{0} -\rho _c \Vert < b_c, j \in \{\text {int}\}\). The initial velocities and accelerations of all agents are set to zero. All agents follow the kinematics and constraints:

$$\begin{aligned} {\dot{\rho }}_i^t = v_i^t&,&{\dot{v}}_i^t = a_i^t, \\ \nonumber \left\| v_{i}^t \right\| \le v_{i}^{\max }&,&\left\| a_{i}^t \right\| \le a_{i}^{\max }. \end{aligned}$$
(1)

Damages are introduced to improve practicality. Specifically, the defenders are designed to be stronger than the intruders: an intruder is considered damaged if it collides with any other agent, whereas a defender is considered damaged only when it collides with a companion defender. A series of flags are introduced to record agent states; for example, \(\text {Dam}_i = 1\) indicates that agent i is damaged, and \(\text {Dam}_i = 0\) otherwise. Once an agent is damaged (\(\text {Dam}_i = 1\)) or reaches the region c (\(\text {Arr}_i = 1\)), it is marked as done (\(\text {Run}_i = 0\)) and eliminated from MRPE.
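To make the environment dynamics concrete, the following Python sketch shows how one MRPE step could integrate Eq. (1) under the saturation constraints and then update the damage and done flags described above. The Euler step `dt`, the collision radius `d_col`, and the dictionary-based agent representation are illustrative assumptions, not details given in the paper.

```python
import numpy as np

def clip_norm(vec, max_norm):
    """Scale a 2D vector so its 2-norm does not exceed max_norm."""
    n = np.linalg.norm(vec)
    return vec if n <= max_norm else vec * (max_norm / n)

def step_agent(rho, v, a, v_max, a_max, dt=0.1):
    """One Euler step of the kinematics in Eq. (1) with saturation constraints."""
    a = clip_norm(a, a_max)
    v = clip_norm(v + a * dt, v_max)
    rho = rho + v * dt
    return rho, v

def update_flags(agents, rho_c, d_c, d_col=2.0):
    """Update Dam/Arr/Run flags after a step (d_col is a hypothetical collision radius).

    agents: list of dicts with keys 'rho' (np.array), 'team' ('def'/'int'),
            'Run', 'Dam', 'Arr'.
    """
    alive = [ag for ag in agents if ag['Run'] == 1]
    for idx, ai in enumerate(alive):
        for aj in alive[idx + 1:]:
            if np.linalg.norm(ai['rho'] - aj['rho']) <= d_col:
                for a, b in ((ai, aj), (aj, ai)):
                    # Intruders are damaged by any collision;
                    # defenders only by companion defenders.
                    if a['team'] == 'int' or (a['team'] == 'def' and b['team'] == 'def'):
                        a['Dam'] = 1
    for ag in alive:
        if ag['team'] == 'int' and np.linalg.norm(ag['rho'] - rho_c) <= d_c:
            ag['Arr'] = 1
        if ag['Dam'] == 1 or ag['Arr'] == 1:
            ag['Run'] = 0  # marked as done and eliminated from MRPE
```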

An episode in the protection scenario begins at \(t = 0\) and is completed when all the intruders are marked as done or when the given time T is exhausted. The mission of the defenders is to successfully protect the region. We first provide the following definition.

Definition 1

(Success or failure of the protection): The region protection is successful (\( \text {Succ} =1\)) if and only if all the defenders \( \{ \text {def}\}\) are alive (\( \text {Run}_i = 1, \forall i\in \{ {\text {def}}\} \)) and all the intruders \( \{ \textrm{int}\}\) are damaged (\(\text {Dam}_j= 1, \forall j \in \{ \text {int}\}\)) within the given time T. Otherwise, the protection fails (\( \text {Succ} =0\)).

According to Definition 1, designing effective defense rules is more complex than designing intrusion rules. Consequently, this research utilizes control rules for the intruders and introduces a learning-based protection approach for the defenders. Essentially, this work stages a confrontation between automatic intruders and autonomous defenders.

Intruder strategy: To achieve a practical MRPM, a smart opponent is necessary. Instead of straight or random movements, we design an intruder strategy comprising region attraction and collision avoidance, so that intruders can swiftly advance toward the protected area while avoiding interception and collisions. Firstly, a neighbor set \( {{\textbf{Inei}}_i^ t} = \{ j|\left\| {\rho _{i,j}^{t} } \right\| \le {d}, j\in {\{ \text {def} \} \cup \{ \text {int} \}}, j\ne i \} \) is defined to collect the nearby agents, where \(d > 0\) is a preset sensing distance. Secondly, as Eq. (2) shows, the region attraction \(\textbf{F}_{ic}\) is executed when \( {{\textbf{Inei}}_i^ t} \) is empty. Otherwise, collision avoidance, inspired by flocking control [52, 53], generates repulsive forces through an artificial potential function \(\varphi _{ij}\).

$$\begin{aligned} \textbf{n}_{ij}&= \rho _{i,j}^{t}/\left\| \rho _{i,j}^{t} \right\| , \quad \textbf{v}_{ij} = v_{i,j}^{t}/\left\| v_{i,j}^{t} \right\| ,\\ \varphi _{ij}&= -\frac{1}{2}\left[ \left( 1 - \left\| \rho _{i,j}^{t} \right\| /d_{\text {nei}} \right) ^2 + \left( \min \left\{ \cos \langle \textbf{n}_{ij}, \textbf{v}_{ij} \rangle , 0 \right\} \right) ^2 \right] ,\\ \textbf{F}_{ic}&= \left( k_\rho \cdot \rho _{i,c}^t + k_v \cdot v _{i,c}^t \right) /\left\| k_\rho \cdot \rho _{i,c}^t + k_v \cdot v _{i,c}^t \right\| ,\\ a_{i}^{t}&= {\left\{ \begin{array}{ll} a_{i}^{\max }\,\textbf{F}_{ic}, &{} \text {if } {\textbf{Inei}}_i^{t} = \emptyset ,\\ \sum \limits _{j \in {\textbf{Inei}}_i^{t}} \varphi _{ij} \cdot \textbf{n}_{ij}/N_{\textbf{Inei}}, &{} \text {if } {\textbf{Inei}}_i^{t} \ne \emptyset , \end{array}\right. } \end{aligned}$$
(2)

where \(\textbf{n}_{ij}, \textbf{v}_{ij}\) are the unit vectors of the relative position and relative velocity between agents i and j, \(\textbf{F}_{ic}\) is the unit vector of the attractive acceleration exerted by the central region, and \(k_\rho , k_v \) are control parameters, set to 1 and 3, respectively.
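The intruder strategy of Eq. (2) can be sketched as follows (Python). The sign conventions for the relative quantities \(\rho _{i,c}^t, v_{i,c}^t, \rho _{i,j}^t\) and the value of the sensing distance d are assumptions; here \(d_{\text {nei}}\) is taken equal to d, and the relative quantities are oriented so that \(\textbf{F}_{ic}\) points toward the protected region.

```python
import numpy as np

def intruder_acceleration(i, agents, rho_c, a_max, d=20.0, k_rho=1.0, k_v=3.0):
    """Eq. (2): region attraction when no agent is nearby, otherwise repulsive
    collision avoidance built from the potential phi_ij.

    agents: dict id -> {'rho': np.array(2), 'v': np.array(2), 'Run': 0/1}.
    """
    me = agents[i]
    nei = [j for j, ag in agents.items()
           if j != i and ag['Run'] == 1
           and np.linalg.norm(me['rho'] - ag['rho']) <= d]
    if not nei:
        # Region attraction F_ic: relative position/velocity taken as (c minus i)
        # so the unit vector points toward the protected center rho_c (assumption).
        f = k_rho * (rho_c - me['rho']) + k_v * (0.0 - me['v'])
        return a_max * f / (np.linalg.norm(f) + 1e-8)
    acc = np.zeros(2)
    for j in nei:
        rho_ij = me['rho'] - agents[j]['rho']   # relative position, i minus j (assumption)
        v_ij = me['v'] - agents[j]['v']
        n_ij = rho_ij / (np.linalg.norm(rho_ij) + 1e-8)
        v_hat = v_ij / (np.linalg.norm(v_ij) + 1e-8)
        cos_nv = float(np.dot(n_ij, v_hat))
        phi = -0.5 * ((1 - np.linalg.norm(rho_ij) / d) ** 2 + min(cos_nv, 0.0) ** 2)
        acc += phi * n_ij
    return acc / len(nei)
```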

Defender modeling: To introduce learning-based protection approaches for defenders, the protection process is approximated as a multi-agent extension of Markov decision processes [54], where each defender is described by a tuple \((S, O_{i}, A_{i}, R_{i})\). In MRPE, defenders cooperate with their partners and compete with the intruders. Thus, the global state space S describes the possible properties of all defenders. \(O_{i}\) denotes the local observations of defender i, and each defender receives an individual observation correlated with the state, \(S \longmapsto O_{i}\). \(A_{i}\) represents the action space, which contains all actions of the agent. \(S\times A_{i} \longmapsto R_{i}\) is a reward function assigning an extrinsic reward \(r_{i}\) for taking an action \(a_i\) under state \(s_{i}\).

As shown in Fig. 1, defender i generates an action \(a_i^t\) through its policy \(\pi _i\) and the local observation \(o_i^t, o_i^t \in O_i\) for an interaction with MRPE, i.e.,

$$\begin{aligned} a_i^t = \pi _i \left( o_i^t \right) . \end{aligned}$$
(3)

Thereby, defender i transitions to the next state \(s_i^{t+1}\) and receives a reward \(r_i^t\) before the next interaction with MRPE. A series of state transition tuples \((s_i^t, a_i^t, r_i^t, s_i^{t+1})\) of defender i are stored in an experience replay buffer D. Based on samples from the replay buffer D, the goal of each defender is to optimize its policy \(\pi _i\) so that the generated actions \(a_i^t\) maximize the long-term discounted cumulative reward \(R_i^t\) [44], i.e.,

$$\begin{aligned} {R_i^t} = r_i^{t+1}+\gamma r_i^{t+2} +\gamma ^2 r_i^{t+3}+ \cdots = \mathop \sum \limits _{k \in [0, T-t]} \gamma ^k r_i^{t+k+1}, \end{aligned}$$
(4)

where \(\gamma ( \in [0, 1])\) denotes a discount factor. To obtain the optimum policy \(\pi _i^*\), the state value function \({E_\pi } _i (R_i^t | s_i)\) is adopted as the objective function, which is the expectation of the long-term discounted cumulative reward \(R_i^t\) of defender i with the state \(s_i\).
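As a small illustration (not from the paper), the discounted return of Eq. (4) can be accumulated backward over a finished episode:

```python
def discounted_returns(rewards, gamma=0.95):
    """Compute R_i^t of Eq. (4) for every step t of one episode (rewards[t] = r_i^{t+1})."""
    returns, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example with three steps: the first return is 1.0 + 0.95*0.0 + 0.95**2*2.0
print(discounted_returns([1.0, 0.0, 2.0]))
```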

Now, we are ready to develop the main technical problem for this paper.

Problem 1: For each defender governed by Eq. (1), how to design an algorithm to obtain the optimal action policy \(\pi _i^*\), by maximizing the expectation of the cumulative reward \({E_\pi } _i (R_i^t | s_i)\), i.e.,

$$\begin{aligned} \pi _i^* = \mathop {\textrm{argmax}} \limits _{a_i^t = {\pi }_i(o_i^t) } {{E_\pi } _i \left( R_i^t | s_i \right) }. \end{aligned}$$
(5)

It is notable that MRPE introduces several challenges in solving Problem 1. Firstly, defenders face the daunting task of protecting an entire area, which is often larger than in typical scenes, while intruders only need to target a single point. This asymmetry in the protection task increases the difficulty for defenders. Secondly, the number of intruders exceeds that of defenders, and intruders possess avoidance abilities that make interception challenging. This leads to a limited interception time window for defenders, requiring highly efficient policies \(\pi _i\) to counter the intruders. Finally, both defenders and intruders can suffer damages during the region protection process, adding a dynamic and nonstationary aspect to MRPE.

Protection method combining DE and MADDPG

In this section, we propose a region protection method, MRPM, to address Problem 1 under these challenges. Firstly, we utilize an actor neural network \({\pi _\theta }_i(o_i)\), parameterized by \(\theta _i\), to approximate a defender policy \(\pi _i\). Thus, the optimization of \(\theta _i\) plays a crucial role in effectively tackling Problem 1. To enhance this optimization, we combine two distinct algorithms: a gradient-free optimization technique, DE, and a gradient-based approach, MADDPG. DE is employed to generate a diverse set of \(\theta _i\), helping overcome the weak convergence of MADDPG. Meanwhile, MADDPG updates \(\theta _i\) and accelerates local optimization in DE. Considering the numerous parameters involved, Table 1 provides separate introductions for parameters related to \(\theta _i\) in DE and MADDPG for improved clarity. Subsequently, an elite selection strategy is proposed for MRPM to resolve the goal conflict between the defender team and its members. Finally, specific reinforcement learning elements are designed to handle the MRPE challenges, including the action space, the state space, and the fitness and reward functions.

Table 1 Parameters related to \(\theta _i\) in MRPM

Elementary DE

Elementary DE consists of four processes: Initialization, Fitness Evaluation, Mutation, and Crossover. DE is a population-based algorithm that maintains a population matrix \(\textrm{POP}_g\) at every generation g, i.e., \(\text {POP}_g = \{x_1^{\top }, \cdots , x_i^{\top }, \cdots , x_{n_p}^{\top }\}^{\top } \), which consists of \(n_p\) individuals, where \(x_i\) denotes the ith individual. Firstly, in Initialization, DE randomly generates the nascent generation \(\text {POP}_0\) by

$$\begin{aligned} x_{i, j}=x_j^L+{\text {rand}}(0,1) \cdot \left( x_j^U-x_j^L\right) , \end{aligned}$$
(6)

where \(x_{i, j}\) is the jth element of individual \(x_i\). The symbols \(x_j^U\) and \(x_j^L\) denote the upper and the lower bounds of the jth element, respectively. Thereafter, Mutation and Crossover are adopted to generate a new population \(\text {POP}_g\) at every generation. Specifically, three individuals \(x_{r_1}\), \(x_{r_2}\), and \(x_{r_3}\) are randomly selected from \(\textrm{POP}_{g-1}\) to generate mutations by

$$\begin{aligned} {{\widehat{x}}}_i = x_{r_3} + {F_s}(x_{r_2} - x_{r_1}), i \ne r_1 \ne r_2 \ne r_3 \in [1, n_p], \end{aligned}$$
(7)

where \({{\widehat{x}}}_i\) is the ith mutation individual, and \(F_s (\in [0, 2])\) is the scaling factor. Subsequently, Crossover is employed to increase the diversity of solutions, where the original and the mutation individuals are chosen randomly according to the crossover probability \(C_r (\in [0, 1])\),

$$\begin{aligned} {{\widetilde{x}}}_i = \left\{ \begin{array}{ll} {{\widehat{x}}}_i, &{} {\text {rand}}_i \le C_r \ \text {or} \ i = j_{\text {rand}},\\ x_i, &{} \text {otherwise}, \end{array} \right. \end{aligned}$$
(8)

where \({{\widetilde{x}}}_i\) is the ith offspring individual, \({\text {rand}}_i (\in [0,1])\) is a uniformly distributed random value, and \(j_{\text {rand}} (\in [1,n_p])\) is a random integer. Finally, Fitness Evaluation is carried out to select the best individuals from the offspring.
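A minimal Python sketch of these DE operators is given below. Crossover is decided at the individual level, as Eq. (8) is written (the element-wise binomial crossover is the more common variant), and the one-to-one parent/offspring comparison in `de_select` is the standard DE selection rule assumed here; the `fitness` callable is assumed to map a population array to a vector of scores.

```python
import numpy as np

rng = np.random.default_rng(0)

def de_init(n_p, dim, x_low, x_up):
    """Initialization, Eq. (6): uniform population within [x_low, x_up]."""
    return x_low + rng.random((n_p, dim)) * (x_up - x_low)

def de_offspring(pop, F_s=0.48, C_r=0.25):
    """Mutation, Eq. (7), and crossover, Eq. (8): each offspring is either
    the whole mutant or the whole parent."""
    n_p, _ = pop.shape
    trials = pop.copy()
    j_rand = rng.integers(n_p)  # one index whose offspring is always the mutant
    for i in range(n_p):
        r1, r2, r3 = rng.choice([k for k in range(n_p) if k != i], size=3, replace=False)
        mutant = pop[r3] + F_s * (pop[r2] - pop[r1])
        if rng.random() <= C_r or i == j_rand:
            trials[i] = mutant
    return trials

def de_select(pop, trials, fitness):
    """Fitness Evaluation: keep whichever of parent/offspring scores higher
    (maximization); fitness maps an (n_p, dim) array to an (n_p,) vector."""
    keep = fitness(trials) >= fitness(pop)
    return np.where(keep[:, None], trials, pop)
```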

Elementary MADDPG

MADDPG is a reinforcement learning method with an actor-critic framework. Specifically, since the cumulative reward \(R_i^t\) in Eq. (4) is hard to compute directly, a critic neural network \(Q_i^\pi (s, a)\) parameterized by \( \varepsilon _i\) is adopted to approximate \(R_i^t\) for each defender. Meanwhile, an actor network \({\pi _\theta }_i(o_i^t)\) parameterized by \(\theta _i\) is trained to approximate the policy \(\pi _i\). The actor network \({\pi _\theta }_i\) chooses actions \(a_i^t\) according to the local observation \(o_i^t\), whereas the critic network \(Q_i^\pi \) produces an approximate value \({\widehat{Q}}_i\) of \(R_i^t\), so as to evaluate the action decided by the actor network. Besides, the critic networks require co-agent information (states s and actions a) to handle the nonstationarity of MRPE.

To solve Problem 1, the policy gradient \({\nabla _{{\theta _i}}}J({\theta _i})\) [3] is firstly derived as:

$$\begin{aligned} {\nabla _{{\theta _i}}}J({\theta _i}) = E \left[ {\nabla _{{\theta _i}}}{{\pi _\theta } _i}({a_i}|{o_i}) {\nabla _{{{\pi _\theta } _i}}}Q_i^\pi \left( s,{a_{1,\ldots ,N}}{|_{{a_i} = {\pi _i}({o_i})}} \right) \right] .\nonumber \\ \end{aligned}$$
(9)

To approximate the cumulative reward \(R_i^t\), the goal of the critic network \(Q_i^\pi \) is to minimize the squared loss \(L({\varepsilon _i}) \) between the critic network output \({\widehat{Q}}_i\) and \(R_i^t\), i.e., \(L({\varepsilon _i}) =E[ ({\widehat{Q}}_i - R_i)^2 ] \), minimized over \(\varepsilon _i\). Thus, the gradient of the critic network \({\nabla _{{\varepsilon _i}}}L({\varepsilon _i})\) is calculated by

$$\begin{aligned} {\nabla _{{\varepsilon _i}}}L({\varepsilon _i}) = E\left[ \left( Q_i^\pi (s,{a_{1,\ldots ,N}}) - y \right) \cdot {\nabla _{{\varepsilon _i}}}Q_i^\pi \left( s,{a_{1,\ldots ,N}}\right) \right] , \end{aligned}$$
(10)

where \( y = r_i^t+ \gamma Q{_i^{\pi ^ \prime } (s^{t+1},a^{t+1}_{1,\ldots ,N}|_{a^{t+1}_i = {\pi }_i'(o_i)})}\) is an approximate target value calculated by temporal difference (TD), and \( Q_i^{\pi ^ \prime }, \pi _i ^ \prime \) represent the target networks with delayed parameters \( {\varepsilon _i} ^ \prime , {\theta _i}^ \prime \). Sampling transitions \((s_i^t, a_i^t, r_i^t, s_i^{t+1})\) from the replay buffer D, the network parameters \(\theta _i, \varepsilon _i\) are iteratively updated by \(\theta _{i}^{k+1}= \theta _{i}^k+\alpha _c {\nabla _{{\theta _i}}}J({\theta _i}), \varepsilon _{i}^{k+1}= \varepsilon _{i}^k-\alpha _c \nabla L({\varepsilon _{i}})\). Since MADDPG is a gradient-based optimization method, defender policies are apt to fall into local optima, failing to intercept all the intruders.
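For illustration, a compact PyTorch-style sketch of one centralized-critic update for defender i is given below. The batch layout, network containers, and optimizer handling are assumptions made for the sketch, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def maddpg_update(i, batch, actors, critics, target_actors, target_critics,
                  actor_opts, critic_opts, gamma=0.95):
    """One centralized-critic update for defender i (cf. Eqs. (9)-(10)).

    batch: dict with 'state', 'next_state' ([B, state_dim]), 'rew_i' ([B, 1]),
           and per-agent lists 'obs', 'next_obs' ([B, obs_dim]) and 'acts' ([B, act_dim]).
    """
    # Critic step: regress Q_i(s, a_1..N) onto the TD target y built from target networks.
    with torch.no_grad():
        next_acts = [ta(o) for ta, o in zip(target_actors, batch['next_obs'])]
        y = batch['rew_i'] + gamma * target_critics[i](
            torch.cat([batch['next_state']] + next_acts, dim=-1))
    q = critics[i](torch.cat([batch['state']] + batch['acts'], dim=-1))
    critic_loss = F.mse_loss(q, y)
    critic_opts[i].zero_grad()
    critic_loss.backward()
    critic_opts[i].step()

    # Actor step: ascend the policy gradient of Eq. (9) by replacing agent i's stored
    # action with the one produced by its current actor and maximizing the critic value.
    acts = [a.detach() for a in batch['acts']]
    acts[i] = actors[i](batch['obs'][i])
    actor_loss = -critics[i](torch.cat([batch['state']] + acts, dim=-1)).mean()
    actor_opts[i].zero_grad()
    actor_loss.backward()
    actor_opts[i].step()
```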

Fig. 2 The MRPM framework combines DE for global exploration with MADDPG for defender training. In the diagram, DE-related operators are denoted in blue, MADDPG in orange, and MRPE in green. Numerical labels signify the execution sequence of these operators within the framework

The framework of the proposed MRPM

The MRPM framework is presented in Fig. 2, which integrates DE for global exploration and employs MADDPG to train defenders. Operators in DE, MADDPG, and MRPE are color-coded blue, orange, and green, respectively. Besides, numerical labels indicate the execution sequence of the operators. An overview of these operators follows:

Initialization: At the beginning, a DE population \({\textrm{POP}_g}\) is initialized at the first generation (\(g = 0\)) by Eq. (6), where each individual \(x_i\) in \(\text {POP}_g\) represents the actor network parameters of a defender team, i.e., \(x_i = \{\theta _1^{\top }, \theta _2^{\top }, \cdots ,\theta _{N_{\text {def}}}^{\top }\}\). Additionally, a recorder denoted as \(x_{PG} = \{\theta _1^{\top }, \theta _2^{\top }, \cdots ,\theta _{N_{\text {def}}}^{\top }\}\) is randomly initialized, which stores the actor parameters of the defender team trained by MADDPG. Subsequently, generation loops are executed, comprising operators 1–8 in Fig. 2.

Interaction with MRPE (DE \(\rightarrow \) MADDPG): In operators 1–4, each defender team \(x_i\) in \(\text {POP}_g\) continuously interacts with MRPE over a whole episode for fitness evaluation; detailed explanations of the fitness evaluation are provided in the following section. Meanwhile, during operator 2, each interaction with MRPE generates a state transition \((s_i^t, a_i^t, r_i^t, s_i^{t+1})\), which is stored in the replay buffer D, as illustrated in the orange operator 3. Thus, by passing state transitions, this procedure can be regarded as an interface from DE to MADDPG.

MADDPG training: In orange operator 4, MADDPG updates \(x_{PG}\) by sampling from the replay buffer D. This update is carried out using policy gradients defined in Eq. (9), specifically: \(\theta _{i}^{k+1}= \theta _{i}^k+\alpha _c {\nabla _{{\theta _i}}}J({\theta _i})\), facilitating the training of defender teams.

Elite selection (MADDPG \(\rightarrow \) DE): In the green operators 5–6, elite selection is proposed to create an elite population \(\text {POP}_g^e\) by combining \(\text {POP}_g\) from DE and \(x_{PG}\) from MADDPG. Specifically, each individual \(x_i\) in \(\textrm{POP}_g\) is assessed based on two fitness values: the team fitness \(F_T\) and the member fitness \(F_m\). These fitness values capture the distinct objectives of defender teams and their individual members, providing essential information for the selection process. The individuals in \(\text {POP}_g\) are ranked in descending order of team fitness \(F_T\), forming a preliminary \(\text {POP}_g^e\). Additionally, a defender actor \(\theta _i\) with the highest member fitness \(F_m\), marked as \(\text {Max}_i\), is selected to create a team of elite members \(x_e\). This team replaces the penultimate individual \(x_{-2}\) in \(\text {POP}_g^e\). Finally, the worst individual \(x_{-1}\) is replaced with the MADDPG team \(x_{PG}\). The elite selection process completes \(\text {POP}_g^e\), which is directly assigned to the new generation \(\text {POP}_{g+1}\). Similarly, by passing \(x_{PG}\) to \(\text {POP}_{g+1}\), this procedure serves as an interface from MADDPG to DE.
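A plain-Python sketch of this elite selection step is given below, assuming team and member fitness values have already been evaluated for the current population; the per-slot choice of the highest-\(F_m\) actor is our reading of how \(x_e\) is assembled.

```python
def elite_selection(pop, team_fitness, member_fitness, x_pg):
    """Elite selection sketch.

    pop: list of teams, each a list of per-defender actor parameters theta_i.
    team_fitness[k]: F_T of pop[k]; member_fitness[k][i]: F_m of defender i in pop[k].
    x_pg: the MADDPG-trained team.
    """
    # Rank teams by descending team fitness F_T.
    order = sorted(range(len(pop)), key=lambda k: team_fitness[k], reverse=True)
    elite = [pop[k] for k in order]
    # Build a team of elite members: for each defender slot, take the actor with the
    # highest member fitness F_m across the whole population (assumed interpretation).
    n_def = len(pop[0])
    x_e = []
    for i in range(n_def):
        best_k = max(range(len(pop)), key=lambda k: member_fitness[k][i])
        x_e.append(pop[best_k][i])
    elite[-2] = x_e    # replace the penultimate individual with the elite-member team
    elite[-1] = x_pg   # replace the worst individual with the MADDPG team
    return elite       # becomes POP_{g+1}
```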

Actor evolution: In the blue operators 7–8, a new actor population \(\text {POP}_{g+1}\) is generated from \(\text {POP}_g^e\) using the DE operators described in Eqs. (7)–(8). Consequently, the actor population \(\text {POP}_g\) evolves over the generation loops.

Optimum team actor selection: Finally, the MADDPG recorder \(x_{PG}\) is also continuously updated over the generation loops. The optimum team actor \(x_i^*\) can then be selected from \(\textrm{POP}_g\) and \(x_{PG}\).

The proposed elements in MRPM

To enhance the performance of MRPM, meticulous design of its elements is particularly important. These elements encompass the action space \(A_i\), state \(s_i^t\), the fitness measures (\(F_T, F_m\)), and the reward function \(r_i^t\). The specific techniques are explained as follows.

Action space with policy ensembles: The design of the action space demands both comprehensiveness and efficiency. In accordance with the dynamic model specified in Eq. (1), the action space \(A_i\) for defender i is defined to include five directions of acceleration: up \(a_\uparrow \), down \(a_\downarrow \), left \(a_\leftarrow \), right \(a_\rightarrow \), and stop \(a_s\). Each acceleration value is bounded within the range [0, 1]. The final action \(a_i^t\) is represented as \(a_i^t = (a_\rightarrow - a_\leftarrow , a_\uparrow - a_\downarrow )\).

State space with two moment properties: It has been found that incorporating sequential states aids decision-making. Consequently, the state \(s_i\) is defined based on the properties of the current and previous moments, represented as \(s_i^t = [p_i^{t-1}, p_i^t]\). Each moment property \(p_i^t\) includes the protection situation of defender i, denoted as \(\chi _i^t = {[\rho _{i}^{t},v_{i}^{t},\text {Run}_i]}\), as well as the relative situations between defender i and the other agents, denoted as \( \chi _{i,j}^t ={[\rho _{i,j}^t,v_{i,j}^t,\text {Run}_{i,j}^t]}, j \in {\{ \text {def}\} \cup \{ \text {int}\}}, j \ne i\).
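These two design choices can be captured in a few lines of Python; the clipping of the raw actor outputs and the flat concatenation of the two moment properties are assumptions about the exact encoding.

```python
import numpy as np

def action_from_outputs(raw):
    """Map the actor's five outputs (right, left, up, down, stop), each in [0, 1],
    to the planar acceleration command a_i^t = (a_right - a_left, a_up - a_down)."""
    a_right, a_left, a_up, a_down, _a_stop = np.clip(raw, 0.0, 1.0)
    return np.array([a_right - a_left, a_up - a_down])

def build_state(prop_prev, prop_curr):
    """Two-moment state s_i^t = [p_i^{t-1}, p_i^t], where each moment property stacks
    the defender's own situation chi_i and the relative situations chi_{i,j}."""
    return np.concatenate([prop_prev, prop_curr])
```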

Fitness function: Fitness functions serve as essential criteria for optimizing the actor parameters \(\theta _i\), regulating each defender’s behavior to achieve a common task. DE is insensitive to sparse rewards, so the fitness can be obtained after completing an MRPE episode instead of at every protection moment. The interception result can therefore be used directly as the fitness, leading to more accurate optimization of defender behavior. Specifically, a result is obtained after a Roll procedure, in which a team policy \(x_i \in \text {POP}_g\) continuously interacts with MRPE over an episode. These results can be classified into two scenarios: protection success and protection failure. For a more detailed recording, we consider metrics such as the number of damaged intruders \((N_{\text {int0}}-N_{\text {arr}})\), the count of surviving defenders \(N_{{\text {def}1}}\), and the completion time \(T_{c}\). These metrics collectively form the team fitness \(F_T\) for performance evaluation, i.e.,

$$\begin{aligned} F_T = \left\{ \begin{array}{ll} ({N_{\text {int0}}-N_{\text {arr}}} ) \cdot {10^4} + {N_{{\text {def}1}}} \cdot {10^3} + ({T} - {T_{c}}), &{} \text {if Succ} =1,\\ ({N_{\text {int0}}-N_{\text {arr}}} ) \cdot {10^4} + {N_{{\text {def}1}}} \cdot {10^3} + T_{c}, &{} \text {if Succ} =0, \end{array} \right. \end{aligned}$$
(11)

where \( {N_{\text {int0}}}\) and \({N_{\text {arr}}} \) are the numbers of done and arrived intruders, respectively, so \( ({N_{\text {int0}}-N_{\text {arr}}} )\) is the number of damaged intruders. For successful protection, a shorter completion time is more favorable, so the remaining time \(({T} - {T_{c}})\) is selected as the metric. In the case of failed protection, however, a longer completion time \(T_{c}\) is preferred, as it indicates that the defenders exert maximum effort to delay the intruders. Similarly, for a defender, let \(n_{\text {int0}}\) be the total number of intruders it intercepts, and \(T_{\text {int}}\) be the time at which it intercepts its last intruder. The member fitness \(F_m\) is then calculated by \(F_m = n_{\text {int0}}\times {10^4} + \text {Run}_i \times {10^3} + (T-T_{\text {int}})\), which encourages a defender to intercept more intruders in the least amount of time.
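Both fitness measures translate directly into code; the sketch below mirrors Eq. (11) and the \(F_m\) expression, with the weighting factors \(10^4\) and \(10^3\) taken from the text.

```python
def team_fitness(succ, n_int0, n_arr, n_def1, T, T_c):
    """Team fitness F_T of Eq. (11): damaged intruders dominate, then surviving
    defenders, then the time term (remaining time if successful, elapsed time if not)."""
    time_term = (T - T_c) if succ == 1 else T_c
    return (n_int0 - n_arr) * 1e4 + n_def1 * 1e3 + time_term

def member_fitness(n_int0_i, run_i, T, T_int_i):
    """Member fitness F_m: interceptions by defender i, its survival flag, and the
    time remaining after its last interception."""
    return n_int0_i * 1e4 + run_i * 1e3 + (T - T_int_i)
```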

Reward function: Real-time rewards are the only motivation for learning. However, accurately decomposing the team protection performance into member rewards is challenging. To overcome this, an ingenious reward function for MADDPG is designed in Eq. (12) with three terms to approximate the protection performance.

$$\begin{aligned} r_i^t = {r_{\text {usu}}}_i^t + {r_{\text {cha}}}_i^t+{r_{\text {rev}}}_i^t, \end{aligned}$$
(12)

where \( {r_{\text {usu}}}_{i}^{t}, {r_{\textrm{cha}}}_{i}^{t}, {r_{\text {rev}}}_i^t\) denote the usual, change, and revised rewards, respectively. \({r_{\text {usu}}}_i^t\) encourages defenders to pursue intruders, whereas \({r_{\text {cha}}}_i^t\) enhances this promoting effect. \({r_{\text {rev}}}_{i}^{t}\) adjusts defender behavior according to interception results. All three terms cooperatively shape the reward \(r_i^t\) to match the task objectives, so that the actor network parameters \(\theta _i\) can be prevented from falling into local optima.

The design of \({r_{\text {usu}}}_{i}^{t}\) follows a similar approach to the mainstream method [48,49,50,51], represented as:

$$\begin{aligned} {r_{\text {usu}}}_{i}^{t} = -k_\alpha \left\| {\rho _{i,{{\tilde{i}}}}^t} \right\| + k_\alpha \left\| {\rho _{{{{\tilde{i}}}},c}^t} \right\| + \sum \limits _{j \in \{\text {int}\}} {\left\| {\rho _{j,c}^t} \right\| } /N_{\text {int}} - \sum \limits _{j \in {\textbf{nei}}_i^t} {{{\left( 1 - \left\| {\rho _{i,j}^t} \right\| /{{d}}\right) }^2}}, \end{aligned}$$
(13)

where \({{\tilde{i}}}\) is the nearest intruder to defender i, and \({{\tilde{i}}}\) varies over time. An attention weight \(k_\alpha \) is applied to the terms involving \({{\tilde{i}}}\), which prevents defenders from constantly swinging among different intruders. \( {\textbf{nei}}_i^t\) is the neighbor set of defender i, i.e., \({\textbf{nei}}_i^t = \{ j \,|\, \left\| {\rho _{i,j}^t} \right\| \le {{d}}, j \in \{ \text {def}\}\}\). In Eq. (13), the first term is a negative defender-intruder distance, prompting defenders to approach their nearest intruders. The second and third terms are positive intruder-region distances, urging defenders to drive intruders away. The attention weight \(k_\alpha \) is assigned to the second term to increase the influence of the nearest intruder. The last term is built from defender-defender distances for collision avoidance, ensuring the safe navigation of defenders.

The change reward \({r_{\text {cha}}}_{i}^{t}\) is defined as the finite difference of \({r_{\text {usu}}}_{i}^{t}\) between consecutive steps, which encourages agents to make progress and penalizes them for retreating or moving away from their objectives. This reward enhances the promoting effect of \({r_{\text {usu}}}_{i}^{t}\) and is expressed as:

$$\begin{aligned} {r_{\text {cha}}}_{i}^{t} = {r_{\text {usu}}}_{i}^{t} - {r_{\text {usu}}}_{i}^{t-1}. \end{aligned}$$
(14)

However, certain events cannot be adequately captured by \({r_{\text {usu}}}_{i}^{t}\) and \({r_{\text {cha}}}_{i}^{t}\), such as agent damage or protection success. Typically, the reward \(r_{i}^{t}\) is adjusted directly by incorporating sharp increases or decreases, which introduces high nonstationarity. To mitigate the impact of such dramatic changes, we introduce the revised reward \({r_{\text {rev}}}_{i}^{t}\), which decomposes the sharp value \(S_v\) into a sequence of differences:

$$\begin{aligned} {r_{\text {rev}}}_{i}^{t}= S_v \cdot { \frac{{2 \Delta t^2}}{{{T_{e}}({T_{e}} + \Delta t)}}} \cdot t, \end{aligned}$$
(15)

where \(T_{e}\) represents the moment at which the event happens, and \( \Delta t \) is the time interval. Larger absolute values are assigned to the \({r_{\text {rev}}}_{i}^{t}\) sequence at steps closer to the event. The sharp value \(S_v\) is set according to the event: (a) \(S_v= -T_{e}\), if the intruder \({{\tilde{i}}}\) arrives, and \(S_v= -T_{e}/N_{\text {int}}\), if another intruder arrives; (b) \(S_v= 2T_{e}\), if the intruder \({{\tilde{i}}}\) is intercepted; (c) \(S_v= -2T_{e}\), if the defender is damaged; (d) \(S_v= -5T_{c}\), if Succ = 1, and \(S_v=5T_{c}\), if Succ = 0.
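The three reward terms can be sketched as follows (Python); the attention weight \(k_\alpha \) and the neighbor distance d are illustrative values, since the paper does not report them.

```python
import numpy as np

def usual_reward(rho_i, rho_tilde, rho_c, rho_ints, rho_defs, k_alpha=2.0, d=20.0):
    """r_usu of Eq. (13): chase the nearest intruder, push intruders away from the
    region, and avoid companion defenders (k_alpha and d are illustrative values)."""
    r = -k_alpha * np.linalg.norm(rho_i - rho_tilde)        # defender-intruder distance
    r += k_alpha * np.linalg.norm(rho_tilde - rho_c)        # nearest intruder vs region
    r += np.mean([np.linalg.norm(p - rho_c) for p in rho_ints])  # all intruders vs region
    for p in rho_defs:                                      # companion defenders within d
        dist = np.linalg.norm(rho_i - p)
        if dist <= d:
            r -= (1 - dist / d) ** 2
    return r

def change_reward(r_usu_t, r_usu_prev):
    """r_cha of Eq. (14): finite difference of the usual reward."""
    return r_usu_t - r_usu_prev

def revise_reward(S_v, T_e, t, dt=0.1):
    """r_rev of Eq. (15): spread the sharp event value S_v over the steps up to T_e."""
    return S_v * (2 * dt ** 2) / (T_e * (T_e + dt)) * t
```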

Algorithm 1 MRPM

Algorithm 1 outlines the proposed MRPM, encompassing steps 1–26 for the training process, with step 27 dedicated to the autonomous decision-making process in MRPE. The fitness and reward functions serve as the unique criteria in optimizing actor parameters \(\theta _i\), and their design is meticulously tailored to the objective and outcomes of a defender team. Consequently, these functions enable the precise regulation of each agent’s behavior to collectively accomplish a shared task after training.

Numerical simulations

Numerical simulations are conducted to assess the effectiveness of MRPM. Specifically, we compare MRPM with other protection methods and variations in rewards across two different MRPE cases. MRPM-I is used for method comparisons, while MRPM-II focuses on variations in reward functions. Here are the details.

Parameter settings and evaluation criteria

In the training process, the parameters can be divided into three classes: MRPE, MADDPG, and DE. The configuration of these parameters is crucial for ensuring convergence and effectiveness. The MRPE parameters can be selected according to the actual interception demand, such as the location and size of the protected region. They are configured in Table 2: MRPE is bounded within a coordinate system of range \([0, 200]^2\), featuring a circular defense region centered at \(\rho _c = (100, 100)\) with a radius \(d_c = 12\). Since there are more intruders than defenders, it is almost impossible for the defenders to successively intercept all attackers without a speed advantage. Thus, the maximum accelerations \(a_i^{\max }\) of the intruders and the defenders are 2 and 4, whereas the maximum velocities \(v_i^{\max }\) are 6 and 12, respectively. A time limit T of 30 s is set, which would allow all intruders to reach the region if unimpeded by defenders. To assess performance, each method is evaluated on two different MRPE cases: one with 2 defenders versus 3 intruders, and the other with 3 defenders versus 5 intruders. Importantly, all parameters are normalized during agent state calculations to enhance learning performance.

Table 2 MRPE Configuration
Table 3 Structure and parameter of actor and critic networks

The selection of MADDPG parameters can be guided by the recommendations in [33]. Based on prior experience, we set the discount factor \(\gamma \) to 0.95, and the learning rates \(\alpha _c\) for both the actor and critic networks are set to 0.02. The replay buffer size is \(5 \times 10^{5}\), and the batch size for updating is 256. Besides, networks with 5 dense layers have demonstrated favorable performance in region protection problems. Such networks offer several advantages. Firstly, they can capture the highly intricate and nonlinear associations inherent in the data. Secondly, the increased depth facilitates hierarchical and abstract representations of the input, enabling nuanced modeling of intricate relationships. This architecture is particularly adept at generating actions from states or approximating cumulative rewards from states and actions, as both are complex functions of the input. Conversely, there are notable disadvantages: such networks may require a large amount of data for effective training, although reinforcement learning mitigates this by continuously sampling data for ongoing training, and they are computationally intensive, so training time and resource requirements can become substantial. These considerations may render networks with 5 dense layers less practical for certain domains. The layers and properties of the actor and critic networks are provided in Table 3. In the actor networks, the inputs are the local observations \(o_i^t\), and the outputs are the actions \(a_i^t\), represented as \((a_\rightarrow - a_\leftarrow , a_\uparrow - a_\downarrow )\). In the critic networks, the inputs comprise all the states s and actions a, and the output is an approximate value \({\widehat{Q}}_i\) of \(R_i^t\).

Referring to [55], DE is configured with the following parameter settings: the scaling factor \(F_s\) is 0.48, the crossover probability \(C_r\) is 0.25, and the population size \(n_p\) is 10. Finally, the maximum number of generations is set to 3.5 million to ensure convergence and stability.

For the evaluation criteria, we select the following metrics: success rates (SRs), the fitness function \(F_T\), and the team reward function \(R_T\). SRs are the most intuitive and critical criterion. To mitigate the impact of randomness, defenders trained by each method autonomously execute the mission in \(1 \times 10^5\) different episodes for each case, and the number of successful protections is determined according to Definition 1. In addition to SRs, we consider more detailed metrics for data recording, including the number of damaged intruders \((N_{\text {int0}}-N_{\text {arr}})\), the count of surviving defenders \(N_{{\text {def}1}}\), and the completion time \(T_{c}\). Since these metrics collectively contribute to the fitness function \(F_T\), \(F_T\) is selected as another criterion. Subsequently, to examine the learning effect, we introduce the team reward \(R_T\), which is the accumulation of the true rewards of the defender team in one episode, calculated as follows:

$$\begin{aligned} R_T =\sum \limits _{ t \in [0, T]} \sum \limits _{ i \in \{ \text {def}\} }{r_i^t}. \end{aligned}$$
(16)

Finally, it is expected that these three criteria will exhibit similar growing trends as methods become more effective in training.
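As an illustration, the evaluation loop for SRs and the team reward \(R_T\) of Eq. (16) could be organized as below; `run_episode` is a hypothetical helper that rolls out the trained defenders once and returns the success flag and per-step member rewards.

```python
def evaluate(run_episode, n_episodes=100):
    """Estimate the success rate (SR) and the average team reward R_T of Eq. (16).

    run_episode(): assumed to return (succ, team_rewards) with
    team_rewards[t][i] = r_i^t for defender i at step t of one episode.
    """
    successes, team_returns = 0, []
    for _ in range(n_episodes):
        succ, team_rewards = run_episode()
        successes += int(succ)
        team_returns.append(sum(sum(step) for step in team_rewards))
    return successes / n_episodes, sum(team_returns) / n_episodes
```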

Validation of intruder strategy

In MRPE, designing an intelligent intruder and its strategy is crucial. Instead of random actions, we have developed an intruder strategy that enables swift movement toward the protection area while avoiding interception and collisions. To demonstrate its effectiveness, we conduct simulations comparing a random intruder strategy with the proposed strategy, as shown in Fig. 3. In these simulations, the number of defenders matches the number of intruders, and the defenders are either stationary or exhibit inefficient protection movements. The inefficient defenders are consistently attracted toward the nearest intruder, with the maximum acceleration and velocity both limited to 1.

Fig. 3 Four examples of intruders controlled by the random and the proposed strategies. When confronted by static or inefficient defenders, the random intruders consistently fail to reach the region. In contrast, all intruders employing the proposed strategy safely arrive within less than 24 s

In a series of ten simulations, the random intruders consistently fail to reach the region, whereas all intruders employing the proposed strategy safely arrive within 24 s when confronted by static or inefficient defenders. Specifically, Fig. 3 (1) depicts an example of random intruders facing static defenders. Despite the absence of disturbances, the intruders are unable to reach the region. In Fig. 3 (2), random intruders confront inefficient defenders. With the exception of one intruder leaving the MRPE area, the remaining intruders are almost entirely damaged: two collide, one is intercepted, and the sole survivor remains only because the MRPE time runs out; it would soon be captured. Conversely, Fig. 3 (3) and (4) illustrate the outcomes of the proposed intruder strategy, where all intruders safely reach the region. These examples underscore the superiority of our intruder strategy, which greatly increases the protection difficulty.

MRPM-I

The MRPM-I experiment involves four methods: DE, MADDPG, MRPM-we (without elite selection), and the proposed MRPM. DE represents an optimization-based protection method. MADDPG is a well-established reinforcement learning algorithm for multi-agent protection and confrontation [32, 33, 49,50,51]. MRPM-we can be considered a standard ERL method. Notably, all the methods except DE incorporate policy gradients, and all the methods except MADDPG incorporate DE.

Table 4 presents the SRs of the methods over \(1 \times 10^5\) episodes for the two MRPE cases. In both cases, the proposed MRPM achieves the highest SRs of 97.02% and 91.27%, outperforming the other methods. In contrast, DE cannot successfully protect the region even once, highlighting the challenge of the scenario. Furthermore, the SRs of MADDPG in the two cases are 43.36% and 40.49%, while the SRs of MRPM-we are 68.31% and 72.26%, demonstrating the superiority of ERL. Notably, each method shows similar SRs across both cases, indicating that the methods are not highly sensitive to the number of agents.

Table 4 SRs of the methods in \(1 \times 10^5\) episodes for the two MRPE cases

To monitor the training process, the reward and fitness curves for these methods in the two cases are shown in Figs. 4 and 5. The methods are tested every 10 generations, and the average results of 10 experiments are reported. The learning curves suggest that defenders are still able to intercept some intruders and survive in most protection failure episodes. This demonstrates the difficulty of achieving successful protection, the potential for improvement through training, and the necessity of using both the fitness and the reward function as performance criteria.

Fig. 4 Learning curves of MRPM-I for case 1. The left graph displays the average reward for each generation, while the right one shows the average fitness. Each method is distinguished by a different color. The shaded areas represent the confidence intervals for each method, indicating the mean ± standard deviation across ten different random episodes

Fig. 5 Learning curves of MRPM-I for case 2. The left graph displays the average reward for each generation, while the right one shows the average fitness. Each method is distinguished by a different color. The shaded areas represent the confidence intervals for each method, indicating the mean ± standard deviation across ten different random episodes

With the exception of DE, all methods exhibit a gradual increase in both reward and fitness curves, which underscores the importance of policy gradients. DE initially shows slight growth in rewards, but they plateau at around 0.4 after one million training generations, and the fitness curves do not show significant improvement. This can be attributed to the challenging nature of MRPE and the high dimensionality of the actor network parameters, where DE requires extensive exploration to find a better solution in the vast parameter space. Consequently, the fitness curves for DE remain at a consistently low level: according to Eq. (11), DE can only delay intruders to some extent but struggles to intercept them within the current training generations. Therefore, policy gradients indeed accelerate the DE optimization process in this problem. Conversely, the lack of DE makes MADDPG perform worse than MRPM-we and MRPM, as DE plays an indispensable role in producing diverse samples. Evidently, MADDPG gets trapped in local optima, as it only manages to intercept an average of about 2.3 intruders in case 1 and 4 intruders in case 2. Additionally, the shaded areas of MADDPG are larger in both cases, indicating higher instability. Thus, it becomes evident that the combination of DE and policy gradients is essential. MRPM outperforms MRPM-we in terms of final convergence reward and fitness value, showcasing the effectiveness of the proposed elite selection strategy.

Fig. 6 Learning curves of MRPM-II for case 1. The left graph displays the average reward for each generation, while the right one shows the average fitness. Each method is distinguished by a different color. The shaded areas represent the confidence intervals for each method, indicating the mean ± standard deviation across ten different random episodes

Fig. 7 Learning curves of MRPM-II for case 2. The left graph displays the average reward for each generation, while the right one shows the average fitness. Each method is distinguished by a different color. The shaded areas represent the confidence intervals for each method, indicating the mean ± standard deviation across ten different random episodes

Fig. 8 Region protection processes of MRPM-I methods in case 2

Fig. 9 Region protection processes of MRPM-II methods in case 2

MRPM-II

MRPM-II consists of four types of reward functions: MRPM-u, MRPM-ur, MRPM-uc, and the proposed MRPM. In MRPM-u, the mainstream reward design method is used. The reward values, denoted as \({r_{\text {usu}}}_{i}^{t}\), are calculated at each time step, and a sharp value \(S_v\) in Eq. (15) is added when specific events occur. MRPM-ur removes the change reward \({r_{\text {cha}}}_{i}^{t}\) component from Eq. (12), while MRPM-uc removes the \({r_{\text {rev}}}_{i}^{t}\) term.

Similarly, Table 4 records SRs of MRPM-u, MRPM-ur, and MRPM-uc. MRPM-u fails to achieve success in both cases, underscoring the ineffectiveness of using a single reward term \({r_{\text {usu}}}_{i}^{t}\) without a hybrid approach. In case 1, MRPM-ur and MRPM-uc achieve SRs of 60.74% and 13.65%. However, as the number of agents increases in case 2, there is a notable decline in SRs, dropping to 3.21% and 0.40% for MRPM-ur and MRPM-uc, respectively. Particularly, MRPM-ur demonstrates relatively high SR in case 1 but struggles to succeed in case 2. The performance decline can be attributed to the lack of the revised reward \({r_{\text {rev}}}_{i}^{t}\), which shows increased significance in scenarios with a higher number of agents. In cases with fewer agents, the role of \({r_{\text {rev}}}_{i}^{t}\) is limited due to the reduced frequency of events. Finally, when compared with other MRPM-II methods, the proposed MRPM still achieves the highest SRs in both cases, highlighting the effectiveness of the fitness design.

The learning curves of these methods are presented in Figs. 6 and 7. From the reward curves, it can be observed that MRPM, MRPM-ur, and MRPM-uc exhibit gradual improvement in finding better actor parameters. However, MRPM-u shows a constantly fluctuating reward curve with no significant improvement. Based on the fitness curves, MRPM-u's training process appears highly ineffective, as it manages to intercept an average of only 0–1 intruders in both cases. Consequently, MRPM-u fails to achieve protection success, consistent with the findings presented in Table 4. This can be attributed to the high nonstationarity of MRPE, where sharp values \(S_v\) are frequently added to the rewards. Regarding the fitness curves, only MRPM successfully captures action networks near the global optimum, enabling it to intercept all intruders after 2.5 million training generations. Besides, the reward and fitness curves of MRPM exhibit similar trends, demonstrating the good interpretability of the reward design. On the other hand, MRPM-ur and MRPM-uc become trapped in local optima and fail to achieve the same level of performance. MRPM-ur intercepts 2–3 intruders in both cases; although its SRs decrease from case 1 to case 2, the number of interceptions remains almost the same. Thus, the interception capability does not change, and the decline in SRs is mainly due to the increased number of intruders. MRPM-uc lags significantly behind MRPM-ur, underscoring the stability and necessity of the change reward \({r_{\text {cha}}}_{i}^{t}\), as it strengthens defender behaviors of pursuing intruders. Therefore, the absence of \({r_{\text {cha}}}_{i}^{t}\) results in sparse rewards, leaving defenders uncertain about their actions. Ultimately, MRPM still achieves the best learning performance in MRPM-II, and thus the proposed reward in Eq. (12), enhanced by the three terms, demonstrates its superiority.

Protection processes

Since case 2 is more complex than case 1, we illustrate representative MRPE processes of case 2 in Figs. 8 and 9 to offer a better understanding of the performance variations among the methods. Figure 8 shows the MRPM-I protection processes, where Fig. 8a–c are successful scenarios. In these three scenarios, intruder 6 is the closest to the region. Compared to the other methods, MRPM maintains the largest distance between intruder 6 and the region. This outcome validates the high interception efficiency of MRPM, which can be attributed to the proposed elite selection strategy, enabling the rapid formation of a team comprising defenders with high fitness values.

A notable observation is that defenders in MRPM adopt a spiral chasing method, allowing them to maintain maximum speed for a longer duration. This approach results in relatively smooth trajectories with larger turning radii. In contrast, defenders in MADDPG tend to move directly towards intruders. However, after intercepting the first batch of intruders (5, 7, 8), defenders in MADDPG are required to slow down and turn around to pursue intruders 4 and 7, resulting in a waste of time. Consequently, the trajectories in MADDPG exhibit three sharp twists. Although the spiral chasing method employed by MRPM is highly efficient, it necessitates higher attack accuracy due to the higher velocities of the defenders, which are challenging to adjust in a short time. If an intruder successfully avoids an attack due to low accuracy, it results in significant time wastage in recapturing the escaping intruder. For instance, defender 2 in MRPM-we engages in a prolonged confrontation with intruders 4 and 5, further demonstrating the fast and accurate interception capabilities of MRPM, where intruders have limited chances to escape. Lastly, DE initially sends defenders to approach and delay intruders. However, the defenders lack the ability to turn around, indicating that the training generation is far from sufficient for DE. This further validates the necessity of policy gradients in achieving effective defense strategies.

Representative MRPM-II episodes are depicted in Fig. 9. In MRPM-uc, defenders consistently reduce their distances to intruders, resulting in damage to intruders 5 and 7. However, due to the absence of event incentives, the defenders fail to execute accurate and efficient actions for the final attack. As a result, intruders 4, 6, and 8 reach the region. In MRPM-ur, defenders patrol around the region, waiting to collide with intruders. However, they cannot swiftly approach intruders due to the lack of the change reward. Consequently, intruders 6–8 successfully reach the region. In MRPM-u, defenders are capable of approaching intruders before the turbulence caused by events occurs. However, they may struggle to execute normal actions once an event takes place. For instance, defender 1 loses its interception ability after intruder 5 escapes. Defender 2 oscillates between intruders 4 and 8, failing to sustain pursuit once intruder 8 arrives. Defender 3 gets trapped in chaotic movements after colliding with intruder 6. As a result, MRPM-u is limited to environments with low levels of nonstationarity, where events have minimal impact on the effectiveness of defender actions. Finally, by comparing these variants, it becomes evident that MRPM exhibits superiority over the alternative approaches.

Conclusions and future works

This paper first develops a multi-agent protection environment, MRPE, featuring fewer defenders, defender damages, and intruder evasion strategies targeting defenders. MRPE is designed to be more practical but also poses challenges for traditional protection methods due to its high nonstationarity and limited interception time window. To address these challenges, the corresponding protection method MRPM is proposed by combining DE and MADDPG. DE facilitates diverse sample exploration and overcomes sparse rewards, while MADDPG trains defenders and expedites the DE convergence process. Subsequently, an elite selection strategy tailored for multi-agent systems is devised to enhance defender collaboration. Besides, the fitness and reward functions are ingeniously designed to effectively drive policy optimizations. Finally, extensive numerical simulations are conducted to assess the effectiveness of MRPM. These simulations encompass two MRPE cases and involve a comprehensive comparison between MRPM and other approaches, including MADDPG, DE, and MRPM without the elite selection strategy. The investigation also delves into the influence of different reward schemes on the outcomes. The results unequivocally highlight a substantial enhancement in collaborative defense success achieved by MRPM compared to the other considered methods.

Our proposed method aims at steering the autonomous decision-making of unmanned system swarms, notably unmanned surface vessels (USVs). However, it is worth acknowledging that the underactuation dynamics inherent to USVs may limit the feasibility of certain actions. To address this challenge, our future research will incorporate specific reward terms to handle the constraints of underactuation dynamics. Furthermore, our proposed method effectively handles scenarios involving a small number of agents. However, as the number of agents increases, the curse of dimensionality becomes a significant concern. In future research, we intend to develop an encoder method for state and action features to mitigate this dimensionality issue.