Introduction

Leveraging agent-based unmanned systems offers numerous advantages [1], especially in reducing the risk of human casualties and executing missions cost-effectively. As a typical task, the protection of specific regions or assets has garnered significant attention lately [2]. Lowe et al. [3] study predator cooperation to prevent prey from reaching their food source. Raboin et al. [4] appoint a team of unmanned surface vehicles to guard an asset in an environment featuring hostile boats and civilian traffic. Wang et al. [5] collaboratively guide multiple missiles to prevent a ship from reaching a target. Meng et al. [6] achieve cooperative harbor protection using autonomous underwater vehicles. Yu et al. [7] dispatch drones to intercept hostile drones and safeguard a military base. However, the diverse range of practical applications in region protection research results in varying research priorities. For instance, some studies focus on addressing underactuated dynamics control in unmanned systems [8] while simplifying the task to point protection rather than safeguarding entire areas. Furthermore, previous research has often overlooked critical factors such as the asymmetric capabilities of defenders and intruders, as well as collision damage among allied agents. This limitation diminishes the practicality and real-world applicability of existing approaches. Thus, this paper presents a more practical multi-agent region protection environment (MRPE) featuring fewer defenders, defender damages, and intruder evasion strategies targeting defenders.

The key to successful protection lies in advanced autonomous decision-making capabilities [9,10,11], and there are three main types of decision-making techniques [12]: game-based [13], optimization-based [14], and learning-based [15]. Game-based methods employ game theory to model protection processes and find equilibria such as Nash or Stackelberg equilibria. They are effective at capturing precise solutions [16], but they often require prior knowledge of the opponent strategies [17], thus limiting their applicability in real-world scenarios [18]. Optimization-based methods focus on finding optimal actions for task planning problems [19, 20] and usually adopt evolutionary algorithms or other metaheuristics, such as differential evolution (DE) [21], firefly algorithms [22], and particle swarm optimization [23]. Metaheuristics [24] excel in decision space exploration and have stable convergence properties. However, they encounter challenges in real-time decision-making [25, 26]. Learning-based methods [27], specifically deep reinforcement learning [28], have shown impressive performance in diverse domains, including Atari video games, Go, and StarCraft [29,30,31]. These methods do not rely on dynamic models or prior knowledge and show good application potential in region protection [32, 33]. In recent developments, deep reinforcement learning has yielded model-free, multi-dimensional, constrained, and stochastic optimization algorithms, which can produce instant multi-dimensional optimal policies [34]. However, the system in MRPE switches constantly due to agent damages, resulting in dynamic and multimodal rewards. Consequently, learning-based methods may encounter convergence challenges in MRPE. How to achieve rapid decision-making in a highly nonstationary environment thus becomes the core issue of this research.

Fortunately, Evolutionary Reinforcement Learning (ERL) shows promise in addressing the challenges encountered by optimization- and learning-based methods [35]. ERL integrates evolutionary algorithms, for global exploration and stable convergence, with learning-based approaches that train agents for real-time decision-making. By combining evolutionary algorithms and Deep Deterministic Policy Gradient (DDPG) [36], Khadka and Tumer [37] propose an ERL method for a single agent. This ERL approach demonstrates significant performance improvements over DDPG across all benchmarks. Nugroho et al. [38] apply an ERL approach that combines genetic algorithms and DDPG to powered descent guidance landing problems. In their experiments, agents trained using ERL achieve the highest fitness scores in extensive Monte-Carlo simulations. However, the aforementioned ERL methods are primarily designed for single-agent decision-making tasks [39]. When extended to multi-agent cooperation scenarios, they encounter the challenge of effective collaboration [40]. Firstly, agents are unable to exchange experiences or share information with their counterparts during the training process. To address this limitation, Lowe et al. [3] develop a centralized training and decentralized execution framework, successfully extending DDPG to Multi-Agent DDPG (MADDPG). Owing to its outstanding performance, MADDPG has become one of the most prominent methods in multi-agent reinforcement learning [41, 42]. Unfortunately, there are scarcely any multi-agent ERL methods that embrace the excellent training framework of MADDPG. Therefore, in this work, we utilize MADDPG to address this research gap. Secondly, credit assignment for multiple agents is essential for effective coordination, involving the accurate decomposition of team objectives into individual rewards [43, 44]. Credit assignment is quite challenging, akin to attributing individual contributions in a football match based solely on the score [45]. Furthermore, there is often a conflict of interest between agent teams and their members [46]. For example, a defender may receive a reward for approaching its closest intruder in MRPE; however, if all defenders prioritize the same intruder, the region may be left vulnerable to the other intruders. Thus, we propose an elite selection procedure to mitigate this conflict.

Fig. 1 State transitions within one episode of MRPE can be understood through two typical scenarios. In the initial scene (left scene), blue circular defenders aim to prevent red rectangular intruders from reaching the yellow region. In the right scene, four events in MRPE are depicted

After determining the algorithm framework, the configuration of elements within ERL plays a pivotal role in incentivizing policy optimizations [47], with a particular emphasis on the fitness and reward functions. Training processes commonly confront challenges associated with sparse rewards, wherein agents receive reinforcement feedback only upon achieving specific milestones or completing predefined objectives. This scarcity of rewards can complicate credit assignment, making it difficult for agents to discern the impact of their actions on the overall task. As a consequence, designing precise instant rewards poses an enduring challenge in reinforcement learning. Fortunately, the incorporation of fitness within ERL partially alleviates these challenges: owing to the stable convergence of evolutionary algorithms, the fitness can be calculated directly from sparse MRPE results, allowing for a more precise evaluation of team performance. Regarding the reward functions, existing studies in region protection [48,49,50,51] often utilize distance-based shaped rewards, including positive intruder-region distances and negative defender-intruder distances. Additionally, they typically add a large value to the final reward when an episode is completed. These reward structures incentivize defenders to pursue intruders and prevent them from entering the region. However, in MRPE, such mainstream rewards change abruptly and frequently during operation, leading to significant training challenges. Thus, we redesign the reward functions from three aspects to improve performance, ultimately achieving significantly enhanced success rates in extensive MRPE simulations.

In summary, this work has the following contributions:

  1. Developing a multi-agent region protection method (MRPM) through ERL modifications, significantly improving success rates of autonomous protection in highly nonstationary environments with fewer defenders.

  2. Proposing an elite selection procedure in MRPM to resolve conflicts between a defender team and its members, thereby enhancing coordination in multi-agent systems.

  3. Designing fitness and reward functions for MRPM to effectively drive policy optimizations, and verifying the effectiveness of MRPM by numerical simulations.

The remainder of this paper is organized as follows. “A challenging multi-agent region protection scene” constructs the challenging region protection environment MRPE. “Protection method combining DE and MADDPG” develops the multi-agent protection method MRPM by combining evolutionary optimization and MADDPG. “Numerical simulations” conducts numerical simulations and comparisons to substantiate the effectiveness of the proposed learning method. Conclusions are drawn and future works are presented in “Conclusions and future works”.

A challenging multi-agent region protection scene

Throughout the paper, \(\Vert *\Vert \) is the 2-norm of a vector \(*\), and \(*^{\top } \) represents the transpose of a matrix \(*\). \(\rho _{i}^{t}\), \(v_{i}^{t}\), and \(a _{i}^{t}\) are coordinate, velocity, and acceleration of agent i in x- and y- directions at moment t, respectively. \( *_{i,j}^t\) denotes the relative value of \(*\) between agents i and j at a temporal instant t.

In Fig. 1, we develop a more practical multi-agent region protection environment, namely MRPE, in which two opposing agent types are considered, i.e., defenders \(\{\text {def}\}\) and intruders \(\{\text {int}\}\). The defender team comprises \(N_{\text {def}}\) agents, while the intruder team consists of \(N_{\text {int}}\) agents. MRPE is confined within a square area with a side length of \(b_c\). The primary objective of the defenders is to prevent the intruders from accessing a central circular region c, characterized by the coordinates \(\rho _c\) and a radius of \(d_c\). To ensure generality, agents are randomly placed at the beginning of an MRPE episode: defenders are located near the center, while intruders are positioned at the edge of MRPE, i.e., \( 0< \Vert \rho _{i}^{0} -\rho _c \Vert< 0.3 b_c, i \in {\{\text {def}\}}\), and \(0.8 b_c< \Vert \rho _{j}^{0} -\rho _c \Vert < b_c, j \in \{\text {int}\}\). The initial velocities and accelerations of all agents are set to zero. All agents follow the kinematics and constraints:

$$\begin{aligned} {\dot{\rho }}_i^t = v_i^t&,&{\dot{v}}_i^t = a_i^t, \\ \nonumber \left\| v_{i}^t \right\| \le v_{i}^{\max }&,&\left\| a_{i}^t \right\| \le a_{i}^{\max }. \end{aligned}$$
(1)

Damages are introduced to improve practicality. Specifically, the defenders are designed to be stronger than the intruders: an intruder is considered damaged if it collides with any other agent, whereas a defender is considered damaged only when it collides with a companion defender. A series of flags are introduced to record agent states; for example, \(\text {Dam}_i = 1\) indicates that agent i is damaged, and \(\text {Dam}_i = 0\) otherwise. Once an agent is damaged (\(\text {Dam}_i = 1\)) or reaches the region c (\(\text {Arr}_i = 1\)), it is marked as done (\(\text {Run}_i = 0\)) and eliminated from MRPE.
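To make the environment dynamics concrete, the following Python sketch shows how one MRPE step could integrate Eq. (1) under the saturation constraints and then update the damage and done flags described above. The Euler step `dt`, the collision radius `d_col`, and the dictionary-based agent representation are illustrative assumptions, not details given in the paper.

```python
import numpy as np

def clip_norm(vec, max_norm):
    """Scale a 2D vector so its 2-norm does not exceed max_norm."""
    n = np.linalg.norm(vec)
    return vec if n <= max_norm else vec * (max_norm / n)

def step_agent(rho, v, a, v_max, a_max, dt=0.1):
    """One Euler step of the kinematics in Eq. (1) with saturation constraints."""
    a = clip_norm(a, a_max)
    v = clip_norm(v + a * dt, v_max)
    rho = rho + v * dt
    return rho, v

def update_flags(agents, rho_c, d_c, d_col=2.0):
    """Update Dam/Arr/Run flags after a step (d_col is a hypothetical collision radius).

    agents: list of dicts with keys 'rho' (np.array), 'team' ('def'/'int'),
            'Run', 'Dam', 'Arr'.
    """
    alive = [ag for ag in agents if ag['Run'] == 1]
    for idx, ai in enumerate(alive):
        for aj in alive[idx + 1:]:
            if np.linalg.norm(ai['rho'] - aj['rho']) <= d_col:
                for a, b in ((ai, aj), (aj, ai)):
                    # Intruders are damaged by any collision;
                    # defenders only by companion defenders.
                    if a['team'] == 'int' or (a['team'] == 'def' and b['team'] == 'def'):
                        a['Dam'] = 1
    for ag in alive:
        if ag['team'] == 'int' and np.linalg.norm(ag['rho'] - rho_c) <= d_c:
            ag['Arr'] = 1
        if ag['Dam'] == 1 or ag['Arr'] == 1:
            ag['Run'] = 0  # marked as done and eliminated from MRPE
```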

An episode in the protection scenario begins at \(t = 0\) and is completed when all the intruders are marked as done or when the given time T is exhausted. The mission of the defenders is to successfully protect the region. We first provide the following definition.

Definition 1

(Success or failure of the protection): The region protection is successful (\( \text {Succ} =1\)) if and only if all the defenders \( \{ \text {def}\}\) are alive (\( \text {Run}_i = 1, \forall i\in \{ {\text {def}}\} \)) and all the intruders \( \{ \textrm{int}\}\) are damaged (\(\text {Dam}_j= 1, \forall j \in \{ \text {int}\}\)) within the given time T. Otherwise, the protection fails (\( \text {Succ} =0\)).

According to Definition 1, designing effective defense rules is more complex than designing intrusion rules. Consequently, this research utilizes control rules for the intruders and introduces a learning-based protection approach for the defenders. Essentially, this work stages a confrontation between automatic intruders and autonomous defenders.

Intruder strategy: To achieve a practical MRPM, a smart opponent is necessary. Instead of straight or random movements, we design an intruder strategy comprising region attraction and collision avoidance, so that intruders can swiftly advance toward the protected area while avoiding interception and collisions. Firstly, a neighbor set \( {{\textbf{Inei}}_i^ t} = \{ j|\left\| {\rho _{i,j}^{t} } \right\| \le {d}, j\in {\{ \text {def} \} \cup \{ \text {int} \}}, j\ne i \} \) is defined to collect the nearby agents, where \(d > 0\) is a preset sensing distance. Secondly, as Eq. (2) shows, the region attraction \(\textbf{F}_{ic}\) is executed when \( {{\textbf{Inei}}_i^ t} \) is empty. Otherwise, collision avoidance, inspired by flocking control [52, 53], generates repulsive forces through an artificial potential function \(\varphi _{ij}\).

$$\begin{aligned} \textbf{n}_{ij}&= \rho _{i,j}^{t}/\left\| \rho _{i,j}^{t} \right\| , \quad \textbf{v}_{ij} = v_{i,j}^{t}/\left\| v_{i,j}^{t} \right\| ,\\ \varphi _{ij}&= -\frac{1}{2}\left[ \left( 1 - \left\| \rho _{i,j}^{t} \right\| /d_{\text {nei}} \right) ^2 + \left( \min \left\{ \cos \langle \textbf{n}_{ij}, \textbf{v}_{ij} \rangle , 0 \right\} \right) ^2 \right] ,\\ \textbf{F}_{ic}&= \left( k_\rho \cdot \rho _{i,c}^t + k_v \cdot v _{i,c}^t \right) /\left\| k_\rho \cdot \rho _{i,c}^t + k_v \cdot v _{i,c}^t \right\| ,\\ a_{i}^{t}&= {\left\{ \begin{array}{ll} a_{i}^{\max }\,\textbf{F}_{ic}, &{} \text {if } {\textbf{Inei}}_i^{t} = \emptyset ,\\ \sum \limits _{j \in {\textbf{Inei}}_i^{t}} \varphi _{ij} \cdot \textbf{n}_{ij}/N_{\textbf{Inei}}, &{} \text {if } {\textbf{Inei}}_i^{t} \ne \emptyset , \end{array}\right. } \end{aligned}$$
(2)

where \(\textbf{n}_{ij}, \textbf{v}_{ij}\) are the unit vectors of the relative position and relative velocity between agents i and j, \(\textbf{F}_{ic}\) is the unit vector of the attractive acceleration exerted by the central region, and \(k_\rho , k_v \) are control parameters, set to 1 and 3, respectively.
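The intruder strategy of Eq. (2) can be sketched as follows (Python). The sign conventions for the relative quantities \(\rho _{i,c}^t, v_{i,c}^t, \rho _{i,j}^t\) and the value of the sensing distance d are assumptions; here \(d_{\text {nei}}\) is taken equal to d, and the relative quantities are oriented so that \(\textbf{F}_{ic}\) points toward the protected region.

```python
import numpy as np

def intruder_acceleration(i, agents, rho_c, a_max, d=20.0, k_rho=1.0, k_v=3.0):
    """Eq. (2): region attraction when no agent is nearby, otherwise repulsive
    collision avoidance built from the potential phi_ij.

    agents: dict id -> {'rho': np.array(2), 'v': np.array(2), 'Run': 0/1}.
    """
    me = agents[i]
    nei = [j for j, ag in agents.items()
           if j != i and ag['Run'] == 1
           and np.linalg.norm(me['rho'] - ag['rho']) <= d]
    if not nei:
        # Region attraction F_ic: relative position/velocity taken as (c minus i)
        # so the unit vector points toward the protected center rho_c (assumption).
        f = k_rho * (rho_c - me['rho']) + k_v * (0.0 - me['v'])
        return a_max * f / (np.linalg.norm(f) + 1e-8)
    acc = np.zeros(2)
    for j in nei:
        rho_ij = me['rho'] - agents[j]['rho']   # relative position, i minus j (assumption)
        v_ij = me['v'] - agents[j]['v']
        n_ij = rho_ij / (np.linalg.norm(rho_ij) + 1e-8)
        v_hat = v_ij / (np.linalg.norm(v_ij) + 1e-8)
        cos_nv = float(np.dot(n_ij, v_hat))
        phi = -0.5 * ((1 - np.linalg.norm(rho_ij) / d) ** 2 + min(cos_nv, 0.0) ** 2)
        acc += phi * n_ij
    return acc / len(nei)
```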

Defender modeling: To introduce learning-based protection approaches for defenders, the protection process is approximated as a multi-agent extension of Markov decision processes [54], where each defender is described by a tuple \((S, O_{i}, A_{i}, R_{i})\). In MRPE, defenders cooperate with their partners and compete with the intruders. Thus, the global state space S describes the possible properties of all defenders. \(O_{i}\) denotes the local observations of defender i, and each defender receives an individual observation correlated with the state, \(S \longmapsto O_{i}\). \(A_{i}\) represents the action space, which contains all actions of the agent. \(S\times A_{i} \longmapsto R_{i}\) is a reward function assigning an extrinsic reward \(r_{i}\) for taking an action \(a_i\) under state \(s_{i}\).

As shown in Fig. 1, defender i generates an action \(a_i^t\) through its policy \(\pi _i\) and the local observation \(o_i^t, o_i^t \in O_i\) for an interaction with MRPE, i.e.,

$$\begin{aligned} a_i^t = \pi _i \left( o_i^t \right) . \end{aligned}$$
(3)

Thereby, defender i transitions to the next state \(s_i^{t+1}\) and receives a reward \(r_i^t\) before the next interaction with MRPE. A series of state transition tuples \((s_i^t, a_i^t, r_i^t, s_i^{t+1})\) of defender i are stored in an experience replay buffer D. Based on samples from the replay buffer D, the goal of each defender is to optimize its policy \(\pi _i\) so that the generated actions \(a_i^t\) maximize the long-term discounted cumulative reward \(R_i^t\) [44], i.e.,

$$\begin{aligned} {R_i^t} = r_i^{t+1}+\gamma r_i^{t+2} +\gamma ^2 r_i^{t+3}+ \cdots = \mathop \sum \limits _{k \in [0, T-t]} \gamma ^k r_i^{t+k+1}, \end{aligned}$$
(4)

where \(\gamma ( \in [0, 1])\) denotes a discount factor. To obtain the optimum policy \(\pi _i^*\), the state value function \({E_\pi } _i (R_i^t | s_i)\) is adopted as the objective function, which is the expectation of the long-term discounted cumulative reward \(R_i^t\) of defender i with the state \(s_i\).
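As a small illustration (not from the paper), the discounted return of Eq. (4) can be accumulated backward over a finished episode:

```python
def discounted_returns(rewards, gamma=0.95):
    """Compute R_i^t of Eq. (4) for every step t of one episode (rewards[t] = r_i^{t+1})."""
    returns, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example with three steps: the first return is 1.0 + 0.95*0.0 + 0.95**2*2.0
print(discounted_returns([1.0, 0.0, 2.0]))
```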

Now, we are ready to develop the main technical problem for this paper.

Problem 1: For each defender governed by Eq. (1), how to design an algorithm to obtain the optimal action policy \(\pi _i^*\), by maximizing the expectation of the cumulative reward \({E_\pi } _i (R_i^t | s_i)\), i.e.,

$$\begin{aligned} \pi _i^* = \mathop {\textrm{argmax}} \limits _{a_i^t = {\pi }_i(o_i^t) } {{E_\pi } _i \left( R_i^t | s_i \right) }. \end{aligned}$$
(5)

It is notable that MRPE introduces several challenges in solving Problem 1. Firstly, defenders face the daunting task of protecting an entire area, which is often larger than in typical scenes, while intruders only need to target a single point. This asymmetry in the protection task increases the difficulty for defenders. Secondly, the number of intruders exceeds that of defenders, and intruders possess avoidance abilities that make interception challenging. This leads to a limited interception time window for defenders, requiring highly efficient policies \(\pi _i\) to counter the intruders. Finally, both defenders and intruders can suffer damages during the region protection process, adding a dynamic and nonstationary aspect to MRPE.

Protection method combining DE and MADDPG

In this section, we propose a region protection method, MRPM, to address Problem 1 under these challenges. Firstly, we utilize an actor neural network \({\pi _\theta }_i(o_i)\), parameterized by \(\theta _i\), to approximate a defender policy \(\pi _i\). Thus, the optimization of \(\theta _i\) plays a crucial role in effectively tackling Problem 1. To enhance this optimization, we combine two distinct algorithms: a gradient-free optimization technique, DE, and a gradient-based approach, MADDPG. DE is employed to generate a diverse set of \(\theta _i\), helping overcome the weak convergence of MADDPG. Meanwhile, MADDPG updates \(\theta _i\) and accelerates local optimization in DE. Considering the numerous parameters involved, Table 1 provides separate introductions for parameters related to \(\theta _i\) in DE and MADDPG for improved clarity. Subsequently, an elite selection strategy is proposed for MRPM to resolve the goal conflict between the defender team and its members. Finally, specific reinforcement learning elements are designed to handle the MRPE challenges, including the action space, the state space, and the fitness and reward functions.

Table 1 Parameters related to \(\theta _i\) in MRPM

Elementary DE

Elementary DE consists of four processes: Initialization, Fitness Evaluation, Mutation, and Crossover. DE is a population-based algorithm that maintains a population matrix \(\textrm{POP}_g\) at every generation g, i.e., \(\text {POP}_g = \{x_1^{\top }, \cdots , x_i^{\top }, \cdots , x_{n_p}^{\top }\}^{\top } \), which consists of \(n_p\) individuals, where \(x_i\) denotes the ith individual. Firstly, in Initialization, DE randomly generates the nascent generation \(\text {POP}_0\) by

$$\begin{aligned} x_{i, j}=x_j^L+{\text {rand}}(0,1) \cdot \left( x_j^U-x_j^L\right) , \end{aligned}$$
(6)

where \(x_{i, j}\) is the jth element of individual \(x_i\). The symbols \(x_j^U\) and \(x_j^L\) denote the upper and the lower bounds of the jth element, respectively. Thereafter, Mutation and Crossover are adopted to generate a new population \(\text {POP}_g\) at every generation. Specifically, three individuals \(x_{r_1}\), \(x_{r_2}\), and \(x_{r_3}\) are randomly selected from \(\textrm{POP}_{g-1}\) to generate mutations by

$$\begin{aligned} {{\widehat{x}}}_i = x_{r_3} + {F_s}(x_{r_2} - x_{r_1}), i \ne r_1 \ne r_2 \ne r_3 \in [1, n_p], \end{aligned}$$
(7)

where \({{\widehat{x}}}_i\) is the ith mutation individual, and \(F_s (\in [0, 2])\) is the scaling factor. Subsequently, Crossover is employed to increase the diversity of solutions, where the original and the mutation individuals are chosen randomly according to the crossover probability \(C_r (\in [0, 1])\),

$$\begin{aligned} {{\widetilde{x}}}_i = \left\{ \begin{array}{ll} {{\widehat{x}}}_i, &{} {\text {rand}}_i \le C_r \ \text {or} \ i = j_{\text {rand}},\\ x_i, &{} \text {otherwise}, \end{array} \right. \end{aligned}$$
(8)

where \({{\widetilde{x}}}_i\) is the ith offspring individual, \({\text {rand}}_i (\in [0,1])\) is a uniformly distributed random value, and \(j_{\text {rand}} (\in [1,n_p])\) is a random integer. Finally, Fitness Evaluation is carried out to select the best individuals from the offspring.
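A minimal Python sketch of these DE operators is given below. Crossover is decided at the individual level, as Eq. (8) is written (the element-wise binomial crossover is the more common variant), and the one-to-one parent/offspring comparison in `de_select` is the standard DE selection rule assumed here; the `fitness` callable is assumed to map a population array to a vector of scores.

```python
import numpy as np

rng = np.random.default_rng(0)

def de_init(n_p, dim, x_low, x_up):
    """Initialization, Eq. (6): uniform population within [x_low, x_up]."""
    return x_low + rng.random((n_p, dim)) * (x_up - x_low)

def de_offspring(pop, F_s=0.48, C_r=0.25):
    """Mutation, Eq. (7), and crossover, Eq. (8): each offspring is either
    the whole mutant or the whole parent."""
    n_p, _ = pop.shape
    trials = pop.copy()
    j_rand = rng.integers(n_p)  # one index whose offspring is always the mutant
    for i in range(n_p):
        r1, r2, r3 = rng.choice([k for k in range(n_p) if k != i], size=3, replace=False)
        mutant = pop[r3] + F_s * (pop[r2] - pop[r1])
        if rng.random() <= C_r or i == j_rand:
            trials[i] = mutant
    return trials

def de_select(pop, trials, fitness):
    """Fitness Evaluation: keep whichever of parent/offspring scores higher
    (maximization); fitness maps an (n_p, dim) array to an (n_p,) vector."""
    keep = fitness(trials) >= fitness(pop)
    return np.where(keep[:, None], trials, pop)
```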

Elementary MADDPG

MADDPG is a reinforcement learning method with an actor-critic framework. Specifically, since the cumulative reward \(R_i^t\) in Eq. (4) is hard to compute directly, a critic neural network \(Q_i^\pi (s, a)\) parameterized by \( \varepsilon _i\) is adopted to approximate \(R_i^t\) for each defender. Meanwhile, an actor network \({\pi _\theta }_i(o_i^t)\) parameterized by \(\theta _i\) is trained to approximate the policy \(\pi _i\). The actor network \({\pi _\theta }_i\) chooses actions \(a_i^t\) according to the local observation \(o_i^t\), whereas the critic network \(Q_i^\pi \) produces an approximate value \({\widehat{Q}}_i\) of \(R_i^t\), so as to evaluate the action decided by the actor network. Besides, the critic networks require co-agent information (states s and actions a) to handle the nonstationarity of MRPE.

To solve Problem 1, the policy gradient \({\nabla _{{\theta _i}}}J({\theta _i})\) [3] is firstly derived as:

$$\begin{aligned} {\nabla _{{\theta _i}}}J({\theta _i}) = E \left[ {\nabla _{{\theta _i}}}{{\pi _\theta } _i}({a_i}|{o_i}) {\nabla _{{{\pi _\theta } _i}}}Q_i^\pi \left( s,{a_{1,\ldots ,N}}{|_{{a_i} = {\pi _i}({o_i})}} \right) \right] .\nonumber \\ \end{aligned}$$
(9)

To approximate the cumulative reward \(R_i^t\), the goal of the critic network \(Q_i^\pi \) is to minimize the squared loss \(L({\varepsilon _i}) \) between the critic network output \({\widehat{Q}}_i\) and \(R_i^t\), i.e., \(L({\varepsilon _i}) =E[ ({\widehat{Q}}_i - R_i)^2 ] \), minimized over \(\varepsilon _i\). Thus, the gradient of the critic network \({\nabla _{{\varepsilon _i}}}L({\varepsilon _i})\) is calculated by

$$\begin{aligned} {\nabla _{{\varepsilon _i}}}L({\varepsilon _i}) = E\left[ \left( Q_i^\pi (s,{a_{1,\ldots ,N}}) - y \right) \cdot {\nabla _{{\varepsilon _i}}}Q_i^\pi \left( s,{a_{1,\ldots ,N}}\right) \right] , \end{aligned}$$
(10)

where \( y = r_i^t+ \gamma Q{_i^{\pi ^ \prime } (s^{t+1},a^{t+1}_{1,\ldots ,N}|_{a^{t+1}_i = {\pi }_i'(o_i)})}\) is an approximate target value calculated by temporal difference (TD), and \( Q_i^{\pi ^ \prime }, \pi _i ^ \prime \) represent the target networks with delayed parameters \( {\varepsilon _i} ^ \prime , {\theta _i}^ \prime \). Sampling transitions \((s_i^t, a_i^t, r_i^t, s_i^{t+1})\) from the replay buffer D, the network parameters \(\theta _i, \varepsilon _i\) are iteratively updated by \(\theta _{i}^{k+1}= \theta _{i}^k+\alpha _c {\nabla _{{\theta _i}}}J({\theta _i}), \varepsilon _{i}^{k+1}= \varepsilon _{i}^k-\alpha _c \nabla L({\varepsilon _{i}})\). Since MADDPG is a gradient-based optimization method, defender policies are apt to fall into local optima, failing to intercept all the intruders.
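For illustration, a compact PyTorch-style sketch of one centralized-critic update for defender i is given below. The batch layout, network containers, and optimizer handling are assumptions made for the sketch, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def maddpg_update(i, batch, actors, critics, target_actors, target_critics,
                  actor_opts, critic_opts, gamma=0.95):
    """One centralized-critic update for defender i (cf. Eqs. (9)-(10)).

    batch: dict with 'state', 'next_state' ([B, state_dim]), 'rew_i' ([B, 1]),
           and per-agent lists 'obs', 'next_obs' ([B, obs_dim]) and 'acts' ([B, act_dim]).
    """
    # Critic step: regress Q_i(s, a_1..N) onto the TD target y built from target networks.
    with torch.no_grad():
        next_acts = [ta(o) for ta, o in zip(target_actors, batch['next_obs'])]
        y = batch['rew_i'] + gamma * target_critics[i](
            torch.cat([batch['next_state']] + next_acts, dim=-1))
    q = critics[i](torch.cat([batch['state']] + batch['acts'], dim=-1))
    critic_loss = F.mse_loss(q, y)
    critic_opts[i].zero_grad()
    critic_loss.backward()
    critic_opts[i].step()

    # Actor step: ascend the policy gradient of Eq. (9) by replacing agent i's stored
    # action with the one produced by its current actor and maximizing the critic value.
    acts = [a.detach() for a in batch['acts']]
    acts[i] = actors[i](batch['obs'][i])
    actor_loss = -critics[i](torch.cat([batch['state']] + acts, dim=-1)).mean()
    actor_opts[i].zero_grad()
    actor_loss.backward()
    actor_opts[i].step()
```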

Fig. 2 The MRPM framework combines DE for global exploration with MADDPG for defender training. In the diagram, DE-related operators are denoted in blue, MADDPG in orange, and MRPE in green. Numerical labels signify the execution sequence of these operators within the framework

The framework of the proposed MRPM

The MRPM framework is presented in Fig. 2, which integrates DE for global exploration and employs MADDPG to train defenders. Operators in DE, MADDPG, and MRPE are color-coded blue, orange, and green, respectively. Besides, numerical labels indicate the execution sequence of the operators. An overview of these operators follows:

Initialization: At the beginning, a DE population \({\textrm{POP}_g}\) is initialized at the first generation (\(g = 0\)) by Eq. (6), where each individual \(x_i\) in \(\text {POP}_g\) represents the actor network parameters of a defender team, i.e., \(x_i = \{\theta _1^{\top }, \theta _2^{\top }, \cdots ,\theta _{N_{\text {def}}}^{\top }\}\). Additionally, a recorder denoted as \(x_{PG} = \{\theta _1^{\top }, \theta _2^{\top }, \cdots ,\theta _{N_{\text {def}}}^{\top }\}\) is randomly initialized, which stores the actor parameters of the defender team trained by MADDPG. Subsequently, generation loops are executed, comprising operators 1–8 in Fig. 2.

Interaction with MRPE (DE \(\rightarrow \) MADDPG): In operators 1–4, each defender team \(x_i\) in \(\text {POP}_g\) continuously interacts with MRPE over a whole episode for fitness evaluation; detailed explanations of the fitness evaluation are provided in the following section. Meanwhile, during operator 2, each interaction with MRPE generates a state transition \((s_i^t, a_i^t, r_i^t, s_i^{t+1})\), which is stored in the replay buffer D, as illustrated in the orange operator 3. Thus, by passing state transitions, this procedure can be regarded as an interface from DE to MADDPG.

MADDPG training: In orange operator 4, MADDPG updates \(x_{PG}\) by sampling from the replay buffer D. This update is carried out using policy gradients defined in Eq. (9), specifically: \(\theta _{i}^{k+1}= \theta _{i}^k+\alpha _c {\nabla _{{\theta _i}}}J({\theta _i})\), facilitating the training of defender teams.

Elite selection (MADDPG \(\rightarrow \) DE): In the green operators 5–6, elite selection is proposed to create an elite population \(\text {POP}_g^e\) by combining \(\text {POP}_g\) from DE and \(x_{PG}\) from MADDPG. Specifically, each individual \(x_i\) in \(\textrm{POP}_g\) is assessed based on two fitness values: the team fitness \(F_T\) and the member fitness \(F_m\). These fitness values capture the distinct objectives of defender teams and their individual members, providing essential information for the selection process. The individuals in \(\text {POP}_g\) are ranked in descending order of team fitness \(F_T\), forming a preliminary \(\text {POP}_g^e\). Additionally, a defender actor \(\theta _i\) with the highest member fitness \(F_m\), marked as \(\text {Max}_i\), is selected to create a team of elite members \(x_e\). This team replaces the penultimate individual \(x_{-2}\) in \(\text {POP}_g^e\). Finally, the worst individual \(x_{-1}\) is replaced with the MADDPG team \(x_{PG}\). The elite selection process completes \(\text {POP}_g^e\), which is directly assigned to the new generation \(\text {POP}_{g+1}\). Similarly, by passing \(x_{PG}\) to \(\text {POP}_{g+1}\), this procedure serves as an interface from MADDPG to DE.
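A plain-Python sketch of this elite selection step is given below, assuming team and member fitness values have already been evaluated for the current population; the per-slot choice of the highest-\(F_m\) actor is our reading of how \(x_e\) is assembled.

```python
def elite_selection(pop, team_fitness, member_fitness, x_pg):
    """Elite selection sketch.

    pop: list of teams, each a list of per-defender actor parameters theta_i.
    team_fitness[k]: F_T of pop[k]; member_fitness[k][i]: F_m of defender i in pop[k].
    x_pg: the MADDPG-trained team.
    """
    # Rank teams by descending team fitness F_T.
    order = sorted(range(len(pop)), key=lambda k: team_fitness[k], reverse=True)
    elite = [pop[k] for k in order]
    # Build a team of elite members: for each defender slot, take the actor with the
    # highest member fitness F_m across the whole population (assumed interpretation).
    n_def = len(pop[0])
    x_e = []
    for i in range(n_def):
        best_k = max(range(len(pop)), key=lambda k: member_fitness[k][i])
        x_e.append(pop[best_k][i])
    elite[-2] = x_e    # replace the penultimate individual with the elite-member team
    elite[-1] = x_pg   # replace the worst individual with the MADDPG team
    return elite       # becomes POP_{g+1}
```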

Actor evolution: In the blue operators 7–8, a new actor population \(\text {POP}_{g+1}\) is generated from \(\text {POP}_g^e\) using the DE operators described in Eqs. (7)–(8). Consequently, the actor population \(\text {POP}_g\) evolves over the generation loops.

Optimum team actor selection: Finally, the MADDPG recorder \(x_{PG}\) is also continuously updated over the generation loops. The optimum team actor \(x_i^*\) can then be selected from \(\textrm{POP}_g\) and \(x_{PG}\).

The proposed elements in MRPM

To enhance the performance of MRPM, meticulous design of its elements is particularly important. These elements encompass the action space \(A_i\), state \(s_i^t\), the fitness measures (\(F_T, F_m\)), and the reward function \(r_i^t\). The specific techniques are explained as follows.

Action space with policy ensembles: The design of the action space demands both comprehensiveness and efficiency. In accordance with the dynamic model specified in Eq. (1), the action space \(A_i\) for defender i is defined to include five directions of acceleration: up \(a_\uparrow \), down \(a_\downarrow \), left \(a_\leftarrow \), right \(a_\rightarrow \), and stop \(a_s\). Each acceleration value is bounded within the range [0, 1]. The final action \(a_i^t\) is represented as \(a_i^t = (a_\rightarrow - a_\leftarrow , a_\uparrow - a_\downarrow )\).

State space with two moment properties: It has been found that incorporating sequential states aids decision-making. Consequently, the state \(s_i\) is defined based on the properties of the current and previous moments, represented as \(s_i^t = [p_i^{t-1}, p_i^t]\). Each moment property \(p_i^t\) includes the protection situation of defender i, denoted as \(\chi _i^t = {[\rho _{i}^{t},v_{i}^{t},\text {Run}_i]}\), as well as the relative situations between defender i and the other agents, denoted as \( \chi _{i,j}^t ={[\rho _{i,j}^t,v_{i,j}^t,\text {Run}_{i,j}^t]}, j \in {\{ \text {def}\} \cup \{ \text {int}\}}, j \ne i\).
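These two design choices can be captured in a few lines of Python; the clipping of the raw actor outputs and the flat concatenation of the two moment properties are assumptions about the exact encoding.

```python
import numpy as np

def action_from_outputs(raw):
    """Map the actor's five outputs (right, left, up, down, stop), each in [0, 1],
    to the planar acceleration command a_i^t = (a_right - a_left, a_up - a_down)."""
    a_right, a_left, a_up, a_down, _a_stop = np.clip(raw, 0.0, 1.0)
    return np.array([a_right - a_left, a_up - a_down])

def build_state(prop_prev, prop_curr):
    """Two-moment state s_i^t = [p_i^{t-1}, p_i^t], where each moment property stacks
    the defender's own situation chi_i and the relative situations chi_{i,j}."""
    return np.concatenate([prop_prev, prop_curr])
```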

Fitness function: Fitness functions serve as essential criteria for optimizing the actor parameters \(\theta _i\), regulating each defender’s behavior to achieve a common task. DE is insensitive to sparse rewards, so the fitness can be obtained after completing an MRPE episode instead of at every protection moment. The interception result can therefore be used directly as the fitness, leading to more accurate optimization of defender behavior. Specifically, a result is obtained after a Roll procedure, in which a team policy \(x_i \in \text {POP}_g\) continuously interacts with MRPE over an episode. These results can be classified into two scenarios: protection success and protection failure. For a more detailed recording, we consider metrics such as the number of damaged intruders \((N_{\text {int0}}-N_{\text {arr}})\), the count of surviving defenders \(N_{{\text {def}1}}\), and the completion time \(T_{c}\). These metrics collectively form the team fitness \(F_T\) for performance evaluation, i.e.,

$$\begin{aligned} F_T = \left\{ \begin{array}{ll} ({N_{\text {int0}}-N_{\text {arr}}} ) \cdot {10^4} + {N_{{\text {def}1}}} \cdot {10^3} + ({T} - {T_{c}}), &{} \text {if Succ} =1,\\ ({N_{\text {int0}}-N_{\text {arr}}} ) \cdot {10^4} + {N_{{\text {def}1}}} \cdot {10^3} + T_{c}, &{} \text {if Succ} =0, \end{array} \right. \end{aligned}$$
(11)

where \( {N_{\text {int0}}}\) and \({N_{\text {arr}}} \) are the numbers of done and arrived intruders, respectively, so \( ({N_{\text {int0}}-N_{\text {arr}}} )\) is the number of damaged intruders. For successful protection, a shorter completion time is more favorable, so the remaining time \(({T} - {T_{c}})\) is selected as the metric. In the case of failed protection, however, a longer completion time \(T_{c}\) is preferred, as it indicates that the defenders exert maximum effort to delay the intruders. Similarly, for a defender, let \(n_{\text {int0}}\) be the total number of intruders it intercepts, and \(T_{\text {int}}\) be the time at which it intercepts its last intruder. The member fitness \(F_m\) is then calculated by \(F_m = n_{\text {int0}}\times {10^4} + \text {Run}_i \times {10^3} + (T-T_{\text {int}})\), which encourages a defender to intercept more intruders in the least amount of time.
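Both fitness measures translate directly into code; the sketch below mirrors Eq. (11) and the \(F_m\) expression, with the weighting factors \(10^4\) and \(10^3\) taken from the text.

```python
def team_fitness(succ, n_int0, n_arr, n_def1, T, T_c):
    """Team fitness F_T of Eq. (11): damaged intruders dominate, then surviving
    defenders, then the time term (remaining time if successful, elapsed time if not)."""
    time_term = (T - T_c) if succ == 1 else T_c
    return (n_int0 - n_arr) * 1e4 + n_def1 * 1e3 + time_term

def member_fitness(n_int0_i, run_i, T, T_int_i):
    """Member fitness F_m: interceptions by defender i, its survival flag, and the
    time remaining after its last interception."""
    return n_int0_i * 1e4 + run_i * 1e3 + (T - T_int_i)
```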

Reward function: Real-time rewards are the only motivation for learning. However, accurately decomposing the team protection performance into member rewards is challenging. To overcome this, an ingenious reward function for MADDPG is designed in Eq. (12) with three terms to approximate the protection performance.

$$\begin{aligned} r_i^t = {r_{\text {usu}}}_i^t + {r_{\text {cha}}}_i^t+{r_{\text {rev}}}_i^t, \end{aligned}$$
(12)

where \( {r_{\text {usu}}}_{i}^{t}, {r_{\textrm{cha}}}_{i}^{t}, {r_{\text {rev}}}_i^t\) denote the usual, change, and revised rewards, respectively. \({r_{\text {usu}}}_i^t\) encourages defenders to pursue intruders, whereas \({r_{\text {cha}}}_i^t\) enhances this promoting effect. \({r_{\text {rev}}}_{i}^{t}\) adjusts defender behavior according to interception results. All three terms cooperatively shape the reward \(r_i^t\) to match the task objectives, so that the actor network parameters \(\theta _i\) can be prevented from falling into local optima.

The design of \({r_{\text {usu}}}_{i}^{t}\) follows a similar approach to the mainstream method [48,49,50,51], represented as:

$$\begin{aligned} {r_{\text {usu}}}_{i}^{t} = -k_\alpha \left\| {\rho _{i,{{\tilde{i}}}}^t} \right\| + k_\alpha \left\| {\rho _{{{{\tilde{i}}}},c}^t} \right\| + \sum \limits _{j \in \{\text {int}\}} {\left\| {\rho _{j,c}^t} \right\| } /N_{\text {int}} - \sum \limits _{j \in {\textbf{nei}}_i^t} {{{\left( 1 - \left\| {\rho _{i,j}^t} \right\| /{{d}}\right) }^2}}, \end{aligned}$$
(13)

where \({{\tilde{i}}}\) is the nearest intruder to defender i, and \({{\tilde{i}}}\) varies over time. An attention weight \(k_\alpha \) is applied to the terms involving \({{\tilde{i}}}\), which prevents defenders from constantly swinging among different intruders. \( {\textbf{nei}}_i^t\) is the neighbor set of defender i, i.e., \({\textbf{nei}}_i^t = \{ j \,|\, \left\| {\rho _{i,j}^t} \right\| \le {{d}}, j \in \{ \text {def}\}\}\). In Eq. (13), the first term is a negative defender-intruder distance, prompting defenders to approach their nearest intruders. The second and third terms are positive intruder-region distances, urging defenders to drive intruders away. The attention weight \(k_\alpha \) is assigned to the second term to increase the influence of the nearest intruder. The last term is built from defender-defender distances for collision avoidance, ensuring the safe navigation of defenders.

The change reward \({r_{\text {cha}}}_{i}^{t}\) is defined as the finite difference of \({r_{\text {usu}}}_{i}^{t}\) between consecutive steps, which encourages agents to make progress and penalizes them for retreating or moving away from their objectives. This reward enhances the promoting effect of \({r_{\text {usu}}}_{i}^{t}\) and is expressed as:

$$\begin{aligned} {r_{\text {cha}}}_{i}^{t} = {r_{\text {usu}}}_{i}^{t} - {r_{\text {usu}}}_{i}^{t-1}. \end{aligned}$$
(14)

However, certain events cannot be adequately captured by \({r_{\text {usu}}}_{i}^{t}\) and \({r_{\text {cha}}}_{i}^{t}\), such as agent damage or protection success. Typically, the reward \(r_{i}^{t}\) is adjusted directly by incorporating sharp increases or decreases, which introduces high nonstationarity. To mitigate the impact of such dramatic changes, we introduce the revised reward \({r_{\text {rev}}}_{i}^{t}\), which decomposes the sharp value \(S_v\) into a sequence of differences:

$$\begin{aligned} {r_{\text {rev}}}_{i}^{t}= S_v \cdot { \frac{{2 \Delta t^2}}{{{T_{e}}({T_{e}} + \Delta t)}}} \cdot t, \end{aligned}$$
(15)

where \(T_{e}\) represents the moment at which the event happens, and \( \Delta t \) is the time interval. Larger absolute values are assigned to the \({r_{\text {rev}}}_{i}^{t}\) sequence at steps closer to the event. The sharp value \(S_v\) is set according to the event: (a) \(S_v= -T_{e}\), if the intruder \({{\tilde{i}}}\) arrives, and \(S_v= -T_{e}/N_{\text {int}}\), if another intruder arrives; (b) \(S_v= 2T_{e}\), if the intruder \({{\tilde{i}}}\) is intercepted; (c) \(S_v= -2T_{e}\), if the defender is damaged; (d) \(S_v= -5T_{c}\), if Succ = 1, and \(S_v=5T_{c}\), if Succ = 0.
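The three reward terms can be sketched as follows (Python); the attention weight \(k_\alpha \) and the neighbor distance d are illustrative values, since the paper does not report them.

```python
import numpy as np

def usual_reward(rho_i, rho_tilde, rho_c, rho_ints, rho_defs, k_alpha=2.0, d=20.0):
    """r_usu of Eq. (13): chase the nearest intruder, push intruders away from the
    region, and avoid companion defenders (k_alpha and d are illustrative values)."""
    r = -k_alpha * np.linalg.norm(rho_i - rho_tilde)        # defender-intruder distance
    r += k_alpha * np.linalg.norm(rho_tilde - rho_c)        # nearest intruder vs region
    r += np.mean([np.linalg.norm(p - rho_c) for p in rho_ints])  # all intruders vs region
    for p in rho_defs:                                      # companion defenders within d
        dist = np.linalg.norm(rho_i - p)
        if dist <= d:
            r -= (1 - dist / d) ** 2
    return r

def change_reward(r_usu_t, r_usu_prev):
    """r_cha of Eq. (14): finite difference of the usual reward."""
    return r_usu_t - r_usu_prev

def revise_reward(S_v, T_e, t, dt=0.1):
    """r_rev of Eq. (15): spread the sharp event value S_v over the steps up to T_e."""
    return S_v * (2 * dt ** 2) / (T_e * (T_e + dt)) * t
```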

Algorithm 1 MRPM

Algorithm 1 outlines the proposed MRPM, encompassing steps 1–26 for the training process, with step 27 dedicated to the autonomous decision-making process in MRPE. The fitness and reward functions serve as the unique criteria in optimizing actor parameters \(\theta _i\), and their design is meticulously tailored to the objective and outcomes of a defender team. Consequently, these functions enable the precise regulation of each agent’s behavior to collectively accomplish a shared task after training.

Numerical simulations

Numerical simulations are conducted to assess the effectiveness of MRPM. Specifically, we compare MRPM with other protection methods and variations in rewards across two different MRPE cases. MRPM-I is used for method comparisons, while MRPM-II focuses on variations in reward functions. Here are the details.

Parameter settings and evaluation criteria

In the training process, the parameters can be divided into three classes: MRPE, MADDPG, and DE. The configuration of these parameters is crucial for ensuring convergence and effectiveness. The MRPE parameters can be selected according to the actual interception demand, such as the location and size of the protected region. They are configured in Table 2: MRPE is bounded within a coordinate system of range \([0, 200]^2\), featuring a circular defense region centered at \(\rho _c = (100, 100)\) with a radius \(d_c = 12\). Since there are more intruders than defenders, it is almost impossible for the defenders to successively intercept all attackers without a speed advantage. Thus, the maximum accelerations \(a_i^{\max }\) of the intruders and the defenders are 2 and 4, whereas the maximum velocities \(v_i^{\max }\) are 6 and 12, respectively. A time limit T of 30 s is set, which would allow all intruders to reach the region if unimpeded by defenders. To assess performance, each method is evaluated on two different MRPE cases: one with 2 defenders versus 3 intruders, and the other with 3 defenders versus 5 intruders. Importantly, all parameters are normalized during agent state calculations to enhance learning performance.

Table 2 MRPE Configuration
Table 3 Structure and parameter of actor and critic networks

The selection of MADDPG parameters can be guided by the recommendations in [33]. Based on prior experience, we set the discount factor \(\gamma \) to 0.95, and the learning rates \(\alpha _c\) for both the actor and critic networks are set to 0.02. The replay buffer size is \(5 \times 10^{5}\), and the batch size for updating is 256. Besides, networks with 5 dense layers have demonstrated favorable performance in region protection problems. Such networks offer several advantages. Firstly, they can capture the highly intricate and nonlinear associations inherent in the data. Secondly, the increased depth facilitates hierarchical and abstract representations of the input, enabling nuanced modeling of intricate relationships. This architecture is particularly adept at generating actions from states or approximating cumulative rewards from states and actions, as both are complex functions of the input. Conversely, there are notable disadvantages: such networks may require a large amount of data for effective training, although reinforcement learning mitigates this by continuously sampling data for ongoing training, and they are computationally intensive, so training time and resource requirements can become substantial. These considerations may render networks with 5 dense layers less practical for certain domains. The layers and properties of the actor and critic networks are provided in Table 3. In the actor networks, the inputs are the local observations \(o_i^t\), and the outputs are the actions \(a_i^t\), represented as \((a_\rightarrow - a_\leftarrow , a_\uparrow - a_\downarrow )\). In the critic networks, the inputs comprise all the states s and actions a, and the output is an approximate value \({\widehat{Q}}_i\) of \(R_i^t\).

Referring to [55], DE is configured with the following parameter settings: the scaling factor \(F_s\) is 0.48, the crossover probability \(C_r\) is 0.25, and the population size \(n_p\) is 10. Finally, the maximum number of generations is set to 3.5 million to ensure convergence and stability.

For the evaluation criteria, we select the following metrics: success rates (SRs), the fitness function \(F_T\), and the team reward function \(R_T\). SRs are the most intuitive and critical criterion. To mitigate the impact of randomness, defenders trained by each method autonomously execute the mission in \(1 \times 10^5\) different episodes for each case, and the number of successful protections is determined according to Definition 1. In addition to SRs, we consider more detailed metrics for data recording, including the number of damaged intruders \((N_{\text {int0}}-N_{\text {arr}})\), the count of surviving defenders \(N_{{\text {def}1}}\), and the completion time \(T_{c}\). Since these metrics collectively contribute to the fitness function \(F_T\), \(F_T\) is selected as another criterion. Subsequently, to examine the learning effect, we introduce the team reward \(R_T\), which is the accumulation of the true rewards of the defender team in one episode, calculated as follows:

$$\begin{aligned} R_T =\sum \limits _{ t \in [0, T]} \sum \limits _{ i \in \{ \text {def}\} }{r_i^t}. \end{aligned}$$
(16)

Finally, it is expected that these three criteria will exhibit similar growing trends as methods become more effective in training.
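As an illustration, the evaluation loop for SRs and the team reward \(R_T\) of Eq. (16) could be organized as below; `run_episode` is a hypothetical helper that rolls out the trained defenders once and returns the success flag and per-step member rewards.

```python
def evaluate(run_episode, n_episodes=100):
    """Estimate the success rate (SR) and the average team reward R_T of Eq. (16).

    run_episode(): assumed to return (succ, team_rewards) with
    team_rewards[t][i] = r_i^t for defender i at step t of one episode.
    """
    successes, team_returns = 0, []
    for _ in range(n_episodes):
        succ, team_rewards = run_episode()
        successes += int(succ)
        team_returns.append(sum(sum(step) for step in team_rewards))
    return successes / n_episodes, sum(team_returns) / n_episodes
```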

Validation of intruder strategy

In MRPE, designing an intelligent intruder and its strategy is crucial. Instead of random actions, we have developed an intruder strategy that enables swift movement toward the protection area while avoiding interception and collisions. To demonstrate its effectiveness, we conduct simulations comparing a random intruder strategy with the proposed strategy, as shown in Fig. 3. In these simulations, the number of defenders matches the number of intruders, and the defenders are either stationary or exhibit inefficient protection movements. The inefficient defenders are consistently attracted toward the nearest intruder, with the maximum acceleration and velocity both limited to 1.

Fig. 3 Four examples of intruders controlled by the random and the proposed strategies. When confronted by static or inefficient defenders, the random intruders consistently fail to reach the region. In contrast, all intruders employing the proposed strategy safely arrive within less than 24 s

In a series of ten simulations, the random intruders consistently fail to reach the region, whereas all intruders employing the proposed strategy safely arrive within 24 s when confronted by static or inefficient defenders. Specifically, Fig. 3 (1) depicts an example of random intruders facing static defenders. Despite the absence of disturbances, the intruders are unable to reach the region. In Fig. 3 (2), random intruders confront inefficient defenders. With the exception of one intruder leaving the MRPE area, the remaining intruders are almost entirely damaged: two collide, one is intercepted, and the sole survivor remains only because the MRPE time runs out; it would soon be captured. Conversely, Fig. 3 (3) and (4) illustrate the outcomes of the proposed intruder strategy, where all intruders safely reach the region. These examples underscore the superiority of our intruder strategy, which greatly increases the protection difficulty.

MRPM-I

The MRPM-I experiment involves four methods: DE, MADDPG, MRPM-we (without elite selection), and the proposed MRPM. DE represents an optimization-based protection method. MADDPG is a well-established reinforcement learning algorithm for multi-agent protection and confrontation [32, 33, 49,50,51]. MRPM-we can be considered a standard ERL method. Notably, all the methods except DE incorporate policy gradients, and all the methods except MADDPG incorporate DE.

Table 4 presents the SRs of the methods over \(1 \times 10^5\) episodes for the two MRPE cases. In both cases, the proposed MRPM achieves the highest SRs of 97.02% and 91.27%, outperforming the other methods. In contrast, DE cannot successfully protect the region even once, highlighting the challenge of the scenario. Furthermore, the SRs of MADDPG in the two cases are 43.36% and 40.49%, while the SRs of MRPM-we are 68.31% and 72.26%, demonstrating the superiority of ERL. Notably, each method shows similar SRs across both cases, indicating that the methods are not highly sensitive to the number of agents.

Table 4 SRs of the methods in \(1 \times 10^5\) episodes for the two MRPE cases

To monitor the training process, the reward and fitness curves for these methods in the two cases are shown in Figs. 4 and 5. The methods are tested every 10 generations, and the average results of 10 experiments are reported. The learning curves suggest that defenders are still able to intercept some intruders and survive in most protection failure episodes. This demonstrates the difficulty of achieving successful protection, the potential for improvement through training, and the necessity of using both the fitness and the reward function as performance criteria.

Fig. 4 Learning curves of MRPM-I for case 1. The left graph displays the average reward for each generation, while the right one shows the average fitness. Each method is distinguished by a different color. The shaded areas represent the confidence intervals for each method, indicating the mean ± standard deviation across ten different random episodes

Fig. 5 Learning curves of MRPM-I for case 2. The left graph displays the average reward for each generation, while the right one shows the average fitness. Each method is distinguished by a different color. The shaded areas represent the confidence intervals for each method, indicating the mean ± standard deviation across ten different random episodes

With the exception of DE, all methods exhibit a gradual increase in both reward and fitness curves, which underscores the importance of policy gradients. DE initially shows slight growth in rewards, but they plateau at around 0.4 after one million training generations, and the fitness curves do not show significant improvement. This can be attributed to the challenging nature of MRPE and the high dimensionality of the actor network parameters, where DE requires extensive exploration to find a better solution in the vast parameter space. Consequently, the fitness curves for DE remain at a consistently low level: according to Eq. (11), DE can only delay intruders to some extent but struggles to intercept them within the current training generations. Therefore, policy gradients indeed accelerate the DE optimization process in this problem. Conversely, the lack of DE makes MADDPG perform worse than MRPM-we and MRPM, as DE plays an indispensable role in producing diverse samples. Evidently, MADDPG gets trapped in local optima, as it only manages to intercept an average of about 2.3 intruders in case 1 and 4 intruders in case 2. Additionally, the shaded areas of MADDPG are larger in both cases, indicating higher instability. Thus, it becomes evident that the combination of DE and policy gradients is essential. MRPM outperforms MRPM-we in terms of final convergence reward and fitness value, showcasing the effectiveness of the proposed elite selection strategy.

Fig. 6 Learning curves of MRPM-II for case 1. The left graph displays the average reward for each generation, while the right one shows the average fitness. Each method is distinguished by a different color. The shaded areas represent the confidence intervals for each method, indicating the mean ± standard deviation across ten different random episodes

Fig. 7 Learning curves of MRPM-II for case 2. The left graph displays the average reward for each generation, while the right one shows the average fitness. Each method is distinguished by a different color. The shaded areas represent the confidence intervals for each method, indicating the mean ± standard deviation across ten different random episodes

Fig. 8 Region protection processes of MRPM-I methods in case 2

Fig. 9 Region protection processes of MRPM-II methods in case 2

MRPM-II

MRPM-II consists of four types of reward functions: MRPM-u, MRPM-ur, MRPM-uc, and the proposed MRPM. In MRPM-u, the mainstream reward design method is used. The reward values, denoted as \({r_{\text {usu}}}_{i}^{t}\), are calculated at each time step, and a sharp value \(S_v\) in Eq. (15) is added when specific events occur. MRPM-ur removes the change reward \({r_{\text {cha}}}_{i}^{t}\) component from Eq. (12), while MRPM-uc removes the \({r_{\text {rev}}}_{i}^{t}\) term.

Similarly, Table 4 records SRs of MRPM-u, MRPM-ur, and MRPM-uc. MRPM-u fails to achieve success in both cases, underscoring the ineffectiveness of using a single reward term \({r_{\text {usu}}}_{i}^{t}\) without a hybrid approach. In case 1, MRPM-ur and MRPM-uc achieve SRs of 60.74% and 13.65%. However, as the number of agents increases in case 2, there is a notable decline in SRs, dropping to 3.21% and 0.40% for MRPM-ur and MRPM-uc, respectively. Particularly, MRPM-ur demonstrates relatively high SR in case 1 but struggles to succeed in case 2. The performance decline can be attributed to the lack of the revised reward \({r_{\text {rev}}}_{i}^{t}\), which shows increased significance in scenarios with a higher number of agents. In cases with fewer agents, the role of \({r_{\text {rev}}}_{i}^{t}\) is limited due to the reduced frequency of events. Finally, when compared with other MRPM-II methods, the proposed MRPM still achieves the highest SRs in both cases, highlighting the effectiveness of the fitness design.

The learning curves of these methods are presented in Figs. 6 and 7. From the reward curves, it can be observed that MRPM, MRPM-ur, and MRPM-uc exhibit gradual improvement in finding better actor parameters. However, MRPM-u shows a constantly fluctuating reward curve with no significant improvement. Based on the fitness curves, MRPM-u's training process appears highly ineffective, as it manages to intercept an average of only 0–1 intruders in both cases. Consequently, MRPM-u fails to achieve protection success, consistent with the findings presented in Table 4. This can be attributed to the high nonstationarity of MRPE, where sharp values \(S_v\) are frequently added to the rewards. Regarding the fitness curves, only MRPM successfully captures action networks near the global optimum, enabling it to intercept all intruders after 2.5 million training generations. Besides, the reward and fitness curves of MRPM exhibit similar trends, demonstrating the good interpretability of the reward design. On the other hand, MRPM-ur and MRPM-uc become trapped in local optima and fail to achieve the same level of performance. MRPM-ur intercepts 2–3 intruders in both cases; although its SRs decrease from case 1 to case 2, the number of interceptions remains almost the same. Thus, the interception capability does not change, and the decline in SRs is mainly due to the increased number of intruders. MRPM-uc lags significantly behind MRPM-ur, underscoring the stability and necessity of the change reward \({r_{\text {cha}}}_{i}^{t}\), as it strengthens defender behaviors of pursuing intruders. Therefore, the absence of \({r_{\text {cha}}}_{i}^{t}\) results in sparse rewards, leaving defenders uncertain about their actions. Ultimately, MRPM still achieves the best learning performance in MRPM-II, and thus the proposed reward in Eq. (12), enhanced by the three terms, demonstrates its superiority.

Protection processes

Since case 2 is more complex than case 1, we illustrate representative MRPE processes of case 2 in Figs. 8 and 9 to offer a better understanding of the performance variations among the methods. Figure 8 shows the MRPM-I protection processes, where Fig. 8a–c are successful scenarios. In these three scenarios, intruder 6 is the closest to the region. Compared to the other methods, MRPM maintains the largest distance between intruder 6 and the region. This outcome validates the high interception efficiency of MRPM, which can be attributed to the proposed elite selection strategy, enabling the rapid formation of a team comprising defenders with high fitness values.

A notable observation is that defenders in MRPM adopt a spiral chasing method, allowing them to maintain maximum speed for a longer duration. This approach results in relatively smooth trajectories with larger turning radii. In contrast, defenders in MADDPG tend to move directly towards intruders. However, after intercepting the first batch of intruders (5, 7, 8), defenders in MADDPG are required to slow down and turn around to pursue intruders 4 and 7, resulting in a waste of time. Consequently, the trajectories in MADDPG exhibit three sharp twists. Although the spiral chasing method employed by MRPM is highly efficient, it necessitates higher attack accuracy due to the higher velocities of the defenders, which are challenging to adjust in a short time. If an intruder successfully avoids an attack due to low accuracy, it results in significant time wastage in recapturing the escaping intruder. For instance, defender 2 in MRPM-we engages in a prolonged confrontation with intruders 4 and 5, further demonstrating the fast and accurate interception capabilities of MRPM, where intruders have limited chances to escape. Lastly, DE initially sends defenders to approach and delay intruders. However, the defenders lack the ability to turn around, indicating that the training generation is far from sufficient for DE. This further validates the necessity of policy gradients in achieving effective defense strategies.

Representative MRPM-II episodes are depicted in Fig. 9. In MRPM-uc, defenders consistently reduce their distances to intruders, resulting in damage to intruders 5 and 7. However, due to the absence of event incentives, the defenders fail to execute accurate and efficient actions for the final attack. As a result, intruders 4, 6, and 8 reach the region. In MRPM-ur, defenders patrol around the region, waiting to collide with intruders. However, they cannot swiftly approach intruders due to the lack of the change reward. Consequently, intruders 6–8 successfully reach the region. In MRPM-u, defenders are capable of approaching intruders before the turbulence caused by events occurs. However, they may struggle to execute normal actions once an event takes place. For instance, defender 1 loses its interception ability after intruder 5 escapes. Defender 2 oscillates between intruders 4 and 8, failing to sustain pursuit once intruder 8 arrives. Defender 3 gets trapped in chaotic movements after colliding with intruder 6. As a result, MRPM-u is limited to environments with low levels of nonstationarity, where events have minimal impact on the effectiveness of defender actions. Finally, by comparing these variants, it becomes evident that MRPM exhibits superiority over the alternative approaches.

Conclusions and future works

This paper first develops a multi-agent protection environment, MRPE, featuring fewer defenders, defender damages, and intruder evasion strategies targeting defenders. MRPE is designed to be more practical but also poses challenges for traditional protection methods due to its high nonstationarity and limited interception time window. To address these challenges, the corresponding protection method MRPM is proposed by combining DE and MADDPG. DE facilitates diverse sample exploration and overcomes sparse rewards, while MADDPG trains defenders and expedites the DE convergence process. Subsequently, an elite selection strategy tailored for multi-agent systems is devised to enhance defender collaboration. Besides, the fitness and reward functions are ingeniously designed to effectively drive policy optimizations. Finally, extensive numerical simulations are conducted to assess the effectiveness of MRPM. These simulations encompass two MRPE cases and involve a comprehensive comparison between MRPM and other approaches, including MADDPG, DE, and MRPM without the elite selection strategy. The investigation also delves into the influence of different reward schemes on the outcomes. The results unequivocally highlight a substantial enhancement in collaborative defense success achieved by MRPM compared to the other considered methods.

Our proposed method aims at steering the autonomous decision-making of unmanned system swarms, notably unmanned surface vessels (USVs). However, it is worth acknowledging that the underactuation dynamics inherent to USVs may limit the feasibility of certain actions. To address this challenge, our future research will incorporate specific reward terms to handle the constraints of underactuation dynamics. Furthermore, our proposed method effectively handles scenarios involving a small number of agents. However, as the number of agents increases, the curse of dimensionality becomes a significant concern. In future research, we intend to develop an encoder method for state and action features to mitigate this dimensionality issue.