Introduction

In aerial military competition, the pursuit-evasion problem of Unmanned Aerial Vehicles (UAVs) is a typical example. It can be described as a reciprocal confrontation between two aircraft driven by conflicting interests. Finding optimal solutions to this pursuit-evasion problem is of great practical significance. As a typical class of Differential Games (DG), the pursuit-evasion problem is widely present in both the natural world and the military domain [1].

Game theory is a theoretical and methodological framework that models real-world situations of conflict, competition, and cooperation as mathematical problems. Differential games are an important branch of game theory in which the dynamics or rules of the game are described by differential equations [2]. Differential games originated from military problems; in recent years, however, they have been applied not only in military research but also extensively in fields such as daily life and economics [3]. A differential game combines game theory with modern control theory: it extends unilateral optimal control to bilateral (or multilateral) optimal control and static games to dynamic conflicts, and its range of applications continues to widen. Military problems often involve complex nonlinear models rather than linear control problems, for example fixed-time prescribed-performance trajectory-tracking control of unmanned surface vehicles with unknown dynamics and disturbances [4], naval combat, satellite interception, and missile defence [5]. Differential game theory provides effective modelling and analysis for both unilateral and multilateral conflict scenarios and can be solved with control methods such as optimal control, yielding the optimal control strategies of all parties involved. Differential game problems therefore have significant theoretical research value and promising application prospects.

The pursuit-evasion game in aerial combat, a typical instance of a differential game, can be regarded as a multi-agent dynamic game system because the objectives of the two sides conflict. The optimal control strategy of each aircraft therefore depends on the respective interests of the pursuer and the evader [6]. Reinforcement Learning (RL), an important branch of machine learning and a hot topic of current research, is an autonomous learning method. It requires neither expert signals nor a strict mathematical model; instead, it learns by interacting with the environment in a trial-and-error manner, continually trying and improving behavioural strategies and adapting to dynamic, unknown environments. As its applications expand, the ability of reinforcement learning to handle high-dimensional, complex nonlinear systems has attracted increasing attention from researchers and practitioners in many fields. Reinforcement learning can adaptively find control policies for complex nonlinear systems [7]; it does not rely on a precise mathematical model, is computationally straightforward, and requires relatively little training data. Owing to these characteristics, the resulting control strategies can account for the interactions among multiple agents and between the agents and the environment, enabling adversarial or cooperative behaviour among agents and providing them with rational, reliable, and dynamic strategic support. Reinforcement learning therefore has great potential in the field of differential games.

To harness the strengths of both game theory and reinforcement learning, some scholars have proposed agent adversarial algorithms based on reinforcement learning, incorporating RL into the modelling of adversarial games. Reference [8] applied the Deep Deterministic Policy Gradient (DDPG) algorithm to multi-agent environments and proposed the MADDPG algorithm, which uses decentralised action execution with centralised training of policies. The algorithm is stable, addresses the high variance of the policy gradient, and enables groups of agents to develop cooperative strategies at both the physical and informational levels in environments featuring both cooperation and competition. Reference [9] introduces a cooperative learning model called the joint action learner and demonstrates its effectiveness experimentally. Reference [10] uses G2ANet to model the interactions between two agents: it first determines whether the agents interact and, if so, evaluates how important that interaction is for the agents' strategies in order to accelerate learning convergence. Reference [11] introduces mean-field reinforcement learning, which reduces the multi-agent problem to a problem between two neighbouring agents by replacing the influence of all other agents within range with an average value, thereby addressing the dimensionality explosion caused by a large number of agents in large-scale problems. Many other approaches to multi-agent RL exist, such as ES-Q (Q-learning based on experience sharing) [12] and Pareto-Q (Pareto Q-learning) [13], offering diverse perspectives and methods.

The M3DDPG algorithm [14] incorporates the minimax principle from game theory into the MADDPG algorithm, enhancing robustness and achieving good results in multi-agent environments. However, M3DDPG still suffers from sparse sample data and unstable convergence caused by random initialisation. To address the difficulty agents have in adapting to and perceiving complex dynamic environments, this paper builds on the Minimax Multi-Agent Deep Deterministic Policy Gradient (M3DDPG) algorithm. To improve learning efficiency and convergence speed, particle swarm optimisation is introduced to search and optimise the experience sample set of the M3DDPG algorithm, producing higher-quality training samples.

In summary, we take the M3DDPG algorithm as the basis for improvement and employ the Multi-Agent Adversarial Learning (MAAL) approach to address the excessive computation in continuous action spaces. The particle swarm optimisation (PSO) algorithm is introduced to optimise and update the set of experience samples. As a result, this study proposes an enhanced PSO-M3DDPG algorithm. The main contributions are as follows.

  • Using the MAAL approach to construct local linear functions that approximate the nonlinear state-value function, and employing a one-step gradient-descent approximation of the objective instead of an inner-loop minimisation, which reduces the computational complexity.

  • Introducing the PSO algorithm to optimise the sample data set for parameter optimisation, addressing local optima and unstable convergence.

Problem description and modelling

The multi-agent pursuit-evasion task for UAVs is modelled, with the PSO-M3DDPG algorithm serving as the decision-making unit of the pursuing UAV cluster. The pursuing UAVs cooperate with each other to effectively pursue multiple escaping UAVs in a complex battlefield environment.

Battlefield environment

This study builds a two-dimensional continuous battlefield of length L and width W as the task environment for the many-to-many pursuit-evasion problem of UAVs. In the task scenario, \(n\) \((n>m)\) pursuing UAVs hunt \(m\) \((m>1)\) escaping UAVs. \(P=\{{P}_{1},{P}_{2},\cdots ,{P}_{n}\}\) denotes the set of \(n\) pursuing UAVs, and \(E=\{{E}_{1},{E}_{2},\cdots ,{E}_{m}\}\) denotes the set of \(m\) escaping UAVs. At the beginning of the mission, the positions and velocities of both pursuing and escaping UAVs are randomly initialized. A safety zone of radius \({d}_{E}\) is established around each escaping UAV; if a pursuing UAV enters this zone, it successfully completes its tracking mission. If a pursuing or escaping UAV flies out of the battlefield boundary, the mission is considered a failure. Figure 1 illustrates a four-to-two pursuit-evasion UAV mission.

Fig. 1
figure 1

The four-to-two pursuit-evasion game of UAVs

Motion model

The UAV is simplified to the mass-point model, and its motion state is determined by its position and velocity, as depicted in Fig. 2.

Fig. 2
figure 2

UAV motion state diagram

The instantaneous state \({Q}_{t}^{i}\) of UAV \(i\) at the current time \(t\) is represented as:

$$ Q_{t}^{i} = \left[ {x_{t}^{i} ,y_{t}^{i} ,v_{t}^{i} ,\alpha_{t}^{i} } \right]^{{\text{T}}} $$
(1)

In the equation: \({x}_{t}^{i}\) and \({y}_{t}^{i}\) are the position coordinates of UAV i at time t; \({v}_{t}^{i}\) is the speed of UAV i at time t; \({\alpha }_{t}^{i}\) is the angle between the velocity direction of UAV i at time t and the positive X-axis, known as the heading angle, taken as positive for counterclockwise rotation from the positive X-axis.

By utilizing linear acceleration \({a}_{vt}^{i}\) and angular acceleration \({a}_{\alpha t}^{i}\) to control the speed and direction of the UAV, maneuverable flight can be achieved, as shown in Fig. 3.

Fig. 3
figure 3

Acceleration control of unmanned aerial vehicles

The instantaneous status information of the UAV i at the moment \(t+1\) is as follows:

$$ \left\{ {\begin{array}{*{20}l} {v_{t + 1}^{i} = v_{t}^{i} + a_{vt}^{i} \cdot \Delta t} \\ {\alpha_{t + 1}^{i} = \alpha_{t}^{i} + a_{\alpha t}^{i} \cdot \Delta t} \\ {x_{t + 1}^{i} = x_{t}^{i} + v_{t + 1}^{i} \cdot \cos \alpha_{t + 1}^{i} \cdot \Delta t} \\ {y_{t + 1}^{i} = y_{t}^{i} + v_{t + 1}^{i} \cdot \sin \alpha_{t + 1}^{i} \cdot \Delta t} \\ \end{array} } \right. $$
(2)
$$ Q_{t + 1}^{i} = \left[ {x_{t + 1}^{i} ,y_{t + 1}^{i} ,v_{t + 1}^{i} ,\alpha_{t + 1}^{i} } \right]^{{\text{T}}} $$
(3)

In the formula: \(\Delta t\) is the simulation step size.
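For concreteness, the kinematic update of Eqs. (2)-(3) can be sketched as follows; the function and variable names are illustrative only, and a fixed simulation step size dt is assumed.

import numpy as np

def step_uav(state, a_v, a_alpha, dt):
    """Propagate one UAV state [x, y, v, alpha] by one simulation step (Eqs. 2-3)."""
    x, y, v, alpha = state
    v_next = v + a_v * dt                      # update speed with the linear acceleration
    alpha_next = alpha + a_alpha * dt          # update heading with the angular acceleration
    x_next = x + v_next * np.cos(alpha_next) * dt
    y_next = y + v_next * np.sin(alpha_next) * dt
    return np.array([x_next, y_next, v_next, alpha_next])

# example: one step with dt = 0.1
q0 = np.array([0.0, 0.0, 10.0, 0.0])
q1 = step_uav(q0, a_v=1.0, a_alpha=0.05, dt=0.1)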

Task allocation model

The task allocation model for the many-to-many pursuit-evasion game assigns the pursuing UAVs to the escaping UAVs. The assignment is based on an advantage function designed from the Apollonius circle.

The position of pursuing UAV i is denoted \(\left({x}_{{P}_{i}},{y}_{{P}_{i}}\right)\), and the position of escaping UAV j is denoted \(\left({x}_{{E}_{j}},{y}_{{E}_{j}}\right)\). Their speed ratio is \(\frac{{v}_{{E}_{j}}}{{v}_{{P}_{i}}}={k}_{ij}<1\). With this information, the centre of the Apollonius circle is \(\left(\frac{{x}_{{E}_{j}}-{x}_{{P}_{i}}{k}_{ij}^{2}}{1-{k}_{ij}^{2}}, \frac{{y}_{{E}_{j}}-{y}_{{P}_{i}}{k}_{ij}^{2}}{1-{k}_{ij}^{2}}\right)\), and its radius is \(\frac{{k}_{ij}\sqrt{{\left({x}_{{E}_{j}}-{x}_{{P}_{i}}\right)}^{2}+{\left({y}_{{E}_{j}}-{y}_{{P}_{i}}\right)}^{2}}}{1-{k}_{ij}^{2}}\).

The advantage function can be defined as follows:

$$ X_{ij} (x) = \frac{{x_{{E_{j} }} - k_{ij}^{2} x_{{P_{i} }} - k_{ij} \sqrt {(x_{{E_{j} }} - x_{{P_{i} }} )^{2} + (y_{{E_{j} }} - y_{{P_{i} }} )^{2} } }}{{1 - k_{ij}^{2} }} $$
(4)

The task allocation problem consists of n pursuing UAVs completing the pursuit of m escaping UAVs. At any given time, each pursuing UAV pursues only one escaping UAV, and each escaping UAV is pursued by at least one pursuing UAV. Task allocation is carried out at the initial stage and remains unchanged until the end of the mission. The following decision variable is defined:

$$ a_{ij} = \left\{ \begin{gathered} 0{\text{ The i-th pursuit drone did not perform the j-th task}} \hfill \\ 1{\text{ The i-th pursuit drone was assigned to perform the j-th task}} \hfill \\ \end{gathered} \right. $$
(5)

The task allocation for the pursuing and escaping UAVs is formulated as the following 0-1 programming problem (Fig. 4):

$$ \, s.t.\left\{ \begin{gathered} \sum\limits_{i = 1}^{n} {a_{ij} = 1 \quad j{ = 1,2} \cdots {,}m} \hfill \\ \sum\limits_{j = 1}^{m} {a_{ij} = 1 \quad i = 1,2, \cdots ,n} \hfill \\ \sum\limits_{i = 1}^{n} {\sum\limits_{j = 1}^{m} {a_{ij} = m = n} } \hfill \\ a_{ij} = 0,1 \hfill \\ \end{gathered} \right. $$
(6)
Fig. 4
figure 4

The Apollonius circle formed by two-to-two pursuit

The overall objective function is as follows:

$$ V(x) = \sum\limits_{i = 1}^{n} {\sum\limits_{j = 1}^{m} {a_{ij} X_{ij} } } $$
(7)

The optimal task allocation is as follows:

$$ A_{\tau }^{*} = \arg \max V(x) $$
(8)
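As an illustration only, the advantage matrix of Eq. (4) and the assignment of Eqs. (6)-(8) can be computed as sketched below. This assumes, as in Eq. (6), equal numbers of pursuers and evaders so that the problem reduces to a standard assignment, solved here with SciPy's Hungarian routine; all function names are hypothetical.

import numpy as np
from scipy.optimize import linear_sum_assignment

def advantage(p_pos, e_pos, k):
    """Advantage X_ij of Eq. (4) for one pursuer-evader pair with speed ratio k < 1."""
    dx = e_pos[0] - p_pos[0]
    dy = e_pos[1] - p_pos[1]
    dist = np.hypot(dx, dy)
    return (e_pos[0] - k**2 * p_pos[0] - k * dist) / (1.0 - k**2)

def allocate(pursuers, evaders, k):
    """Solve the 0-1 assignment of Eqs. (6)-(8) by maximising the total advantage of Eq. (7)."""
    X = np.array([[advantage(p, e, k) for e in evaders] for p in pursuers])
    rows, cols = linear_sum_assignment(-X)   # Hungarian method; negate to maximise
    return list(zip(rows, cols))             # (pursuer index, evader index) pairs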

Escape strategy

Three escape strategies of gradually increasing complexity and intelligence are designed to enable progressive training of the pursuing UAVs. The escaping UAVs employ three manoeuvring strategies: straight-line motion, curved motion, and intelligent evasion. Straight-line motion: the escaping UAV performs variable-speed straight-line motion. Curved motion: the escaping UAV follows a sinusoidal trajectory within the mission scenario. Intelligent evasion: when pursuing UAVs enter the detection range of an escaping UAV, the escaping UAV moves in the direction perpendicular to the line towards the geometric center of the pursuing UAV cluster. In each round of algorithm training, the escaping UAV randomly adopts one of these motion modes. The three progressively more intelligent escape strategies are expressed as follows (Figs. 5, 6, 7).

Fig. 5
figure 5

The model diagram of straight-line motion for escape UAV

Fig. 6
figure 6

The model diagram of curve motion for escape UAV

Fig. 7
figure 7

The model diagram of intelligent escape motion for escape UAV

Straight-line motion:

$$ \left\{ \begin{gathered} v^{\prime}_{target} = v_{target} + a_{v} \cdot \Delta t \hfill \\ \theta^{\prime}_{target} = \theta_{target} \hfill \\ \end{gathered} \right. $$
(9)
$$ a_{v} \in \left[ { - \frac{\pi }{6},\frac{\pi }{6}} \right],\quad a_{\theta } = 0 $$
(10)
$$ \left\{ \begin{gathered} x_{target}^{\prime } = x_{target} + v^{\prime}_{target} \cdot \Delta t \cdot \cos \theta_{target} \hfill \\ y_{target}^{\prime } = y_{target} + v^{\prime}_{target} \cdot \Delta t \cdot \sin \theta_{target} \hfill \\ \end{gathered} \right. $$
(11)

Curved motion:

$$ y_{target}^{\prime } = y_{target} + k \cdot \sin \left( {\frac{1}{k}\left( {x_{target}^{\prime } - m} \right)} \right) $$
(12)
$$ \theta_{target}^{\prime } = \arctan \left( { - \cos \left( {\frac{1}{k}\left( {x_{target} - m} \right)} \right)} \right) $$
(13)
$$ \left\{ \begin{gathered} x_{target}^{\prime } = x_{target} + v_{target} \cdot \Delta t \cdot \cos \theta_{target}^{\prime } \hfill \\ y_{target}^{\prime } = y_{target} + k \cdot \sin \left( {\frac{1}{k}\left( {x_{target}^{\prime } - m} \right)} \right) \hfill \\ \end{gathered} \right. $$
(14)
$$ \begin{gathered} a_{v} = 0 \, \hfill \\ a_{\theta } = (\theta^{\prime}_{target} - \theta_{target} )/\Delta t \hfill \\ \end{gathered} $$
(15)

Intelligent evasion motion:

$$ \begin{gathered} x_{center} = \frac{{x_{1} + x_{2} + \cdots + x_{n} }}{n} \hfill \\ y_{center} = \frac{{y_{1} + y_{2} + \cdots + y_{n} }}{n} \hfill \\ \end{gathered} $$
(16)
$$ \theta_{tar - uav} = \arctan ((y_{target} - y_{center} )/(x_{target} - x_{center} )) $$
(17)
$$ \theta^{\prime}_{target} = \frac{\pi }{2} + \theta_{tar - uav} $$
(18)
$$ a_{v} \in \left[ {0,\frac{\pi }{6}} \right],\quad a_{\theta } = \left( {\frac{\pi }{2} + \theta_{tar - uav} - \theta_{target} } \right)/\Delta t $$
(19)

Here \(\Delta t\) is the simulation step size. \({x}_{target},{y}_{target}\) are the coordinates of the escaping UAV, and \({x}_{target}^{\prime},{y}_{target}^{\prime}\) are its coordinates at the next time step. \({v}_{target},{\theta }_{target}\) are the magnitude and direction of the escaping UAV's velocity, where the direction is the angle between the velocity and the positive X-axis; \({v}_{target}^{\prime},{\theta }_{target}^{\prime}\) are the corresponding quantities at the next time step. \({a}_{v},{a}_{\theta }\) are the linear and angular accelerations of the escaping UAV. \(k\) and \(m\) are parameters controlling the curvature of the curved motion. \({x}_{center}\) and \({y}_{center}\) are the coordinates of the geometric center of the pursuing UAV cluster. \({\theta }_{tar-uav}\) is the angle of the line connecting the escaping UAV and the geometric center of the pursuing UAVs within detection range.
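A minimal sketch of the three escape strategies, following Eqs. (9)-(19). All function and parameter names are illustrative; arctan2 is used here to resolve the quadrant in Eq. (17), and the random draw stands in for the per-step linear acceleration sampled from the interval in Eq. (10).

import numpy as np

def straight_line(v, theta, dt, rng):
    """Variable-speed straight-line motion (Eqs. 9-11): random a_v, heading fixed."""
    a_v = rng.uniform(-np.pi / 6, np.pi / 6)
    return v + a_v * dt, theta

def curved(x, theta, k, m, dt):
    """Sinusoidal motion (Eqs. 12-15): heading follows the curve, speed fixed."""
    theta_new = np.arctan(-np.cos((x - m) / k))   # heading as given in Eq. (13)
    a_theta = (theta_new - theta) / dt            # Eq. (15)
    return theta_new, a_theta

def intelligent_evasion(target_xy, pursuer_xy, theta, dt):
    """Move perpendicular to the line towards the pursuers' geometric center (Eqs. 16-19)."""
    center = np.mean(pursuer_xy, axis=0)          # geometric center of the detected pursuers
    theta_tar_uav = np.arctan2(target_xy[1] - center[1], target_xy[0] - center[0])
    theta_new = np.pi / 2 + theta_tar_uav         # Eq. (18)
    a_theta = (theta_new - theta) / dt            # Eq. (19)
    return theta_new, a_theta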

Algorithm design

MADDPG

The Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm [15] addresses reinforcement learning in multi-agent environments where agents interact with each other. Its core idea is distributed execution with centralized training: each agent estimates its policy and outputs actions from its own state, optimizing the policy network through Q-values, while the joint state is used to estimate values and output Q-values, optimizing the action-value function from environmental rewards.

MADDPG is an actor–critic reinforcement learning algorithm, with each agent having its own actor–critic pair. The actor is responsible for policy estimation, while the critic is responsible for value estimation; the information shared among all agents is the joint state-action space. For an agent, the basic process for reaching the optimal strategy is as follows. First, the actor network selects a policy based on its input state, which yields an action. Then, the critic network calculates the agent's action value, i.e., the Q-value, from the joint state-action information. Finally, the critic network performs value estimation and optimizes the value based on environmental feedback, while the actor network performs policy estimation and optimizes the policy based on that value. This cycle continues until the value is maximized and every agent obtains its optimal strategy.

During the training phase, a batch of s samples \(\left( x^{j}, a^{j}, r^{j}, x^{\prime j} \right)\) is randomly drawn from the replay buffer \(D\) and fed into the critic network [15]. The target values are computed as follows:

$$ y^{j} = \left. {r_{i}^{j} + \gamma Q_{i}^{\mu } \left( {x^{\prime j} ,a_{1}^{\prime } , \cdots ,a_{n}^{\prime } } \right)} \right|_{{a_{k}^{\prime } = \mu_{k}^{\prime } \left( {o_{k} } \right)}} $$
(20)

The loss function for the critic network is as follows:

$$ L\left( {\theta_{i} } \right) = \frac{1}{s}\sum\nolimits_{j} {\left[ {\left( {Q_{i}^{\mu } (x^{j} ,a_{1}^{j} , \cdots ,a_{n}^{j} ) - y^{j} } \right)^{2} } \right]} $$
(21)

Then, the parameters of the critic network are updated through gradient descent using the following update formula.

$$ \theta^{\prime}_{critic} = \theta_{critic} - \alpha \frac{{\partial (y^{\prime} - y)}}{{\partial \theta_{critic} }} $$
(22)

After receiving the Q values, the actor network is trained and updated using the gradient update formula as follows.

$$ \nabla_{{\theta_{i} }} J(\mu_{i} ) = \frac{1}{s}\sum\nolimits_{j} {\nabla_{{\theta_{i} }} \mu_{i} (a_{i} |o_{i}^{j} )\nabla_{{a_{i} }} \left. {Q_{i}^{\mu } (x^{j} ,a_{1}^{j} , \ldots ,a_{n}^{j} )} \right|_{{a_{i} = \mu_{i} (o_{i}^{j} )}} } $$
(23)

Since the goal of the actor network is to maximise the Q values, it is updated by gradient ascent, and the update formula is as follows [16]:

$$ \theta^{\prime}_{actor} = \theta_{actor} + \beta \nabla_{{\theta_{i} }} J(\mu_{i} )\log \pi_{{\theta_{actor} }} (s_{t} ,a_{t} ) \cdot y $$
(24)

where α and β are the step-size (learning-rate) parameters of the critic and actor updates, respectively.
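The critic and actor updates of Eqs. (20)-(24) can be sketched as below. This is an illustrative PyTorch-style implementation, not the paper's code; it assumes each agent object exposes actor/critic networks, their target copies, and optimizers under the attribute names used here.

import torch
import torch.nn.functional as F

def maddpg_update(batch, agents, gamma):
    """One MADDPG update following Eqs. (20)-(24).

    batch = (obs, actions, rewards, obs_next), each a list with one tensor per agent.
    The attribute names on the agent objects are illustrative assumptions."""
    obs, actions, rewards, obs_next = batch
    for i, ag in enumerate(agents):
        # Eq. (20): target value y from the target actors and target critic
        with torch.no_grad():
            a_next = torch.cat([a.actor_target(o) for a, o in zip(agents, obs_next)], dim=-1)
            y = rewards[i] + gamma * ag.critic_target(torch.cat(obs_next, dim=-1), a_next)

        # Eqs. (21)-(22): critic loss and gradient-descent step
        q = ag.critic(torch.cat(obs, dim=-1), torch.cat(actions, dim=-1))
        critic_loss = F.mse_loss(q, y)
        ag.critic_opt.zero_grad()
        critic_loss.backward()
        ag.critic_opt.step()

        # Eqs. (23)-(24): actor update by gradient ascent on the Q-value
        a_i = ag.actor(obs[i])
        joint = [a.detach() for a in actions]
        joint[i] = a_i
        actor_loss = -ag.critic(torch.cat(obs, dim=-1), torch.cat(joint, dim=-1)).mean()
        ag.actor_opt.zero_grad()
        actor_loss.backward()
        ag.actor_opt.step()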

Max–min principle

Applying the max–min principle in reinforcement learning means seeking, in the worst case, the maximum of the minimum of the Q function [17]. To learn more robust strategies, it is assumed that the other agents take the actions most unfavourable to the agent itself, i.e. the other pursuing UAVs adopt the actions with the minimum Q-value. Optimizing the agent's cumulative reward under this assumption enhances the robustness of its strategy. This leads to the minimax learning objective \({J}_{M}({\theta }_{i})\):

$$ \nabla_{{\theta_{i} }} J_{M} (\theta_{i} ) = E_{x\sim D} \left[ {\nabla_{{\theta_{i} }} \mu_{i} (O_{i} )\nabla_{{a_{i} }} Q_{M,i}^{\mu } (x,a_{1}^{*} , \ldots ,a_{i} , \ldots ,a_{N}^{*} )} \right] $$
$$ a_{i} = \mu_{i} (O_{i} ) $$
$$ a_{j \ne i}^{*} = \arg \min_{{a_{j \ne i} }} Q_{M,i}^{\mu } (x,a_{1} ,a_{2} , \ldots ,a_{N} ) $$
(25)

The critic network is updated by minimizing the estimation error. The loss function is as follows:

$$ L(\theta_{i} ) = E_{x,a,r,x^{\prime}\sim D} \left[ {\left( {Q_{M,i}^{\mu } (x,a_{1} ,a_{2} , \ldots ,a_{N} ) - y} \right)^{2} } \right] $$
$$ y = r + \gamma Q_{M,i}^{\mu^{\prime}} (x^{\prime},a_{1}^{*\prime} , \ldots ,a_{i}^{\prime} , \ldots ,a_{N}^{*\prime} ) $$
$$ a_{i}^{\prime} = \mu_{i}^{\prime} (O_{i} ) $$
$$ a_{j \ne i}^{*\prime} = \arg \min_{{a_{j \ne i}^{\prime} }} Q_{M,i}^{\mu^{\prime}} (x^{\prime},a_{1}^{\prime} ,a_{2}^{\prime} , \ldots ,a_{N}^{\prime} ) $$
(26)

The actor network is updated using the sampled policy gradients to optimize its parameters. The optimization formula is as follows:

$$ \nabla_{{\theta_{i} }} J \approx \frac{1}{S}\sum\nolimits_{k} {\nabla_{{\theta_{i} }} \mu_{i} } (o_{i} )\nabla_{{a_{i} }} Q_{M,i}^{\mu } (x^{k} ,a_{1}^{*} , \cdots ,a_{i} , \cdots ,a_{N}^{*} ) $$
$$ a_{i} = \mu_{i} (o_{i} ) $$
(27)

Multi-agent adversarial learning

When the max–min objective is solved directly, the continuous action space and the nonlinear Q function lead to a tremendous computational load. The Multi-Agent Adversarial Learning (MAAL) method approximates the nonlinear state-value function, i.e. the Q function, by constructing local linear functions, and replaces the inner-loop minimization with a one-step gradient-descent approximation, which effectively alleviates the problem.

A set of perturbations \(\varepsilon \) is introduced to perturb the actions \({a}^{*}\) that minimize the Q-value, and the Q function \(Q_{M,i}^{\mu } \left( {x, a_{1}^{\prime}, \cdots ,a_{n}^{\prime} } \right)\) is linearized. A perturbation \({\varepsilon }_{j}\) that locally approximates the Q function along the gradient direction is sought; a single small gradient step then approximates the perturbation that yields the Q-value-minimizing actions \({a}^{*}\), as expressed in the following equation:

$$ a_{j \ne i}^{*\prime} = a_{j \ne i}^{\prime} + \varepsilon_{j \ne i} $$
$$ \varepsilon_{j \ne i} = \arg \min_{{\varepsilon_{j \ne i} }} Q_{M,i}^{\mu^{\prime}} (x^{\prime} ,a_{1}^{\prime} + \varepsilon_{1} , \ldots ,a_{i}^{\prime} , \ldots ,a_{N}^{\prime} + \varepsilon_{N} ) $$
$$ \widehat{\varepsilon_{j \ne i}} = - \alpha \nabla_{{a_{j \ne i} }} Q_{M,i}^{\mu^{\prime}} (x,a_{1}^{\prime} , \ldots ,a_{i} , \ldots ,a_{N}^{\prime} ) $$
(28)

where α represents an adjustable coefficient that can influence the step size of the gradient descent solver. A smaller α results in a smaller step size, which can improve computational precision but may make it more challenging to find an appropriate perturbation value. Conversely, a larger α leads to a larger step size but may result in poor performance of the linear fitting function, which is not conducive to effective training and learning.
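The one-step gradient approximation of Eq. (28) can be sketched as follows; the critic signature and tensor layout are assumptions carried over from the MADDPG sketch above, not the paper's implementation.

import torch

def maal_worst_case_actions(critic, x, actions, i, alpha):
    """One-step gradient approximation of the worst-case joint action (Eq. 28).

    Instead of an inner-loop minimisation over the other agents' actions, take a single
    gradient-descent step of size alpha on the critic with respect to those actions."""
    perturbed = [a.clone().detach().requires_grad_(True) for a in actions]
    q = critic(x, torch.cat(perturbed, dim=-1)).sum()
    grads = torch.autograd.grad(q, perturbed)
    worst = []
    for j, (a, g) in enumerate(zip(perturbed, grads)):
        if j == i:
            worst.append(a.detach())                # agent i's own action is left unperturbed
        else:
            worst.append((a - alpha * g).detach())  # epsilon_j = -alpha * dQ/da_j, per Eq. (28)
    return worst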

Minimax multi-agent deep deterministic policy gradient

The M3DDPG algorithm, proposed as an improvement on the aforementioned algorithm, addresses the issue of deep reinforcement learning agents that are often fragile and sensitive to the training environment, especially in multi-agent scenarios [14].

To learn robust policies, the M3DDPG algorithm introduces the minimax principle, assuming that the other agents make the decisions most disadvantageous to the agent itself. It also employs the multi-agent adversarial learning approach to reduce the large computational cost of maximizing and minimizing the nonlinear Q function. These measures enhance the robustness and convergence of the M3DDPG algorithm.

The pseudocode for the M3DDPG algorithm is as follows:

Algorithm:
figure a

M3DDPG Algorithm

Particle swarm optimization algorithm

The Particle Swarm Optimization (PSO) algorithm is an evolutionary computing technique proposed by Eberhart and Kennedy in 1995 [18]. Unlike most evolutionary algorithms, PSO does not employ survival-of-the-fittest selection. Owing to its simplicity and low computational cost, it has been successfully applied to a range of continuous optimization problems. In the particle swarm algorithm, each particle represents a potential solution to the problem; optimization problems are solved through the simple behaviours of individual particles and the exchange of information within the population. In the context of RL problems, each particle represents a candidate policy and, through iterations, the PSO aims to find the optimal policy [19]. In each iteration, particles update their velocities and positions by tracking two "extremes": the best solution found by the particle itself (personal best, pbest) and the best solution found by the entire population (global best, gbest).

The update process for particles is as follows (Fig. 8).

Fig. 8
figure 8

Update method of the particle

In the diagram, \(\overrightarrow{x}\left(t\right)\) represents the particle's position at time t, \(\overrightarrow{x}(t+1)\) represents the particle's position at the next time step, \(\overrightarrow{v}(t)\) represents the particle's velocity at time t, \(\overrightarrow{v}(t+1)\) represents the particle's velocity at the next time step, \(\overrightarrow{p}\left(t\right)\) represents the best solution found by the particle at time t, \(\overrightarrow{g}(t)\) represents the historical best solution found by the entire particle swarm up to time t.

Each particle determines its own velocity and adjusts its trajectory based on its individual experience \(\overrightarrow{p}(t)\) and the collective experience \(\overrightarrow{g}(t)\) of the group. They move towards the optimal point. Different particles calculate their individual fitness values based on the corresponding objective function and assess their own quality [20]. The update formulas for particle velocity and position are as follows:

$$ v_{id} \left( {t + 1} \right) = wv_{id} \left( t \right) + c_{1} r_{1} \left( {p_{i} (t) - x_{id} \left( t \right)} \right) + c_{2} r_{2} \left( {g(t) - x_{id} \left( t \right)} \right) $$
(29)
$$ x_{id} \left( {t + 1} \right) = x_{id} \left( t \right) + v_{id} \left( {t + 1} \right) $$
(30)

In the equations above: \(w\) is the inertia weight, which controls the change in particle velocity; \({r}_{1},{r}_{2}\) are random numbers in [0, 1], used to control the weights; \({c}_{1},{c}_{2}\) are learning factors representing the random acceleration weights by which particles move towards their individual and global best values, respectively.
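A minimal sketch of the velocity and position updates of Eqs. (29)-(30); the coefficient values shown as defaults are common choices for illustration, not values taken from the paper.

import numpy as np

def pso_step(x, v, pbest, gbest, w=0.7, c1=1.5, c2=1.5, rng=None):
    """One velocity/position update for the whole swarm (Eqs. 29-30).

    x, v, pbest: arrays of shape (num_particles, dim); gbest: array of shape (dim,)."""
    rng = rng or np.random.default_rng()
    r1 = rng.random(x.shape)
    r2 = rng.random(x.shape)
    v_new = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    x_new = x + v_new
    return x_new, v_new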

The PSO-M3DDPG algorithm

The M3DDPG algorithm is capable of rapid local exploration and acquires a large amount of sample data. However, because the sample space is sparse, it is prone to getting stuck in local optima or failing to converge. The PSO algorithm is therefore used to initialise multiple policy networks, forming a population of policy networks that interact with the environment, generate sample data, and store them in a buffer used to train the M3DDPG algorithm. This continuously improves the set of sample experience and combines the strengths of both algorithms, effectively addressing the problems inherent to each and enabling more efficient learning. The resulting algorithm, which uses PSO for strategy improvement, is called the PSO-M3DDPG algorithm.

To address the enormous computational burden that arises when the objective of the M3DDPG algorithm is solved with the minimax principle, a multi-agent adversarial approach is introduced. It linearly approximates the complex Q function with a simpler linear function and replaces the inner-loop minimization with a one-step gradient descent, which significantly reduces the computational load and optimizes the algorithm.

The workflow of the PSO-M3DDPG algorithm is as follows (Fig. 9):

Fig. 9
figure 9

PSO-M3DDPG algorithm structure diagram

Step 1: Initialize \(N\) policy network populations and M3DDPG network parameters.

Step 2: Compute the cumulative reward \(R\) for all policies within the population and store the transitions \(({s}_{t},{a}_{t},{r}_{t},{s}_{t+1})\).

Step 3: Agents interact with the environment based on decisions made by deep neural networks, completing one episode of operation.

Step 4: Rank the policy networks based on the cumulative reward provided by the environment as the fitness value.

Step 5: Select the top \(\varphi \text{\%}\) of policy networks as elites.

Step 6: Add random noise to the remaining policy networks to induce mutations.

Step 7: Store the transitions \(({s}_{t},{a}_{t},{r}_{t},{s}_{t+1})\) for the mutated policy networks.

Step 8: Train the M3DDPG network using the acquired experience dataset from the samples.

Step 9: Copy the M3DDPG network parameters to the policy network population.
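A condensed sketch of Steps 1-9 follows. The helpers evaluate, mutate, and m3ddpg_update, the elite fraction, and the torch-style state_dict copying are assumptions for illustration, not the paper's code; evaluate(policy, buffer) is assumed to run one episode in the environment, store the transitions, and return the cumulative reward R.

def pso_m3ddpg_train(population, m3ddpg, buffer, evaluate, mutate, m3ddpg_update,
                     episodes, elite_frac=0.2):
    """Sketch of the PSO-M3DDPG workflow (Steps 1-9)."""
    for _ in range(episodes):
        # Steps 2-3: roll out every policy in the population and store transitions
        fitness = [evaluate(policy, buffer) for policy in population]

        # Steps 4-5: rank by cumulative reward and keep the top share as elites
        ranked = [p for _, p in sorted(zip(fitness, population), key=lambda t: -t[0])]
        n_elite = max(1, int(elite_frac * len(ranked)))
        elites, rest = ranked[:n_elite], ranked[n_elite:]

        # Steps 6-7: perturb the remaining policies with random noise and roll them out
        mutants = [mutate(policy) for policy in rest]
        for policy in mutants:
            evaluate(policy, buffer)

        # Step 8: train the M3DDPG networks on the collected experience
        m3ddpg_update(m3ddpg, buffer)

        # Step 9: copy the trained network parameters back into the population
        population = elites + mutants
        for policy in population:
            policy.load_state_dict(m3ddpg.actor.state_dict())
    return m3ddpg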

The structure of the PSO-M3DDPG algorithm is shown in Fig. 9.

The pseudocode for the PSO-M3DDPG algorithm:

Algorithm:
figure b

PSO-M3DDPG Algorithm

figure c

Multi-agent pursuit and evasion strategies for UAVs based on the PSO-M3DDPG algorithm

State space

In the multi-agent pursuit-evasion decision task for UAVs, the local observation of a pursuing UAV includes its own state information, local interaction information, and the state information of the evading UAVs. The UAV's own state can be described as \(\left({x}_{i},{y}_{i},{v}_{i},{\theta }_{i},team\right),\) where \({x}_{i}\) and \({y}_{i}\) are the position coordinates, \({v}_{i}\) is the speed, \({\theta }_{i}\) is the velocity direction, and team indicates whether UAV \(i\) is part of a pursuit team, taking the value 0 or 1.

The local interaction information consists of the states of the three nearest friendly UAVs within communication range, selected by relative distance, \(({x}_{k},{y}_{k},{v}_{k},{\theta }_{k})\), \(({x}_{l},{y}_{l},{v}_{l},{\theta }_{l})\), and \(({x}_{m},{y}_{m},{v}_{m},{\theta }_{m})\), giving the positions, speeds, and velocity directions of nearby friendly UAVs. When there are not enough other pursuing UAVs within communication range, the remaining entries are filled with zeros.

In the multi-UAV pursuit-evasion decision task, the state of the evading UAVs is represented as \(\left({x}_{tar}^{j},{y}_{tar}^{j},{v}_{tar}^{j},{\theta }_{tar}^{j}\right)\), \(j=1,2,\cdots\), where \(j\) indexes the evading UAVs and the components denote the position, speed, and direction of each evading UAV.

The complete state space is as follows (Fig. 10).

Fig. 10
figure 10

State space for many-to-many pursuit and escape missions of UAVs
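A sketch of how the local observation described above might be assembled into a flat vector, with zero-filling of missing neighbour slots; the function and field names are illustrative only.

import numpy as np

def build_observation(own, neighbours, evaders, max_neighbours=3):
    """Assemble a pursuing UAV's local observation into a flat vector.

    own = (x, y, v, theta, team); neighbours and evaders are lists of (x, y, v, theta)."""
    obs = list(own)
    for k in range(max_neighbours):
        obs.extend(neighbours[k] if k < len(neighbours) else (0.0, 0.0, 0.0, 0.0))
    for e in evaders:
        obs.extend(e)
    return np.asarray(obs, dtype=np.float32)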

Action space

For UAV motion control, an acceleration-based approach is employed: the UAV's action at each step consists of the linear acceleration \({a}_{v}\) and the angular acceleration \({a}_{\alpha },\) forming a two-dimensional action space \(\left({a}_{v},{a}_{\alpha }\right),\) as shown in Fig. 11.

Fig. 11
figure 11

Action space of UAV

Network model

The M3DDPG algorithm optimized with the particle swarm algorithm is applied to decision-making in the multi-UAV pursuit-evasion task. Both the actor network and the critic network are four-layer fully connected neural networks; the specific number of neurons in each layer is shown in Fig. 12.

Fig. 12
figure 12

Neural network structure diagram
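A sketch of the four-layer fully connected actor and critic networks; the hidden-layer sizes and input dimensions below are placeholders, since the actual neuron counts are given in Fig. 12.

import torch.nn as nn

def make_mlp(in_dim, out_dim, hidden=(256, 128, 64), out_act=None):
    """A four-layer fully connected network of the kind used for both actor and critic."""
    h1, h2, h3 = hidden
    layers = [nn.Linear(in_dim, h1), nn.ReLU(),
              nn.Linear(h1, h2), nn.ReLU(),
              nn.Linear(h2, h3), nn.ReLU(),
              nn.Linear(h3, out_dim)]
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

obs_dim, act_dim, n_agents = 25, 2, 4                       # illustrative dimensions only
actor = make_mlp(obs_dim, act_dim, out_act=nn.Tanh())       # outputs (a_v, a_alpha)
critic = make_mlp(n_agents * (obs_dim + act_dim), 1)        # joint state-action -> Q value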

Reward function

In multi-UAV pursuit-evasion tasks, the primary considerations revolve around the completion of the pursuit task and the collaborative requirements among the pursuit team. Regarding the completion of the pursuit task, two types of guiding global rewards are designed based on distance and direction, along with two local rewards for successful capture and task failure. In terms of the collaborative requirements among the pursuit team, two local rewards are designed for forming a pursuit team and avoiding collisions among UAVs.

The reward function for pursuing UAV i in the multi-UAV pursuit-evasion task is defined as follows:

$$ r_{i} = r_{global}^{i} + r_{local}^{i} $$
(31)

The global reward \({r}_{global}\) is configured as follows:

$$ r_{global}^{i} = r_{d}^{i} + r_{a}^{i} $$
(32)

In the global reward \({r}_{global}\), \({r}_{d}\) represents the reward generated by the relative distance change between pursuing UAVs and evading UAVs, and its expression is as follows:

$$ r_{d}^{i} = \beta \cdot \left( {dis^{i} - dis_{ - }^{i} } \right) $$
(33)

\({r}_{a}\) represents the directional guidance reward, and its expression is as follows:

$$ r_{a}^{i} = \gamma \cdot \cos \varphi $$
(34)

where \(dis\) is the relative distance at the current time and \({dis}_{-}\) is the relative distance at the next time step; \(\varphi \) is the angle between the velocity vector of the pursuing UAV and the line connecting the positions of the pursuing and evading UAVs; \(\beta \) and \(\gamma \) are hyperparameters acting as weight coefficients.

The expression for the local reward, \({r}_{local}\), is as follows:

$$ r_{local}^{i} = r_{final}^{i} + r_{bound}^{i} + r_{team}^{i} + r_{danger}^{i} $$
(35)

The expression for the task completion reward, \({r}_{final}\), which represents the reward value for a UAV successfully capturing a single escaping UAV, is as follows:

$$ r_{final}^{i} = \left\{ {\begin{array}{*{20}l} {20 \quad successfully \, captured \, target} \hfill \\ {0 \quad other} \hfill \\ \end{array} } \right. $$
(36)

The boundary penalty, \({r}_{bound}\), applied when a pursuing or escaping UAV flies out of the mission area (signifying mission failure), is expressed as follows:

$$ r_{{_{bound} }}^{i} = \left\{ {\begin{array}{*{20}l} { - 20 \quad pursuit \, drone \, flies \, out \, of \, the \, mission \, area} \hfill \\ { - 20 \quad escape \, drone \, flies \, out \, of \, the \, mission \, area \, } \hfill \\ \end{array} } \right. $$
(37)

The team reward, \({r}_{team}\), determines whether the pursuing UAV has formed a sub-pursuit team and been assigned a pursuit mission; a positive reward is given when pursuing UAV \(i\) forms a sub-pursuit team:

$$ r_{team}^{i} = \left\{ {\begin{array}{*{20}l} {10 \quad drones \, form \, a \, pursuit \, team} \hfill \\ {0 \quad other} \hfill \\ \end{array} } \right. $$
(38)

The danger reward, \({r}_{danger}\), which represents the reward or penalty for collisions between pursuing UAVs and is used to ensure that pursuing UAVs maintain a safe distance from each other, is expressed as follows:

$$ r_{danger}^{i} = \left\{ {\begin{array}{*{20}l} { - 20} \hfill & {if \, d_{ij} \le d_{danger} } \hfill \\ {\alpha_{danger} \left( {d_{safe} - d_{ij} } \right)} \hfill & {if \, d_{danger} < d_{ij} \le d_{safe} } \hfill \\ 0 \hfill & {other} \hfill \\ \end{array} } \right. $$
(39)

where \({\alpha }_{danger}\) represents the weight coefficient.
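The reward terms of Eqs. (31)-(39) can be combined as in the sketch below; the coefficient and threshold values passed as defaults are illustrative placeholders, with the trained values coming from Table 1 and the text.

import numpy as np

def pursuer_reward(dis, dis_next, phi, d_ij, captured, out_of_bounds, in_team,
                   beta=0.1, gamma=0.1, d_danger=2.0, d_safe=5.0, alpha_danger=1.0):
    """Reward of one pursuing UAV, combining Eqs. (31)-(39)."""
    r_d = beta * (dis - dis_next)               # Eq. (33): closing the distance is rewarded
    r_a = gamma * np.cos(phi)                   # Eq. (34): heading towards the evader
    r_final = 20.0 if captured else 0.0         # Eq. (36): successful capture
    r_bound = -20.0 if out_of_bounds else 0.0   # Eq. (37): leaving the mission area
    r_team = 10.0 if in_team else 0.0           # Eq. (38): forming a pursuit team
    if d_ij <= d_danger:                        # Eq. (39): collision danger between pursuers
        r_danger = -20.0
    elif d_ij <= d_safe:
        r_danger = alpha_danger * (d_safe - d_ij)
    else:
        r_danger = 0.0
    return (r_d + r_a) + (r_final + r_bound + r_team + r_danger)   # Eqs. (31), (32), (35)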

Simulation experiments

Simulation parameters are set as follows (Table 1).

Table 1 UAVs pursuit-evasion task training parameters

Training process

Using both the basic M3DDPG algorithm and the improved PSO-M3DDPG algorithm as the decision units of the pursuing UAVs, models are trained for the multi-UAV pursuit-evasion game. In each training round, the initial states of the UAVs are randomly initialized. Different initial scenarios of the multi-UAV pursuit-evasion task are illustrated in Figs. 13 and 14.

Fig. 13
figure 13

Initial situation map of UAVs two-to-two pursuit and escape mission

Fig. 14
figure 14

Initial situation map of UAVs four-to-two pursuit and escape mission

After initialization, in each round of the training process the escaping UAV randomly adopts one of the manoeuvring modes (straight-line motion, curved motion, or intelligent evasion) to escape; the training results are presented below.

The convergence of the neural network parameters is analysed. Figure 15 shows how the mean and variance of the weight parameters of the 'actor_eval' network of both algorithms change as training proceeds.

Fig. 15
figure 15

Mean and variance variations of 'actor_eval' network weight parameters

As shown in the figure above, in the early training of the M3DDPG algorithm, because the neural network parameters are randomly initialized from a normal distribution, the pursuit-evasion decision-making process is prone to local optima; the parameters therefore have to be adjusted over a large range during training, resulting in slow convergence. In the training of the PSO-M3DDPG algorithm, by contrast, the PSO algorithm optimizes the policy-network population that generates the sample data, and the M3DDPG algorithm is used for exploration, so a better sample data set is applied to training. During the neural network updates, the overall range of parameter adjustment is smaller and convergence is significantly faster. As learning progresses, the neural network parameters gradually approach their optimal values until they converge to a stable state, yielding a stable decision-making model for the UAVs' behaviour.

The UAVs were trained using both the PSO-M3DDPG algorithm and the M3DDPG algorithm. The mean individual reward and the overall reward were recorded in each training round; these metrics indicate how well the UAVs interacted with the environment. The results are shown in Fig. 16.

Fig. 16
figure 16

Average individual and global rewards during training process

From the figure above, it can be observed that, as training progresses, the rewards gradually increase and eventually converge. The initial reward of the PSO-M3DDPG algorithm is higher than that of the M3DDPG algorithm, and its overall learning efficiency and final convergence results are significantly better. This indicates that using the PSO algorithm to optimize the sample data set markedly promotes the learning of the neural networks, accelerates convergence, and leads to a better converged result.

Validation process

The pursuing UAVs are trained against escaping UAVs employing different evasion strategies, and the performance of the trained neural network models is then validated. The converged artificial neural networks are used as the decision-making units of the pursuing UAVs, and multi-UAV pursuit-evasion tasks are conducted under varying conditions, including different numbers and initial states of UAVs. The trajectories of the pursuing UAVs are analysed to assess their performance.

When the escaping UAVs perform simple straight-line movements, the trajectory diagrams are as follows (Figs. 17, 18).

Fig. 17
figure 17

Trajectory diagram of UAVs for two-to-two pursuit and escape mission

Fig. 18
figure 18

Trajectory diagram of UAVs for four-to-two pursuit and escape mission

When the escaping UAVs perform simple curved movements, the trajectory diagrams are as follows (Figs. 19, 20).

Fig. 19
figure 19

Trajectory diagram of UAVs for two-to-two pursuit and escape mission

Fig. 20
figure 20

Trajectory diagram of UAVs for four-to-two pursuit and escape mission

When the escaping UAVs perform complex adversarial movements, the trajectory diagrams are as follows (Figs. 21, 22).

Fig. 21
figure 21

Trajectory diagram of UAVs for two-to-two pursuit and escape mission

Fig. 22
figure 22

Trajectory diagram of UAVs for four-to-two pursuit and escape mission

The trajectory diagrams above show that different numbers of UAVs can effectively complete the pursuit tasks against targets with different movement patterns. Target decomposition and task allocation are designed for the pursuing UAVs so that they can form effective sub-teams and capture the escaping UAVs one by one. To address the problem that unreasonable initial values of the neural network weights may lead to local minima, convergence oscillation, or non-convergence, the particle swarm optimization algorithm is combined with the M3DDPG algorithm to search and learn over the experience sample set of the deep neural networks. The particle swarm algorithm first obtains a relatively good solution in the overall optimization, and the gradient descent of the neural network then performs fine-grained optimization, ultimately obtaining the optimal solution.

The improved PSO-M3DDPG algorithm is used for the specific model construction and algorithm design. Through training, the artificial neural network constructed by the improved algorithm commands the pursuing UAV cluster and gradually accomplishes the pursuit of multiple escaping UAVs. The simulations verify the effectiveness of the improved algorithm as the behavioural decision-making unit of the pursuing UAVs in many-to-many pursuit-evasion tasks. Moreover, the improved PSO-M3DDPG algorithm converges faster and produces better decision-making strategies than the original algorithm.

Conclusion

This article studies multi-UAV pursuit-evasion games [21] and improves upon traditional multi-agent cooperative algorithms [22] based on minimax optimization. It adopts a multi-agent adversarial learning approach to solve the minimax objective and combines the PSO algorithm with the M3DDPG algorithm, proposing the PSO-M3DDPG algorithm. The algorithm uses PSO to generate and continuously optimize the experience sample set. Simulation experiments show that, compared with the M3DDPG algorithm, the proposed algorithm converges faster, is more robust, and achieves a higher success rate in pursuit-evasion tasks.

Currently, most reinforcement learning algorithms are limited to small-scale multi-agent environments; when applied to large-scale cluster control problems, they suffer from dimension explosion and extremely high environmental complexity. The population-based optimization characteristics of evolutionary algorithms are expected to alleviate this problem, and future work will focus on a deeper combination of evolutionary algorithms and reinforcement learning.