Autonomous maneuver strategy of swarm air combat based on DDPG

Unmanned aerial vehicles (UAVs) have become significantly important in air combat, where intelligent swarms of UAVs will be able to tackle tasks of high complexity and dynamics. The key to empowering UAVs with such capability is autonomous maneuver decision making. In this paper, an autonomous maneuver strategy for UAV swarms in beyond-visual-range air combat based on reinforcement learning is proposed. First, based on the process of air combat and the constraints of the swarm, the motion model of the UAV and the multi-to-one air combat model are established. Second, a two-stage maneuver strategy based on air combat principles is designed, which includes inter-vehicle collaboration and target-vehicle confrontation. Then, a swarm air combat algorithm based on the deep deterministic policy gradient (DDPG) is proposed for online strategy training. Finally, the effectiveness of the proposed algorithm is validated by multi-scene simulations. The results show that the algorithm is suitable for UAV swarms of different scales.


Introduction
Unmanned aerial vehicles (UAVs), with their low cost, strong mobility, high concealment and no need for pilot control, have been increasingly used to replace manned aircraft in military tasks such as detection, monitoring and target strike, and are a typical representative of "non-contact" combat equipment [1]. Due to the limited mission and combat capability of a single UAV, swarm-based and intelligent unmanned combat has become a research hotspot in recent years. With the increasing operating range of airborne detection equipment, the scope of modern air combat has gradually extended from line of sight to beyond line of sight [2]. Swarm beyond-visual-range air combat involves the situation assessment [3][4][5], environment awareness [6,7] and maneuver strategy [8] of UAVs through sensing or detection equipment, and the maneuver strategy is the basis of the above tasks.
The existing air combat maneuver strategies can be divided into two categories: rule-based and learning-based. Rule-based strategies select actions according to given behavior rules in air combat and do not need online training or optimization. They include the matrix game algorithm [9,10], the expert system [11], the influence diagram method [12,13], the differential game method [14], etc. The matrix game method tends to yield a poor overall strategy. On the one hand, the score function used in this method is difficult to design and cannot accurately describe actual air combat. On the other hand, the method suffers from reward delay in sequential decision making and therefore lacks the ability of long-term planning. Expert experience can hardly cover all air combat situations, so establishing the rule base and constraint conditions with the expert system method is very complex. Moreover, UAVs cannot make decisions independently with the expert system method: when a UAV cannot find an appropriate strategy in the rule base, human intervention must be introduced. In [15] and [16], the multi-level influence diagram realizes multi-to-one air combat, but it is only suitable for small-scale UAV swarms. Moreover, the influence diagram method relies on prior knowledge, and its reasoning process is cumbersome, so it cannot meet the real-time and highly dynamic requirements of air combat. The two methods introduced in [14] cannot be applied to air combat scenes without a model or with incomplete environment information, because they need to accurately model and describe the strategy model. In short, rule-based strategies are monotonous and rigid; they cannot adapt to complex and highly dynamic air combat scenarios or meet the requirements of intelligent operations.
Learning-based strategies optimize the model and structural parameters by means of online learning or training data. They mainly include the artificial immune system, the genetic algorithm, heuristic learning algorithms, the neural network method, deep reinforcement learning (DRL), etc. The artificial immune system [17] uses training data so that the UAV's maneuver system can deal with different air combat situations, but its convergence is slow. The fuzzy tree proposed in [18] is too complex to be used in beyond-visual-range air combat. According to the characteristics of beyond-visual-range air combat, the University of Cincinnati has built the "alpha" intelligent air combat system [19], which uses input data of more than 150 dimensions in a two-to-four air combat scene. The data dimension will increase further if the control of airborne sensors is considered in the future. Therefore, the processing and application of high-dimensional massive data is the main difficulty in multi-UAV autonomous air combat.
Deep learning (DL) performs well in data processing [20][21][22][23], and reinforcement learning (RL) is widely used in autonomous maneuvering [24][25][26]. With the successful application of deep reinforcement learning (DRL) to complex sequential decision making problems such as Alpha-Go [27] and Alpha-Go Zero [28], it has become possible to solve air combat maneuver strategy problems by reinforcement learning. Reinforcement learning is a learning method that interacts with the environment by "trial and error" [29], which makes it a feasible method for autonomous decision making in UAV intelligent air combat maneuver strategy. The application of reinforcement learning in air combat is mainly based on value function search [30,31] and strategy search [32,33]. The deep Q network (DQN) algorithm is improved in [32] to realize close-range one-to-one UAV air combat, but the algorithm uses discrete state and motion spaces, which makes the air combat results quite different from reality. The Actor-Critic (A-C) framework is used in [34] to realize a continuous expression of the UAV maneuver strategy in state space, but the algorithm is only effective in two-dimensional space. In [35], the deep deterministic policy gradient (DDPG) is applied to air combat decision making. However, the design of the reward and punishment function is relatively simple, and it is difficult for agents to learn complex tactics. In general, the above studies improve the effect and ability of air combat maneuver strategy algorithms to a certain extent, but ignore the generalization ability of the algorithm. That is, the existing research on UAV air combat based on reinforcement learning mainly focuses on specific scenarios. Whether for one-to-one air combat or multi-UAV cooperation, the number of UAVs on both sides must be fixed, so compatibility is poor.
To sum up, the strategy of swarm autonomous maneuver in a dynamic environment is an urgent problem to be solved. Because the number of UAVs changes constantly during combat, the amount of information on both sides changes dynamically. The maneuver module should therefore have good compatibility, that is, it should be suitable for swarms of different sizes.
In this paper, an autonomous maneuver strategy for swarm combat in beyond-visual-range air combat based on reinforcement learning is proposed. With this strategy, the autonomous maneuver decision and cooperative operation of the UAVs in a swarm are realized, and the scalability of the algorithm is improved. The main contributions are as follows. First, the overall framework of the autonomous maneuver strategy for a UAV swarm is designed. Second, the swarm beyond-visual-range air combat model is established, the target entry angle and target azimuth are defined, and the situation assessment function is designed. Third, a two-stage maneuver decision strategy is proposed to solve the dimension explosion problem. Two maneuver actions are designed in the first stage to realize rapid collision avoidance and good communication in the swarm. Multi-to-one air combat is transformed into one-to-one air combat in the second stage, which avoids directly dealing with high-dimensional swarm information. Fourth, based on the basic principle of reinforcement learning and the requirements of swarm control, the Actor-Critic network framework is designed, the state space and action strategy are given, and the reward function is designed based on distance to realize rapid convergence of the algorithm. Fifth, based on the memory bank and the target network, an algorithm is designed and the A-C network is trained to obtain an autonomous cooperative maneuvering strategy for the UAV swarm. Simulation experiments are carried out to verify that the algorithm can be applied to UAV swarms of different scales.
The rest of the paper is organized as follows. Section 2 gives the algorithm design framework and establishes the air combat model. Section 3 designs the swarm combat maneuver decision and gives the algorithm flow of swarm air combat based on DDPG. Section 4 conducts the simulation analysis. Section 5 summarizes the paper.

Swarm air combat model

Overall framework
The process of UAV swarm combat includes three modules [36]: environment awareness, situation assessment and maneuver decision, as shown in Fig. 1. Each UAV obtains the current battlefield information through the environment awareness module via data links or airborne sensors. This information mainly includes the states of the enemy target and of the other UAVs. Through the situation assessment module, each UAV assesses its current situation, such as "advantageous" or "disadvantageous", based on which the decision on how to maneuver at the next time step is made. The three modules form a closed loop, which ultimately achieves inter-aircraft cooperation and the air combat tasks.

Air combat model
In this paper, a multi-to-one air combat is studied as shown in Fig. 2, where multiple UAVs are deployed to monitor and attack one enemy target. In the ground coordinate system $Oxyz$ shown in Fig. 3, the $Ox$ axis points east, the $Oy$ axis north, and the $Oz$ axis along the vertical. Denote the $i$-th UAV by $\mathrm{UAV}_i$ and its position at time $t$ by $P_{i,t}$, where $t$ is the discrete time index under a fixed sampling period $T$. The way-point motion model of the UAVs is given by
$$P_{i,t+1} = P_{i,t} + v^U_{i,t}, \tag{1}$$
where $v^U_{i,t}$ is the vector from $P_{i,t}$ to $P_{i,t+1}$, which is regarded as the control input to be designed in the following parts. The position of the target is denoted by $P_{T,t}$.
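As a minimal sketch of the way-point model, the update can be written as a plain vector addition, taking the next position to be the current position plus the per-step displacement $v^U_{i,t}$. The function names and the numerical values are hypothetical and serve only as an illustration.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def step_position(p: Vec3, v: Vec3) -> Vec3:
    """One way-point update: next position = current position + displacement."""
    return (p[0] + v[0], p[1] + v[1], p[2] + v[2])


def rollout(p0: Vec3, controls: Sequence[Vec3]) -> List[Vec3]:
    """Apply a sequence of per-step displacements and return every visited position."""
    path = [p0]
    for v in controls:
        path.append(step_position(path[-1], v))
    return path


# Hypothetical example: three steps heading east while drifting north.
path = rollout((0.0, 0.0, 100.0), [(10.0, 0.0, 0.0), (10.0, 2.0, 0.0), (10.0, 4.0, 0.0)])
```

Because the control input is a displacement per sampling period, the speed limits of a specific UAV translate directly into bounds on the length of each displacement vector.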

Remark 1
In general, the actual position $P_{i,t+1}$ at time $t+1$ may not be exactly equal to $P_{i,t} + v^U_{i,t}$ due to flight interference or model uncertainties, even if $v^U_{i,t}$ is within the physical control limitations. Minimizing the error $\|P^*_{i,t+1} - P_{i,t+1}\|$ given an approximate UAV flight model, where $P^*_{i,t+1} = P_{i,t} + v^U_{i,t}$ is the desired next-time position, is a classical control problem that can be solved by conventional PID or robust control methods. This problem is not the focus of our research, and in this paper it is assumed that $P_{i,t} = P^*_{i,t}$ holds all the time. The control limitations will be considered in practice to make $v^U_{i,t}$ applicable to a specific type of UAV.
In the same coordinate system, a multi-to-one air combat model is established as shown in Fig. 4. Denote by $\phi^U_{i,t}$ the angle between the vectors $P_{T,t} - P_{i,t}$ and $v^U_{i,t}$, which is named the target azimuth, and similarly by $\phi_{T,t}$ the angle between the vectors $P_{i,t} - P_{T,t}$ and $v_{T,t}$, which is named the target entry angle. They can be computed in real time as follows,
$$\phi^U_{i,t} = \arccos \frac{(P_{T,t} - P_{i,t})' \, v^U_{i,t}}{\|P_{T,t} - P_{i,t}\| \, \|v^U_{i,t}\|}, \tag{2}$$
$$\phi_{T,t} = \arccos \frac{(P_{i,t} - P_{T,t})' \, v_{T,t}}{\|P_{i,t} - P_{T,t}\| \, \|v_{T,t}\|}, \tag{3}$$
where "$'$" denotes the transpose operation.
In the situation assessment module, we use the situation assessment function to evaluate the real-time situation of the UAV. Based on the attack model [34] and the evaluation function [37], the effective attack range of a UAV in air combat is a cone with its axis in the direction of $v^U_{i,t}$ and angle $\phi_m$, truncated by a ball of radius $R_W$ as shown in Fig. 4, where $R_W$ represents the attack range of the weapons. Similarly, we can define the cone-shaped attack range of the target. Heuristically, if $\mathrm{UAV}_i$ is in the attack range of the target, $\mathrm{UAV}_i$ is said to be in the "disadvantageous" situation; if the target is in the attack range of $\mathrm{UAV}_i$ while $\mathrm{UAV}_i$ is not in the attack range of the target, $\mathrm{UAV}_i$ is said to be in the "advantageous" situation; otherwise, $\mathrm{UAV}_i$ is said to be in the "balance" situation. The three situation assessment results are denoted by "-1", "1" and "0" respectively, and are defined as follows,
$$K_{i,t} = \begin{cases} -1, & R_{iT,t} < R_W \text{ and } \phi_{T,t} < \phi_m, \\ 1, & R_{iT,t} < R_W,\ \phi^U_{i,t} < \phi_m \text{ and } \phi_{T,t} \ge \phi_m, \\ 0, & \text{otherwise}, \end{cases} \tag{4}$$
where $R_{iT,t} = \|P_{i,t} - P_{T,t}\|$.
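The situation assessment described above can be sketched as follows. This is a minimal illustration that assumes the cone angle $\phi_m$ is measured from the cone axis (a half-angle) and that a point is inside an attack range when both its distance is below $R_W$ and its angle is below $\phi_m$. The function names, the treatment of zero-length vectors and the example values are assumptions for illustration only.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def angle_between(a, b):
    """Angle in radians between vectors a and b.

    Assumption: a zero-length vector yields pi, i.e. "not inside any cone".
    """
    na, nb = norm(a), norm(b)
    if na == 0.0 or nb == 0.0:
        return math.pi
    cos_val = max(-1.0, min(1.0, dot(a, b) / (na * nb)))
    return math.acos(cos_val)


def situation(p_uav, v_uav, p_tgt, v_tgt, phi_m, r_w):
    """Return -1 (disadvantageous), 1 (advantageous) or 0 (balance)."""
    dist = norm(sub(p_uav, p_tgt))
    azimuth = angle_between(sub(p_tgt, p_uav), v_uav)  # target azimuth
    entry = angle_between(sub(p_uav, p_tgt), v_tgt)    # target entry angle
    uav_in_target_range = dist < r_w and entry < phi_m
    target_in_uav_range = dist < r_w and azimuth < phi_m
    if uav_in_target_range:
        return -1
    if target_in_uav_range:
        return 1
    return 0


# Hypothetical example: the UAV is 100 m behind a target flying away from it.
k = situation((0, 0, 0), (1, 0, 0), (100, 0, 0), (1, 0, 0), math.radians(30), 500.0)
```

In this example the target lies on the UAV's heading while the UAV is behind the target, so the result is the advantageous value 1.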

Maneuver strategy for air combat
A two-stage maneuver decision strategy is proposed in this paper. In the first stage, each UAV only considers the interactions between vehicles in the swarm by checking the stable communication and safety constraints. If all the constraints are satisfied, then the UAV moves to the second stage of maneuver decision, in which the UAV only considers the situation between the target and itself. In the following part, we will discuss the two stages separately.
In the first stage, we consider two basic principles [37]: 1) UAVs should keep a safe maneuver distance to avoid inter-vehicle collisions; 2) UAVs should stay in a closed ball to ensure stable communication. Thus, we define two maneuver actions: separating and gathering.
Separating. If the distance between two UAVs is less than a predefined range $d_{\min}$, there is a potential threat of collision. That is, if $\|P_{j,t} - P_{i,t}\| < d_{\min}$, we set
$$v_{i,t} = \lambda_1 (P_{i,t} - P_{j,t}) + \lambda_2 v_{i,t-1}, \tag{5}$$
where $\lambda_1$ and $\lambda_2$ are positive weights, which are selected such that $\lambda_1 \|P_{i,t} - P_{j,t}\| / (\lambda_2 \|v_{i,t-1}\|)$ is small and $v_{i,t}$ satisfies the speed limitations.
Gathering. The swarm should be constrained in a ball of a predefined radius $R_m$. For each $\mathrm{UAV}_i$, if the condition of separating is not satisfied, it computes the center of the ball as the average of all UAV positions excluding itself, i.e.,
$$P_{ci,t} = \frac{1}{n-1} \sum_{j \ne i} P_{j,t}, \tag{6}$$
where $n$ is the total number of UAVs. If the distance between $\mathrm{UAV}_i$ and $P_{ci,t}$ is greater than $R_m$, the UAV needs to approach the center by making a small maneuver. Thus, we set
$$v_{i,t} = \lambda_3 (P_{ci,t} - P_{i,t}) + \lambda_4 v_{i,t-1}, \tag{7}$$
where $\lambda_3$ and $\lambda_4$ are positive weights, and their selection rules are similar to those for $\lambda_1$ and $\lambda_2$.
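The two first-stage actions can be sketched as below. The sketch assumes that each action is a weighted combination of an offset vector and the previous velocity, that separating reacts to the nearest neighbour, and that gathering is triggered when the UAV lies outside the ball of radius $R_m$ around the center of the other UAVs. The function name, the default weights and the example values are hypothetical.

```python
import math


def first_stage_velocity(i, positions, v_prev, d_min, r_m, lam=(0.1, 1.0, 0.1, 1.0)):
    """Return the stage-one velocity of UAV i, or None if no stage-one action is needed."""
    l1, l2, l3, l4 = lam
    p_i = positions[i]
    others = [p for j, p in enumerate(positions) if j != i]
    if not others:
        return None
    # Separating: steer away from the nearest neighbour if it is closer than d_min.
    nearest = min(others, key=lambda p: math.dist(p, p_i))
    if math.dist(nearest, p_i) < d_min:
        return tuple(l1 * (a - b) + l2 * v for a, b, v in zip(p_i, nearest, v_prev))
    # Gathering: steer toward the center of the other UAVs if it is farther than r_m.
    center = tuple(sum(c) / len(others) for c in zip(*others))
    if math.dist(center, p_i) > r_m:
        return tuple(l3 * (c - a) + l4 * v for c, a, v in zip(center, p_i, v_prev))
    return None


# Hypothetical example: UAV 0 is 1 m from UAV 1, so it separates.
v_sep = first_stage_velocity(0, [(0, 0, 0), (1, 0, 0)], (0.0, 1.0, 0.0), d_min=5.0, r_m=100.0)
```

Returning `None` when neither condition holds makes it explicit that the UAV should proceed to the second stage.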
If the UAV does not need to execute the first-stage maneuver actions, it enters the second stage. In the second stage, we consider two principles for combat against the target: 1) the UAV should escape from the attack range of the target if the situation is disadvantageous; 2) the UAV should approach the target if the situation is advantageous. Swarm air combat is thus converted into one-to-one air combat in this stage. We design a DDPG algorithm to realize it and explain it in detail in the following section.

One-to-one Maneuver decision algorithm design
A reinforcement learning algorithm obtains the optimal action strategy $\pi^*$ by finding the optimal action value function [38], where $\gamma$ is the discount factor, which controls the proportion of future rewards in the cumulative reward. Equation (9) amounts to looking up a table to find the optimal strategy in different states, so it is only suitable for discrete spaces. To solve the autonomous decision making problem in the continuous air combat environment, we add neural networks and use the DDPG algorithm to realize continuous motion control of the UAV. DDPG uses an Actor-Critic network to fit the action strategy $\pi$ and the action value $Q^\pi$, with parameters $\theta^\mu$ and $\theta^Q$ respectively. The Actor network generates the maneuver strategy $\pi(s, a, \theta^\mu)$, the Critic network outputs the action value $Q^\pi(s, a, \theta^Q)$ to evaluate $\pi(s, a, \theta^\mu)$, and $\pi(s, a, \theta^\mu)$ is optimized by optimizing $Q^\pi(s, a, \theta^Q)$.
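For reference, the standard textbook form of the optimal action value function and of the greedy strategy derived from it is given below. It is the usual Bellman optimality form, stated here as an assumption about the kind of expression meant by the discussion of table lookup, and the notation $s_t$, $a_t$, $r_t$ follows the state, action and reward symbols used in this section.

```latex
Q^*(s_t, a_t) = \mathbb{E}\left[ r_t + \gamma \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1}) \right],
\qquad
\pi^*(s_t) = \arg\max_{a_t} Q^*(s_t, a_t)
```

The maximization over $a_{t+1}$ is what requires enumerating actions, which is why this form is practical only when the state and action spaces are discrete.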

Network Structure
The input $s_t$ of the Actor network is the vector from $P_{i,t}$ to $P_{T,t}$, and the output $O_a$ is the target speed. $O_a$ and $s_t$ are used as the inputs of the Critic network, and $O_c$ is the evaluation of $O_a$. The DDPG network structure is shown in Fig. 5, and the parameters will be given in the simulation part.

One-to-one Maneuver Design
Maneuver $a_t$ realizes swarm air combat by controlling $v^U_{i,t}$, which is generated by the Actor network. To let the agent explore the environment and optimize $a_t$, the noise $N$ generated by an OU process [39] is usually added to $O_a$ to synthesize $a_t$, which takes part in network training. The expressions of $a_t$ and $N$ are given in (13) and (14) respectively, where $\theta$ and $\sigma$ are weights, $\mu$ represents the mean value, and $W$ is Gaussian noise. $\xi$ is an exploration coefficient that decreases as the number of training rounds increases. The OU process can make exploration more efficient.
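A minimal sketch of OU-based exploration is given below. It assumes the common discrete-time form of the OU process and assumes that the exploration coefficient $\xi$ scales the noise before it is added to the Actor output. The class name, the default values of $\theta$ and $\sigma$ and the exponential decay schedule are illustrative assumptions and are not taken from the paper.

```python
import math
import random


class OUNoise:
    """Discrete-time Ornstein-Uhlenbeck process (assumed form):
    N <- N + theta * (mu - N) * dt + sigma * sqrt(dt) * W, with W ~ N(0, 1)."""

    def __init__(self, dim, theta=0.15, sigma=0.2, mu=0.0, dt=1.0, seed=None):
        self.theta, self.sigma, self.mu, self.dt = theta, sigma, mu, dt
        self.rng = random.Random(seed)
        self.state = [mu] * dim

    def sample(self):
        self.state = [
            x + self.theta * (self.mu - x) * self.dt
            + self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
            for x in self.state
        ]
        return list(self.state)


def explore(o_a, noise, xi):
    """Synthesize the action from the Actor output and the scaled noise."""
    return [o + xi * n for o, n in zip(o_a, noise)]


def exploration_coeff(episode, xi0=1.0, decay=0.995):
    """Hypothetical schedule: xi shrinks geometrically with the training round."""
    return xi0 * decay ** episode


ou = OUNoise(dim=3, seed=0)
a_t = explore([10.0, 0.0, 0.0], ou.sample(), exploration_coeff(episode=100))
```

Because successive OU samples are correlated, the perturbation pushes the UAV in a consistent direction over several steps, which suits motion control better than independent noise at each step.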

Reward Function
The reward function designed in this paper, (15), works as follows: when the UAV approaches $P_{T,t}$, that is, $\|s_{t+1}\| < \|s_t\|$, a positive reward is given; otherwise, a penalty proportional to the distance is given. With (15) as the reward function, regardless of the initial positions of the UAVs and the enemy, the UAVs will eventually fly to the rear of the target to attack it.
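The distance-based reward described above can be sketched as follows. Only the structure is taken from the text (a positive reward when the UAV gets closer, otherwise a penalty proportional to the distance). The constants `r_pos` and `k` are hypothetical placeholders, and the sketch does not reproduce the exact expression (15).

```python
import math


def reward(s_t, s_next, r_pos=1.0, k=0.01):
    """Distance-based reward.

    s_t and s_next are the vectors from the UAV to the target before and
    after the move. r_pos and k are hypothetical constants.
    """
    d_now = math.sqrt(sum(x * x for x in s_t))
    d_next = math.sqrt(sum(x * x for x in s_next))
    if d_next < d_now:
        return r_pos          # the UAV approached the target
    return -k * d_next        # penalty grows with the remaining distance


r_closer = reward((10.0, 0.0, 0.0), (5.0, 0.0, 0.0))
r_farther = reward((5.0, 0.0, 0.0), (10.0, 0.0, 0.0))
```

A reward of this shape is dense: every step gives feedback, which is consistent with the stated aim of rapid convergence.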
The algorithm flow is shown in Algorithm 1. According to $P_{i,t}$ and $P_{T,t}$, the current state $s_t$ is determined. The online Actor network generates $O_a$ according to the network parameter $\theta^\mu$. The OU process generates $N$, which is added to $O_a$ to synthesize the action $a_t$. The UAV moves to the next state and receives a reward $r_t$. The transition $(s_t, a_t, r_t, s_{t+1})$ is stored in the memory bank. The UAV repeats the above steps many times to collect a large number of samples. When the number of stored samples meets the requirement, the A-C network is trained: a batch of samples $\{(s_i, a_i, r_i, s_{i+1})_m \mid m = 1, 2, \ldots, M\}$ is randomly selected from the memory bank to calculate the target value $y_i$,
$$y_i = r_i + \gamma Q'\left(s_{i+1}, O_a'(s_{i+1}; \theta^{\mu'}); \theta^{Q'}\right), \tag{16}$$
where $\gamma$ is the decay rate, with the same meaning as in (12), $O_a'(s_{i+1}; \theta^{\mu'})$ is the output of the target A network, representing the target action, $Q'(s_{i+1}, O_a'(s_{i+1}; \theta^{\mu'}); \theta^{Q'})$ is the output of the target C network, representing the expected value of taking action $O_a'(s_{i+1})$ in state $s_{i+1}$, and the parameters of the target A and C networks are $\theta^{\mu'}$ and $\theta^{Q'}$.
The gradient descent method is used to minimize (17) and optimize the parameters of the Critic network. (18) and (19) are the loss function and the parameter optimization equation of the Actor network. The parameters of the target networks are updated by (20). DDPG uses a soft update method to update the network parameters: each time, the parameters are changed only slightly, which makes the learning process more stable. As the number of training rounds increases, the agent's maneuver selection in different states tends to be optimal.

Algorithm 1
Initialize the initial state of air combat
Receive initial observation state $s_1$
for $t = 1, 2, \ldots, T$ do
    Online A network generates $O_a$; generate noise $N$ using (14) and get $a_t$ using (13)
    Execute action $a_t$ and observe reward $r_t$ and new state $s_{t+1}$
    Store transition $(s_t, a_t, r_t, s_{t+1})$ in memory bank
    Sample a random minibatch $\{(s_i, a_i, r_i, s_{i+1})_m \mid m = 1, 2, \ldots, M\}$ from memory bank
    Get the target value $y_i$ using (16)
    Update parameters of online C network by minimizing (17)
    Update the online Actor policy using (19)
    Update target network parameters using (20)
end for
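The memory bank, the target value and the soft update can be sketched without a deep learning library. The networks are represented here by plain Python callables and their parameters by flat lists of floats. The squared-error form of the critic loss, the soft update rate `tau` and all names are assumptions in the style of standard DDPG, not the paper's exact expressions.

```python
import random
from collections import deque


class MemoryBank:
    """Fixed-capacity store of transitions (s, a, r, s_next)."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest samples are dropped first
        self.rng = random.Random(seed)

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, m):
        return self.rng.sample(list(self.buffer), m)

    def __len__(self):
        return len(self.buffer)


def td_target(r, s_next, target_actor, target_critic, gamma=0.99):
    """Target value: reward plus the discounted target-critic value of the target action."""
    return r + gamma * target_critic(s_next, target_actor(s_next))


def critic_loss(batch, critic, target_actor, target_critic, gamma=0.99):
    """Mean squared error between target values and online critic values (assumed form)."""
    errors = [
        (td_target(r, s2, target_actor, target_critic, gamma) - critic(s, a)) ** 2
        for s, a, r, s2 in batch
    ]
    return sum(errors) / len(errors)


def soft_update(target_params, online_params, tau=0.01):
    """Move each target parameter a small step toward its online counterpart."""
    return [tau * w + (1.0 - tau) * wt for w, wt in zip(online_params, target_params)]


bank = MemoryBank(capacity=2, seed=0)
for k in range(3):
    bank.store((float(k),), (0.0,), 1.0, (float(k + 1),))
```

With a small `tau`, the target networks change slowly, which keeps the target value from chasing the online networks and is the stabilizing effect the soft update is meant to provide.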

Multi-to-one Maneuver strategy design
When (17) is close to 0 or shows no obvious change, the training should be stopped. After the trained neural network is saved, the one-to-one air combat maneuver decision can be obtained. In the following part, we use the DDPG algorithm and the two maneuver actions to realize swarm multi-to-one air combat, and the process is given in Algorithm 2.
In Algorithm 2, $R_{ij,t}$, $P_{T,t}$, $K_{i,t}$ and $P_{ci,t}$ are used as inputs to determine whether the UAV needs to execute "separating" or "gathering". If so, $v^U_{i,t}$ is obtained through (5) or (7); otherwise, Algorithm 1 is entered to obtain $v^U_{i,t}$.
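The two-stage selection for a whole swarm can be sketched as a single step function. The first-stage rule and the trained one-to-one policy are passed in as callables, so the sketch makes no claim about their internals. The names `swarm_step`, `stage_one` and `policy`, as well as the stub functions in the example, are hypothetical.

```python
def swarm_step(positions, velocities, target, stage_one, policy):
    """One decision step for every UAV in the swarm.

    stage_one(i, positions, v_prev) returns a velocity, or None when neither
    separating nor gathering is required. policy(state) maps the vector from
    the UAV to the target to a velocity (the one-to-one maneuver decision).
    """
    new_velocities = []
    for i, p in enumerate(positions):
        v = stage_one(i, positions, velocities[i])
        if v is None:
            # Second stage: the UAV only looks at its own vector to the target.
            state = tuple(t - x for t, x in zip(target, p))
            v = tuple(policy(state))
        new_velocities.append(tuple(v))
    new_positions = [
        tuple(x + dx for x, dx in zip(p, v)) for p, v in zip(positions, new_velocities)
    ]
    return new_positions, new_velocities


# Hypothetical stubs: UAV 0 must hold still, all others fly half-way to the target.
def stub_stage_one(i, positions, v_prev):
    return (0.0, 0.0, 0.0) if i == 0 else None


def stub_policy(state):
    return tuple(0.5 * x for x in state)


pos, vel = swarm_step(
    [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
    (20.0, 0.0, 0.0),
    stub_stage_one,
    stub_policy,
)
```

Since the policy input has a fixed dimension regardless of how many UAVs are present, the same trained network can be reused when the swarm size changes.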

Algorithm 2 (final steps)
    end for
    Update the position of all UAVs and target
end while

Limited by the UAV maneuverability, the $v^U_{i,t}$ given by the two maneuver actions or by the neural network may be unreachable for the UAV. Assume that the angle between $v^U_{i,t}$ and $v^U_{i,t-1}$ is $\varphi$ and that the maximum turning angle of the UAV is $\alpha$, as shown in Fig. 3. The maximum speed of the UAV is $V_{\max}$, and the minimum speed is $V_{\min}$. The actual $v^U_{i,t}$ is as shown in (22), and (1) is then used to update the position. When $\varphi \le \alpha$, $v^U_{i,t}$ is given by (21) and (22) to meet the speed limits; otherwise, it is calculated by (22) and (23), where $\beta$ is the complement of the angle between $v^U_{i,t}$ and $v^U_{i,t} - v^U_{i,t-1}$.

Simulation setup
The default parameters of the simulation experiment are shown in Table 1. The velocity of our UAV and of the target refers to the displacement per unit step time in the simulation system. The enemy plane moves along a straight line or a curve at constant speed, while our UAV moves at variable speed, which is determined by the output of the neural network. The positions of our UAVs are initialized randomly. The structure of the A-C network is shown in Fig. 5. The Actor network consists of two layers with 400 and 800 nodes. In the Critic network, the common layer that receives the input information has 800 nodes, and the other layers have 400 nodes. After unit conversion, a real combat environment of 50 km × 50 km × 50 km is simulated. The maximum turning angle of the UAV is 60°, the maximum speed is 180 km/h, the maximum communication radius is 20 km, and the maximum attack distance is 500 m.
To test the effectiveness of the proposed algorithm, we define the following three metrics: compatibility, $\phi^U_{i,t}$ and $R_{iT,t}$. Compatibility refers to whether the algorithm can be extended from one-to-one air combat to multi-to-one air combat. $\phi^U_{i,t}$ measures the change of $K_{i,t}$ and the rationality of the UAV's maneuvering in air combat. $R_{iT,t}$ reflects whether the algorithm converges.

Simulation results
The fixed-point arrival capability of the UAV is simulated and verified first, as shown in Fig. 6. Given the target point, whether in a single-UAV or a multi-aircraft environment, our aircraft can independently plan a path to the target point through the DDPG algorithm.
For a dynamic target, the combat effect of a single UAV is verified first, with the enemy plane in uniform straight-line or circular motion. As shown in Fig. 7, our UAV can monitor the position of the target in real time and track it accurately. The change of the distance and angle between the UAV and the target is shown in Fig. 8.
In the early stage of the simulation, the angle between the enemy and the UAV is obtuse, so the UAV is in an inferior situation. The UAV keeps its distance from the enemy and, at the same time, quickly adjusts its azimuth, so its situation changes from inferior to advantageous. Then the UAV constantly approaches the target and adjusts the azimuth angle in real time to keep it within the maximum attack angle. Finally, the UAV can lock the target for more than 2 s to complete the tracking and attacking of the target. Next, the effect of multi-aircraft cooperative combat is verified, with the target still in uniform straight-line or circular motion. The results of swarm cooperative combat are shown in Figs. 9 and 10.
Compared with Fig. 8, because factors such as gathering and separating need to be considered in the swarm, the UAVs cannot all approach the target point at the same time, so some UAVs are close to the target and others are far from it. In the process of tracking, each UAV constantly adjusts its azimuth angle, and the azimuth angles of the aircraft close to the target are kept within the maximum attack range, showing a dominant situation. It can also be seen from Fig. 10 that all UAVs are kept within the communication range from the swarm center.
In the above simulation process, each simulation step corresponds to one decision, which is converted to 1 s in the real environment. Therefore, the simulation steps on the abscissa in Figs. 8 and 10 can be regarded as time.

Conclusion
A maneuver strategy based on the DDPG algorithm is proposed to realize UAV swarm combat. Based on the swarm framework, the air combat model and behavior set are designed for the UAV to realize autonomous decision making. According to the characteristics of the DDPG algorithm and the task requirements, the distance between the UAV and the target point and the velocity of the UAV are taken as the input and output of the Actor network, and the reward function is constructed from the relative distance to train the neural network parameters, so that the network converges quickly. A visual simulation environment is built to verify the application effect of the algorithm. The results show that the swarm maneuver strategy based on deep reinforcement learning can complete the attack on the target on the premise of clear tasks. Whether the target moves along a straight line or a curve, the strategy performs well in simulation: our UAVs can lock the target within the attack range and hold it for a certain period of time, which gives the strategy a certain practical value.