UAV Intelligent Coverage Navigation Based on DRL in Complex Geometrical Environments

Unmanned aerial vehicles (UAVs) are among the preferred tools for coverage detection missions because of their maneuverability and flexibility. However, it is challenging for a UAV to decide its own track in a complex geometrical environment. This paper presents a UAV intelligent navigation method based on deep reinforcement learning (DRL). We propose using a geographic information system (GIS) as the DRL training environment to overcome the inconsistency between the training environment and the test environment, and we save the flight path in the form of an image. The combination of the knowledge-based Monte Carlo tree search method and a local search method not only effectively avoids falling into local optima, but also ensures learning the optimal search direction under limited computing power. Experiments show that the trained UAV can find an excellent flight path by intelligent navigation and make effective flight decisions in a complex geometrical environment.


Introduction
UAVs have the advantages of small size, low cost, convenient use, minimal environmental requirements, and flexible aerial surveillance views over a wide area. Thus, they are employed in surveillance, search, rescue, wildlife monitoring, border patrol, and other fields [1][2][3][4]. A UAV's sensors, such as the camera, are susceptible to interference in complex geometrical environments. The primary challenge in a coverage mission is planning a UAV path that effectively covers the given region [5]. The challenges include the following.
Coverage sensing quality Many studies on UAV coverage problems assume that the given area is ideal flat terrain [6]. In reality, most target terrain is rugged, and the image quality obtained by traditional methods is uneven [5,7]. Suppose the UAV flies on a horizontal plane, as shown in Fig. 1. The resolution of photos taken by the UAV's camera sensors in red areas is high, but the resolution of photos taken in blue areas is low. The UAV needs to adjust its altitude in real time to obtain good-quality terrain photos, and all kinds of obstacles may occlude the camera's view. To meet sensing resolution requirements, the UAV should be able to vary its flying height, which requires optimizing path planning in the 3D domain [8,9].
Energy constraint and time constraint The UAV should travel through the waypoints under a time constraint [10,11]. Searching for the optimal UAV path that meets the time and energy constraints is a non-deterministic polynomial-time hard (NP-hard) problem, so existing UAV path planning algorithms usually search for a near-optimal path with comparable cost and much less search time. Finding an optimal flight path has factorial time complexity (O(n!)) [12], where n represents the number of alternative flight path points.
Intelligent real-time navigation UAV intelligent navigation means that the UAV can make flight decisions itself based on the environment and the coverage task. In recent years, deep reinforcement learning (DRL) methods have been applied to intelligent path planning problems [13,14]. By taking a depth image as the input and control commands as the output, the robot moves and tries to find a suitable path. However, most RGB-D cameras function in a limited range and cannot achieve satisfactory navigation when used as the only sensor over long distances [15]. DRL methods memorize the maps at the training stage, not at the testing stage [16]. In other words, finding a path in a way that requires flying repeatedly in the real world to learn a strategy is unrealistic.
This paper proposes a sensing quality-aware coverage and intelligent path planning solution for UAV monitoring of geometrically complex scenes with varying altitudes and occlusions. The goal is to provide overview images of a target area with satisfactory spatial and temporal resolution under the UAV's energy limits. The main contributions of this paper are as follows: (1) we develop a DRL framework for UAV navigation in large-scale complex environments. We use terrain maps (for instance, the Google terrain map, which is very close to the real UAV flying environment) as models for DRL. The DRL framework makes it easier to find the globally optimal flight path; after learning and training in the GIS, the UAV can make autonomous flight decisions. (2) The deep neural network (DNN) has a good ability to understand and analyze images. We take the route the UAV has already passed as the current agent observation, and we use a projection method to transform the 3D path into a 2D image. (3) We propose the terrain knowledge-based fast evolutionary MCTS (TK-MCTS) method. TK-MCTS uses terrain information to guide the UAV to search unexplored areas with higher probability. (4) According to the characteristics of the terrain coverage task, we combine local terrain exploration with global exploration. In this way, the UAV can avoid being trapped at a local optimum while also reducing global estimation inaccuracy [17].

2D Coverage Path Planning
Early topographic coverage studies assume that target area terrains are ideal planes [18]. 2D path planning models and strategies are widely used, such as the spiral model, the spiral-like model, the Lawnmower model, the Zamboni model, the Dubins path model, and the modified Lawnmower/Zamboni path planning strategy [19]. However, most terrain is rugged, and obstacles affect the coverage of UAV camera sensors. If we only consider terrain coverage in 2D space, it is difficult to guarantee sensing quality.

Traditional 3D Coverage Path Planning
With the improvement of computing power, the latest research focuses on 3D space coverage path planning. Dai et al. [8] indicate that more images must be taken than in the ideal flat terrain case to achieve full coverage of all the spots inside a target area. During a coverage mission, [20] plans information-rich trajectories in continuous 3D space by building terrain maps online to optimize the initial solutions obtained. Scott et al. [21] propose an occlusion-aware UAV coverage technique that finds the best set of waypoints for taking pictures in a target area; the selected waypoints are then assigned to the UAV by solving a vehicle routing problem (VRP). Based on matrix completion, [22] first selects the dominator sampling points and then the virtual dominator sampling points; finally, an optimal simulated annealing algorithm plans the UAV path based on the selected sampling points. In [23], the authors propose a coverage algorithm based on hexagonal tiling of the target region, but this method cannot be extended to 3D space. In [24], the authors develop a general preference-based multi-objective evolutionary algorithm that converges to preferred solutions, where the preferences of a decision maker are elicited through reference point(s); however, this method cannot effectively deal with environments where obstacles exist. Zhang et al. [25] exploit a newly defined individual cost matrix, which leads to an efficient multi-UAV path planning algorithm; however, this algorithm is prone to failure in relatively complex terrain environments. In [26], an integer linear programming formulation of the coverage path planning problem is shown to provide almost optimal strategies at a fraction of the computational cost of brute-force methods. Bircher et al. [27] compute short inspection paths via an alternating two-step optimization algorithm, in which viewpoints are first found and then connected to form links; the path obtained this way is often not optimal.

Intelligent 3D Path Planning
DNNs have excellent learning and memory abilities. Recent improvements in DRL have allowed solving problems in many 2D domains, such as Atari games [28,29]. Wang et al. [30], based on the deep Q-network framework, take the raw depth image as the only input to estimate the Q values corresponding to all moving commands. By combining deep learning (DL) with reinforcement learning (RL), complex 3D terrain path planning has achieved better results than before. In recent years, DRL methods have been used in navigation [31,32] and path planning [33][34][35] applications. In [36], a DNN is used for real-time path planning, with a focus on avoiding collisions with obstacles. For planning for autonomous unmanned ground vehicles, [37] proposes constrained shortest path search with graph convolutional neural networks (CNNs).

DRL GIS Training Environment
DRL algorithms have demonstrated progress in learning to find a goal in challenging environments. The experiments in [16] show that DRL algorithms memorize the maps at the training stage rather than at the testing stage. We propose using GIS as the DRL training environment to overcome the inconsistency between the training environment and the test environment [38]. The UAV is trained in GIS and acquires the ability to make intelligent decisions, so that it can make accurate decisions in the real world. The steps to establish the GIS terrain training environment include terrain sampling, waypoint generation, and visibility analysis.

Complex GIS Terrain Sampling
We transform the terrain into a discrete set of geographic coordinate points B = {b_1, ..., b_n}, b_i = (x_i, y_i, z_i), where x_i denotes longitude, y_i denotes latitude, and z_i denotes altitude. We sample the 2D coordinates on the ground plane along longitude and latitude with a step size of B_s, and obtain the terrain elevation of each sampling point.
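The sampling step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `elevation` function is a toy analytic surface standing in for a real GIS elevation query (e.g. to Cesium), and the coordinate ranges are arbitrary.

```python
# Sketch of terrain sampling into a discrete coordinate set
# B = {b_1, ..., b_n}, b_i = (x_i, y_i, z_i).
import math

def elevation(x, y):
    """Toy stand-in for a GIS elevation query (metres)."""
    return 1000.0 + 200.0 * math.sin(x) * math.cos(y)

def sample_terrain(x0, x1, y0, y1, step):
    """Sample the ground plane on a regular grid of spacing `step`
    and attach the elevation of each sampling point."""
    points = []
    x = x0
    while x <= x1 + 1e-9:
        y = y0
        while y <= y1 + 1e-9:
            points.append((x, y, elevation(x, y)))
            y += step
        x += step
    return points

B = sample_terrain(0.0, 1.0, 0.0, 1.0, 0.5)  # 3 x 3 grid of samples
```

In practice the elevation lookup would be replaced by the GIS platform's query at each (longitude, latitude) grid point.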

UAV Waypoint Generation and Visibility Analysis
Waypoints (photograph locations) can be generated above each terrain sampling point at different altitudes [8]. The collection of optional waypoints is the set R = {p_1, ..., p_m}. Our goal is to find an ordered set A ⊆ R; a covering flight path l is formed when the UAV flies along this ordered set of geographical coordinate points. The problem of measuring the visible range of a point on GIS can be transformed into a two-point visibility judgment problem. We use the interpolation visibility analysis method to judge whether two points are mutually visible.
The details of the interpolation visibility analysis method are shown in Algorithm 1. GIS can calculate the elevations (lat_1, long_1, e_1), ..., (lat_n, long_n, e_n) according to latitude and longitude, where e_i represents the elevation data obtained by GIS computation.
When the interpolation density reaches a sufficient number of points, we can effectively judge whether a and b are visible along the line of sight, as shown in Fig. 2. In the horizontal direction, interpolation is performed every 5 m along the line segment l. In the example of Fig. 2, an interpolated terrain point rises above the sight line; therefore, C cannot be seen from A.
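The interpolation visibility test can be sketched as below. This is an assumption-laden illustration of Algorithm 1, not its exact code: it assumes two points are mutually visible iff every interpolated point on the segment between them lies above the terrain surface, and the `terrain` callback stands in for the GIS elevation computation.

```python
# Sketch of interpolation visibility analysis: interpolate every
# `step` metres along the horizontal projection of segment a-b and
# check the sight line against the terrain elevation.

def visible(a, b, terrain, step=5.0):
    """a, b: (x, y, z) points; terrain(x, y) -> ground elevation."""
    (xa, ya, za), (xb, yb, zb) = a, b
    horiz = ((xb - xa) ** 2 + (yb - ya) ** 2) ** 0.5
    n = max(1, int(horiz / step))
    for k in range(1, n):
        t = k / n
        x, y = xa + t * (xb - xa), ya + t * (yb - ya)
        z_sight = za + t * (zb - za)       # height of the sight line
        if terrain(x, y) >= z_sight:       # terrain occludes the line
            return False
    return True

flat = lambda x, y: 0.0                     # open plain: always visible
hill = lambda x, y: 50.0 if 40 <= x <= 60 else 0.0  # ridge in the middle
```

With the `hill` terrain, the interpolated point at x = 50 m rises above a sight line at 10 m altitude, so the endpoints are judged mutually invisible.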

UAV Intelligent Coverage Navigation Based on DRL
We present a UAV intelligent navigation method based on DRL.
The UAV is an agent in the DRL structure. We use a four-rotor UAV to perform the coverage mission. The UAV continuously improves its strategy to maximize the coverage task's cumulative reward by collecting samples (states, actions, and rewards) from its interactions with complex terrain environments. We use a neural network to approximate the E(l) function: the state is input to the neural network, which outputs the UAV flight action values. At each discrete time step, the UAV selects an action from the action space (up, down, latitude+, latitude−, longitude+, and longitude−). We use the path the UAV has already flown as the current UAV state, which is the input state of the neural network.
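The six-action space above can be written down as a small sketch. The unit displacement per action and the (longitude, latitude, altitude) grid encoding are illustrative assumptions, not details from the paper.

```python
# Hypothetical encoding of the six discrete flight actions as unit
# moves on a (longitude, latitude, altitude) grid.

ACTIONS = {
    "up":         (0, 0, +1),
    "down":       (0, 0, -1),
    "latitude+":  (0, +1, 0),
    "latitude-":  (0, -1, 0),
    "longitude+": (+1, 0, 0),
    "longitude-": (-1, 0, 0),
}

def step(waypoint, action):
    """Apply one discrete action to a (lon, lat, alt) grid waypoint."""
    dx, dy, dz = ACTIONS[action]
    x, y, z = waypoint
    return (x + dx, y + dy, z + dz)
```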

UAV Coverage Navigation State
We save the flight path in the form of an image and use the image as the input of the DNN. The DNN has an excellent ability to analyze and understand data in image form: if the UAV path information is converted into an image, the DNN performs better at feature extraction and classification. We propose the elevation compression method to convert 3D GPS path data into a 2D image.
To illustrate the algorithm, we take Fig. 3 (converting 3D GPS path data into a 2D image) as an example. Suppose Fig. 3a is the initial state. With the elevation compression method, we get a two-dimensional map of the path, Fig. 3e. Starting from Fig. 3a, after 20 actions, the status of the UAV is shown in Fig. 3b, and Fig. 3f is its two-dimensional map. Figure 3c, d shows the UAV path at further intervals of 20 actions, and Fig. 3g, h is the corresponding mapping of Fig. 3c, d. We see that as the path grows, the total gray area decreases. The DNN can effectively recognize the change in image grayscale and distinguish the path.
The elevation compression method is shown in Algorithm 2. Step 2 stores the waypoints in the 3D matrix M, which is then compressed along the elevation dimension into a 2D image. During TK-MCTS simulation, each candidate direction i is scored as score_i = (n_i + 1) × rand(0, 1) × λ, where n_i is the number of times direction i has been simulated and λ ∈ (0, 1]. If the vector s_i points in the same direction as the goal vector s_g_i on the x, y, or z axis, then λ ∈ (0, 1); otherwise, λ = 1. The direction with the minimum score is chosen as the simulation node.
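The elevation compression idea can be sketched as follows. The specific encoding rule used here (keep the highest visited altitude per ground cell, mapped to a gray level) is an assumption chosen to match the description of decreasing gray area as the path grows, not the paper's exact Algorithm 2.

```python
# Sketch of elevation compression: project a 3D path onto a 2D
# grayscale image, encoding altitude as pixel intensity.

def compress_path(path, size, n_levels):
    """path: list of (x, y, z) grid waypoints already flown.
    Returns a size x size grayscale image (0.0 = unvisited)."""
    img = [[0.0] * size for _ in range(size)]
    for x, y, z in path:
        gray = (z + 1) / n_levels          # map altitude level -> (0, 1]
        img[y][x] = max(img[y][x], gray)   # keep highest altitude seen
    return img

# Three waypoints over two ground cells, at altitude levels 0, 3, 1.
img = compress_path([(0, 0, 0), (1, 0, 3), (1, 0, 1)], size=4, n_levels=4)
```

Each cell of the 2D image thus summarizes the flown path over that ground position in a form a CNN can consume directly.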

UAV Coverage Navigation Reward Based on TK-MCTS
The specific algorithm is Algorithm 3. According to the results of flight simulation in different directions, we set the reward value of the direction with the best simulation result to 1 and that of the other directions to 0.

UAV Coverage Navigation DRL Implementation
The core of reinforcement learning is the discovery of the optimal action-value function Q*(s, a) by maximizing the expected return starting from state s and taking action a: Q*(s, a) = max_π E[R_t | s_t = s, a_t = a, π], where R_t is the total future reward until termination and π is the DRL policy. With the future reward discount factor γ, the total future estimated reward is R_t = Σ_{k=0} γ^k r_{t+k}. The essential assumption in DRL is the Bellman equation, which transfers the target to maximizing the value of r + γ max_{a'} Q*(s', a'): Q*(s, a) = E_{s'}[r + γ max_{a'} Q*(s', a')], where s' is the next state. DRL estimates the action-value function with a convolutional neural network with weights θ, so that Q(s, a; θ) ≈ Q*(s, a). With training batch size b, the loss function is L(θ) = (1/b) Σ_{k=1}^{b} (y_k − Q(s_k, a_k; θ))², where y_k is the target output of the evaluation network, calculated from the estimated future expectation. If a sampled transition is not a flyable sample, the evaluation for that (s_k, a_k) pair is set to the termination reward r_ter.
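The Bellman target and the batch loss can be sketched numerically. For clarity this uses a plain-Python Q table in place of the network; `gamma` is the discount factor and `r_ter` the termination reward for non-flyable transitions, as described above.

```python
# Sketch of the DQN target y = r + gamma * max_a' Q(s', a') and the
# mean-squared-error loss (1/b) * sum_k (y_k - Q(s_k, a_k))^2.

def td_target(r, s_next, q_next, gamma, terminal, r_ter=None):
    """For a non-flyable (terminal) transition the evaluation
    collapses to the termination reward."""
    if terminal:
        return r_ter if r_ter is not None else r
    return r + gamma * max(q_next[s_next].values())

def batch_loss(batch, q, q_next, gamma):
    """batch: list of (s, a, r, s_next, terminal) transitions."""
    total = 0.0
    for (s, a, r, s_next, terminal) in batch:
        y = td_target(r, s_next, q_next, gamma, terminal)
        total += (y - q[s][a]) ** 2
    return total / len(batch)
```

In the real implementation `q` and `q_next` would be forward passes of the evaluation and target networks rather than dictionaries.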
In this paper, the shape of the input terrain image is 10 × 10 × 3. The structure is depicted in Fig. 4. We use two 3D convolutional layers to extract the path image features, applying 10 × 10 × 4 (number of channels = 4) and 10 × 10 × 64 (number of channels = 64) convolution filters to the images. The role of the convolution layer is local perception: each feature in the picture is first perceived locally, and the local results are then combined at a higher level to obtain global information. Each convolution layer's output is fed into a pooling layer, whose main role is to reduce the feature dimension. In Fig. 4, long short-term memory (LSTM) not only solves the problem that RNNs cannot handle long-distance dependence but also avoids gradient explosion and gradient vanishing. We use an additional two fully connected layers for exploration policy learning. Finally, the neural network outputs the UAV action values. Each convolutional or fully connected layer is followed by a rectified linear unit (ReLU) activation layer to increase non-linearity. The number under each layer is the number of output channels of the cubes.
Algorithm 4 shows the workflow of our revised DRL process. We set the number of iterative rounds (episodes) to M. As shown in steps 4-16, the agent performs an action a_t in state s_t, obtains the next state s_{t+1} and the current reward r_t, and adds the tuple (s_t, a_t, r_t, s_{t+1}) to the experience pool D. It then takes m random samples (s_i, a_i, r_i, s_{i+1}) from the experience pool D, where i = 1, 2, 3, ..., m, and calculates the target value y_i. Step 20 uses the mean square error loss function to update the Q-network parameters. We use the memory replay method and the ε-greedy training strategy to control the dynamic distribution of training samples. At the beginning of every repeated exploration loop, the UAV is set to a random start point.
This extends the randomization of the UAV locations to the whole simulation world and keeps the data distribution saved in memory replay diverse for training.
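The memory-replay and ε-greedy machinery described above can be sketched in a few lines. The buffer capacity and the names here are illustrative; the real training loop would sit around a TensorFlow Q-network.

```python
# Sketch of memory replay (storage-sampling to break data correlation)
# and epsilon-greedy action selection.
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)   # oldest samples fall out

    def add(self, transition):              # (s, a, r, s_next)
        self.buf.append(transition)

    def sample(self, m):
        return random.sample(self.buf, min(m, len(self.buf)))

def epsilon_greedy(q_values, actions, epsilon, rng=random):
    """With probability epsilon explore a random action, else exploit
    the action with the highest estimated Q value."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q_values[a])

mem = ReplayMemory(capacity=3)
for t in range(5):
    mem.add((t, "up", 0.0, t + 1))          # buffer keeps only last 3
```

Sampling uniformly from the buffer decorrelates consecutive transitions, which is the point of replay in the training procedure above.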

Terrain Data Sampling and Visibility Analysis
On the Cesium platform [39], we chose the geographic region N 86.8°-87.0°, E 27.5°-28.02° as the area for the terrain coverage task. We sample the 2D coordinates along the longitude and latitude axes on the ground plane with a step size of 0.628 km, and obtain the elevation corresponding to each terrain sampling point from Cesium. Visibility analysis is performed for each optional waypoint: on Cesium, we use Algorithm 1 (the interpolation visibility analysis method) to obtain the visibility set v_i of each optional waypoint p_i. We randomly pick 10 waypoints, as shown in Fig. 5. The average visual distance from the optional waypoints to the whole set of terrain sampling points is more than 2000 m. We set the effective visibility distance threshold to 1500 m; each optional waypoint can then only see a small number of terrain sampling points.

The TK-MCTS Performance
We use different algorithms in the same flight terrain space to verify the advantages of the TK-MCTS algorithm, comparing the total number of searches and the number of valid searches with the exhaustive method and the traditional MCTS algorithm [40,41]. As we can see from Fig. 6(a), the exhaustive method performs a broader range of searches in the same time frame; however, successful full-coverage terrain path searches account for only 0.0388% of its total searches. The traditional MCTS search has the smallest search range in the same period, yet it successfully finds the full-coverage terrain path more often than the exhaustive method. Although the TK-MCTS method does not perform the most searches in the same time frame, it is the most efficient at finding the full coverage path: successful full-coverage terrain path searches account for 47.29% of its total searches.
The space complexity of Algorithm 4 is closely related to the neural network layers, so we analyze it layer by layer. Assume that the convolution kernel size is H × W, the input channel count is I, and the output channel count is O. Each filter in the convolutional layer spans H × W × I weights, maps to one of the O output channels, and carries one bias; therefore, the total number of parameters is (H × W × I + 1) × O. The pooling layer is a fixed operation with no weighting factors. A fully connected layer with n inputs and m outputs has (n + 1) × m parameters. The LSTM maintains four sets of parameters, corresponding to the input gate, output gate, forget gate, and candidate state; therefore, its total number of parameters is 4 × (n_hidden × m_input + n_hidden² + n_hidden), where n_hidden is the hidden size and m_input is the input size.
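The parameter-count formulas above are standard and can be checked with a short sketch; the example sizes in the assertions are illustrative, not the paper's layer dimensions.

```python
# Layer-wise parameter counts for the space-complexity analysis.

def conv_params(h, w, i, o):
    """(H * W * I + 1) * O: each of O filters has H x W x I weights
    plus one bias."""
    return (h * w * i + 1) * o

def dense_params(n_in, m_out):
    """(n + 1) * m: a weight per input plus one bias, per output unit."""
    return (n_in + 1) * m_out

def lstm_params(m_input, n_hidden):
    """Four gates (input, forget, output, candidate), each with input
    weights, recurrent weights, and a bias:
    4 * (n_hidden * m_input + n_hidden^2 + n_hidden)."""
    return 4 * (n_hidden * m_input + n_hidden ** 2 + n_hidden)
```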

A Combination of Local and Global Search
The spatial complexity of the coverage path planning search is approximately O(n!). Although the TK-MCTS approach significantly improves the effectiveness of the simulation, it is not possible to conduct extensive searches in a nearly infinite search space. On a mediocre computer, the average number of valid searches is 326.89 when the search time is 10 s, and the full coverage path is found 153.12 times on average. The limited number of simulation samples introduces high error when evaluating the effectiveness of the current motion direction.
Starting from the current waypoint, we calculate the time and coverage of all paths within n steps using an exhaustive method. Figure 7a shows the impact of different step counts on coverage: merely increasing the number of exhaustive searches does not improve the simulation. The n-step exhaustive method can only determine whether a solution is locally optimal; we cannot approximate the globally optimal solution by increasing n. Moreover, as n increases, the exhaustion time grows exponentially.
To address the above problems, we design a method combining local search and global search to determine the simulation reward value. Through the TK-MCTS method, we roughly estimate the optimal global direction; through n-step exhaustion, we determine the direction of the optimal local solution. The simulated reward design is shown in Table 1.
Table 1 distinguishes the following cases: Negative, no waypoint can be flown to in the current state; TK-MCTS positive, a positive reward is obtained only through TK-MCTS; Exhaustive positive, a positive reward is obtained only through the exhaustive search; Exhaustive and TK-MCTS positive, a positive reward is obtained through both the exhaustive search and TK-MCTS; Other, no simulation results are obtained, i.e., neither the exhaustive search nor TK-MCTS finds a valid path coverage.
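The combination rule in Table 1 can be sketched as a small decision function. The concrete reward values (1 / 0 / −1) are an assumption for illustration; only the case structure follows the table.

```python
# Sketch of the combined local/global simulated-reward rule:
# a global TK-MCTS simulation and a local n-step exhaustive search
# each vote on whether the current direction leads to valid coverage.

def simulated_reward(flyable, tk_mcts_positive, exhaustive_positive):
    """flyable: some waypoint is reachable from the current state;
    tk_mcts_positive / exhaustive_positive: the respective search
    found a valid coverage continuation in this direction."""
    if not flyable:                          # "Negative" row of Table 1
        return -1.0
    if tk_mcts_positive or exhaustive_positive:
        return 1.0                           # either search endorses it
    return 0.0                               # "Other": no valid result
```

Combining the two signals this way is what lets the agent escape local optima (global vote) without paying for an unreliable long-range estimate alone (local vote).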

DRL Intelligent Path Planning
We initialize each layer's weights from a normal distribution (mean 0, variance 0.3), and the biases are set to 0.1. The training parameters are shown in Table 2. All models are trained and tested with TensorFlow on a single NVIDIA GeForce GTX 1050 Ti. The impact of batch size on training is as follows: if it is too small, the gradient varies greatly, the loss oscillates, and the network converges with difficulty; if it is too large, the gradient is very accurate and the loss oscillation is small, but training easily falls into a local optimum. Practice shows that the best training results are usually achieved with a batch size between 2 and 32. The effect of the learning rate on training is as follows: if it is too small, convergence takes longer; if it is too large, training may not converge or the loss may explode. The initial learning rate is usually set at 0.01-0.001. Replay memory breaks the correlation between data through storage and sampling; there is no particular requirement for the replay memory size. The discount factor is set to 0.9, which balances current and future rewards.
We analyze the validity of DRL intelligent path planning through experiments, evaluating the proposed work in terms of coverage quality and path planning quality in Cesium. With varying step sizes, we compare the coverage percentage of our proposed algorithm with the exhaustive algorithm and the MCTS algorithm. The computation results help determine the appropriate step size to ensure full terrain coverage with the generated waypoints.

Varying Step Size Coverage Performance
The distance between a terrain sample point and a waypoint directly affects the resolution of the terrain image obtained by the UAV. We set the visual range threshold from a waypoint to a terrain sampling point to 1000 m. The total UAV step number approximately represents the UAV's task time. The coverage results for the target area are shown in Fig. 8. The vertical axis presents the percentage of the area that can be covered with the given image resolution requirement, and the step size on the horizontal axis represents the number of blocks. We step through the environment in each iteration while searching for waypoints, testing step sizes ranging from 1000 to 1900. In Fig. 8, the UAV starts from different positions to verify the effectiveness of the DRL method in training UAV intelligent navigation. The DRL algorithm, the exhaustive algorithm, and the MCTS algorithm are used to calculate the average coverage at different step counts. The search time of the exhaustive method and MCTS is 300 s, during which the UAV performs multiple searches of a given step length. We can see that when step = 1700, the terrain coverage obtained by the DRL method reaches 100%. From 1000 steps to 1900 steps, DRL always achieves better coverage than the exhaustive algorithm and the MCTS algorithm; moreover, over the same range, the MCTS algorithm obtains better coverage than the exhaustive algorithm. During the initial training process, DRL selects the action calculated by the DNN with 50% probability at each step and a random action with 50% probability, allowing full exploration of UAV navigation in various situations. Through experiments, we show that DRL can learn excellent results with the TK-MCTS simulation.

Terrain Average Coverage
In coverage missions, we are concerned with the UAV's effective coverage. Effective coverage of the whole area requires high resolution of all terrain data collected by the UAV. We use heat maps to visually describe the performance of the DRL algorithm in improving UAV terrain coverage quality. Figure 9a-c shows the heat maps of the DRL algorithm, the MCTS algorithm, and the exhaustive method, respectively. Shades of color indicate distance: the darker the color, the larger the distance. The horizontal axis represents latitude, and the vertical axis represents longitude. First, the DRL heat map (Fig. 9a) is generally lighter than the other two, and the MCTS heat map (Fig. 9b) is lighter than Fig. 9c. With DRL navigation, the maximum distance from a waypoint to a terrain sampling point can be less than 200 m; with MCTS navigation, it is 600-700 m; and with the exhaustive method it is larger still. Second, the coverage achieved by the DRL method is very even across each piece of terrain, whereas the coverage of the other methods is uneven, with some parts light and some parts dark.

Conclusion
In this work, we develop a DRL framework for UAV navigation in large-scale complex environments. By converting GPS path data into image data, we can use GIS, with its accurate and rich data, as the training environment. This method effectively overcomes the large errors caused by the inconsistency between the training environment and the deployment environment. The combination of the TK-MCTS search method and the local search method not only effectively avoids falling into local optima, but also ensures learning the optimal search direction under limited computing power.
The results show that CNNs can learn important features from the GPS path information of the 3D environment and learn a navigation policy from the TK-MCTS simulation. The integration of visual navigation and GPS navigation is a research direction for improving navigation accuracy. For unknown terrain environments, the combination of online 3D terrain generation and GPS navigation is also an important issue to be studied in the future.