1 Introduction

Meeting users' expectations for next-generation communication systems requires significant technical advancements. In particular, there is a demand for systems that can reliably support large-scale data transmission and reception, enabling key applications such as augmented/virtual reality, autonomous driving, high-precision product manufacturing, and real-time information sharing [1]. To address this demand, comprehensive networks in three-dimensional space that combine terrestrial networks with aerial/satellite/space communication using mobile aerial base stations (ABS) are under investigation [2,3,4,5], along with practical means of applying artificial intelligence (AI) algorithms [6,7,8] and digital twin technology [9,10,11].

Unmanned aerial vehicles (UAVs) serve as an economical and efficient means of implementing an ABS, as they can be deployed to areas with concentrated data demand while maintaining line-of-sight (LOS) links to user equipment (UE) for high-speed data transmission. UAVs can dynamically change their positions, providing flexible coverage of target areas. This not only enhances the quality of mobile communication services but has also proven effective in scenarios such as disasters or temporary surges in communication demand [12, 13]. A wireless communication system utilizing an ABS consists of a backhaul link between ground base stations (GBS) and the ABS, as well as an access link from the UAV to ground UEs. Due to the mobility of the UAV, both the backhaul and access links exhibit time-varying characteristics. Therefore, high-precision beamforming technology is required to ensure optimal transmission performance.

Optimizing communication systems by leveraging the high utility of UAVs together with AI algorithms has emerged as an important research topic. For UAV position prediction, a long short-term memory based recurrent neural network was designed in [13], and a new double deep Q-network model was proposed to maximize the sum-rate of an uplink system in which the UAV acts as a base station (BS) [14]. These results assume a stochastic loss channel model and rely on the assumption of perfect channel state information (CSI) [14, 15]. The use of the random waypoint mobility model and the reference point group mobility model to generate UE locations is a further drawback, as these assumptions may deviate from reality [15]. In [16], transmission links are assumed to suffer shadowed-Rician and Rayleigh fading. In [17], all channels are assumed to experience quasi-static block fading with statistical or perfect CSI, which can be applied to evaluate system performance in various environmental circumstances. Utilizing UAVs as ABSs with hybrid beamforming based on massive multiple-input multiple-output (MIMO) has been studied in [18], where bicycle UEs are generated on actual roads in Belgium; signal attenuation is calculated using ray-tracing to determine the channel, and beamforming is applied based on the ray-tracing channel. The utilization of a UAV as a node between road-side units and vehicles, aided by a deep learning-based resource allocation algorithm, is presented in [19], where environmental variables including time slots and local computation capacity are defined as digital representations.

Optimization of UAV-related parameters has also been studied intensively, with particular focus on the development and use of AI algorithms in the optimization process. Reinforcement learning algorithms such as the deep deterministic policy gradient (DDPG) and the deep Q-network (DQN) are widely used, for example to train UAVs to navigate safely to their destinations while avoiding obstacles [20] and to manage UAVs for maximum energy efficiency [21, 22]. The DQN has also been utilized to optimize the transmission power and location of a UAV ABS in order to maximize the data rate of ground UEs [23]. A hierarchical deep Q-network was developed for beam alignment between a UAV UE and a BS in the millimeter-wave band [24]; in this case, location information reduces the computational complexity of beam searching and efficiently maximizes the data rate. Optimization of the UAV BS trajectory and the phase shifts of intelligent reflecting surface elements to maximize the data rate and energy efficiency of UEs is discussed in [25], where the performance of DQN and DDPG is compared. An asymmetric long short-term memory-deep deterministic policy gradient algorithm is proposed to improve the ergodic sum-rate in satellite-aerial-ground integrated networks [26], and a multi-objective deep deterministic policy gradient algorithm is designed for trajectory optimization and beamforming design when UAVs are utilized as reconfigurable intelligent surfaces [27]. A joint optimization design for satellite-terrestrial integrated networks targeting sum-rate maximization is proposed in [28]. Kalman-filter-based methods, particle swarm optimization, and a Monte-Carlo coupling algorithm for optimizing UAV operation can be found in [29,30,31].

AI algorithms can be utilized to operate UAVs effectively, offering the advantage of automatically optimizing UAV parameters based on specific environmental conditions. In the process of acquiring the operating environment for UAVs, one can construct a digital twin for accurate simulation of the wireless channel and user mobility [9]. A versatile utilization of UAVs can be achieved by constructing a digital twin of an indoor space using the Unity software, allowing real-time remote control of UAVs [32]. 3-dimensional (3D) laser scanning can collect the point cloud data of an indoor workspace, and various 3D software tools such as 3ds Max can be used to create outline models of equipment; a highly realistic digital twin can then be created by importing the data into a rendering platform in FBX format [33]. Multiple-input single-output (MISO) downlink transmission modelling for UAV communications in such environments can be found in recent excellent works in [34] and [35]. These cases often involve performance evaluations based on ideal assumptions and virtual environments, and applying such virtual environments to real-world scenarios can be uneconomical and impractical due to disparities with the actual environment. To ensure more economical and practical operation of a UAV as the ABS, it is essential to implement a real-time analysis and control capable digital twin system. Such a system should consider the characteristics and conditions of time-varying environments, including user mobility and positions, base station locations, and phenomena such as radio scattering and refraction. Utilizing a digital twin allows virtual simulation and testing in various technological areas before actual deployment or operation, leading to savings in time and cost. In real-world scenarios where information is distributed and comprehensive monitoring is challenging, a digital twin environment enables integrated management by consolidating operational data, addressing the limitations of decentralized information.

The contributions of this paper are as follows. First, we present a systematic procedure for constructing a digital twin applicable to environment-aware channel coefficient generation for wireless communication systems. The procedure can be adopted for high-precision performance simulations and predictions in many wireless systems using the set of available software packages explained in this paper. By using the ray-tracing algorithm combined with exact building and terrain information of the environment, the resulting performance simulation provides a highly accurate position-dependent channel model, unlike the probabilistic channel modelling used in existing 3GPP specifications. Second, we demonstrate how the constructed digital twin can be utilized for performance enhancement of the wireless system by combining the twin model with AI algorithms. In particular, we apply reinforcement learning to optimize UAV positioning within the digital twin control system to maximize the sum-rate of multi-user signal transmission. This optimized position information can then be fed back to the physical environment for the desired UAV operation. In practical operational scenarios, the reinforcement learning algorithm is updated in real time to maximize the reward and produce the desired position for ABS signal transmission. Additionally, we validate our approach by comparing the resulting performance in operational scenarios with and without the digital twin. The rest of the paper is organized as follows. In Sect. 2, the system model, the construction procedure of the digital twin, the ray-tracing operation, and the DDPG algorithm are explained. Section 3 includes experimental evaluation results and related discussions. The conclusion is given in Sect. 4. The parameters and notation used in this paper are listed in Tables 1 and 2.

Table 1 Parameters and notation for the signal model, the digital twin, and ray-tracing
Table 2 Parameters and notations for DDPG

2 Methods/experimental

2.1 Signal model

A multi-user MISO downlink transmission scenario is considered, in which a BS equipped with \(M_{T}\) antenna elements transmits to \(K\) single-antenna UEs. The received signal vector \({\varvec{y}} = \left[ {y_{1} , y_{2} , \ldots , y_{K} } \right]^{{\text{T}}}\) for the \(K\) users is expressed as

$${\varvec{y}} = {\varvec{HFs}} + {\varvec{n}}$$
(1)

where \({\varvec{H}} = \left[ {{\varvec{h}}_{1}^{{\text{T}}} ,{\varvec{h}}_{2}^{{\text{T}}} , \ldots ,{\varvec{h}}_{K}^{{\text{T}}} } \right]^{{\text{T}}} \in C^{{K \times M_{T} }}\) is the channel matrix containing the kth user's channel vector \({\varvec{h}}_{k} = \left[ {h_{k,1} ,h_{k,2} , \ldots ,h_{{k,M_{T} }} } \right]\). The coefficients of each channel vector are obtained from the digital-twin-based ray-tracing simulation process. Also, \({\varvec{F}} = \left[ {{\varvec{f}}_{1} ,{\varvec{f}}_{2} , \ldots ,{\varvec{f}}_{K} } \right] \in C^{{M_{T} \times K}}\) is the beamforming matrix containing the vectors \({\varvec{f}}_{k} = \left[ {f_{k,1} ,f_{k,2} , \ldots ,f_{{k,M_{T} }} } \right]^{{\text{T}}}\) with the power constraint \({\text{Tr}}\left( {{\varvec{f}}_{k} {\varvec{f}}_{k}^{{\text{H}}} } \right) \le M_{T} /K\). The transmitted signal vector is denoted as \({\varvec{s}} = \left[ {s_{1} , s_{2} , \ldots , s_{K} } \right]^{{\text{T}}}\) with \(E\left[ {\left| {s_{k} } \right|^{2} } \right] = 1\), and \({\varvec{n}} = \left[ {n_{1} , n_{2} , \ldots , n_{K} } \right]^{{\text{T}}}\) is the additive white Gaussian noise vector with \(E\left[ {\left| {n_{k} } \right|^{2} } \right] = \sigma_{n}^{2}\). The received signal-to-interference-plus-noise ratio (SINR) of the kth user is given by

$$\Gamma_{k} = \frac{{\left| {{\varvec{h}}_{k} {\varvec{f}}_{k} } \right|^{2} }}{{\mathop \sum \nolimits_{l \ne k} \left| {{\varvec{h}}_{k} {\varvec{f}}_{l} } \right|^{2} + \sigma_{n}^{2} }}$$
(2)

and the achievable sum-rate for all UEs is determined as

$$R = \mathop \sum \limits_{k = 1}^{K} \log_{2} \left( {1 + \Gamma_{k} } \right).$$
(3)
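To make the signal model concrete, the following minimal NumPy sketch builds a zero-forcing beamformer satisfying the per-user power constraint and evaluates the SINR of Eq. (2) and the sum-rate of Eq. (3) for a random channel matrix; the antenna and user counts, the noise power, and the normalization are illustrative assumptions rather than the configuration used in the experiments.

```python
import numpy as np

def zf_sum_rate(H, sigma2=1e-3):
    """Zero-forcing beamforming, per-user SINR (Eq. 2), and sum-rate (Eq. 3).

    H      : (K, M_T) complex channel matrix, rows are the user channels h_k
    sigma2 : noise power sigma_n^2
    """
    K, M_T = H.shape
    # Unnormalized ZF beamformer: pseudo-inverse of H (columns are the f_k directions)
    F = np.linalg.pinv(H)                       # (M_T, K)
    # Scale each column so that Tr(f_k f_k^H) = ||f_k||^2 = M_T / K
    F = F / np.linalg.norm(F, axis=0, keepdims=True) * np.sqrt(M_T / K)

    G = H @ F                                   # effective channel, G[k, l] = h_k f_l
    signal = np.abs(np.diag(G)) ** 2
    interference = np.sum(np.abs(G) ** 2, axis=1) - signal
    sinr = signal / (interference + sigma2)     # Eq. (2)
    return np.sum(np.log2(1.0 + sinr))          # Eq. (3)

# Example with illustrative dimensions: 16 Tx antennas, 4 single-antenna UEs
rng = np.random.default_rng(0)
H = (rng.standard_normal((4, 16)) + 1j * rng.standard_normal((4, 16))) / np.sqrt(2)
print(f"ZF sum-rate: {zf_sum_rate(H):.2f} bit/s/Hz")
```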

2.2 Digital twin construction

The digital twin enables various accurate simulations; in our digital twin, ray-tracing simulation and DDPG reinforcement learning are conducted. The digital twin provides useful and accurate information and predictions to the physical space at a sufficiently short update period denoted by \(t_{{{\text{update}}}}\). In 5G new radio communication, the connection between the UE and the network is established through the initial access process, which provides signals for channel estimation and beam selection [36]. Once the connection is established via the initial access procedure, the channel information is continuously updated using the channel state information reference signal (CSI-RS) at time intervals of \(\tau \in\) {5, 10, 20, 40, 80, 160} ms. The proposed digital-twin-based UAV optimization process outputs the optimal position of the UAV at every \(t_{{{\text{update}}}}\) instance, as depicted in Fig. 1. Here, \(t_{{{\text{update}}}}\) is set shorter than the minimum 5G reference signalling interval, enabling enhanced communication performance. In a communication system where both the UEs and the UAV move, the signal coherence time is finite; therefore, \(t_{{{\text{update}}}}\) is also adjusted to be shorter than the overall coherence time.
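As a numeric illustration of this constraint, the short Python check below relates an assumed relative speed to the Doppler spread and the resulting coherence time using the common approximation \(T_{c} \approx 0.423/f_{D}\); the speed and carrier frequency are example values, not the experimental settings.

```python
# Illustrative check of the update-period constraint (example values only):
# coherence time T_c ~ 0.423 / f_D, with f_D = v * f_c / c the maximum Doppler shift.
c = 3e8            # speed of light [m/s]
f_c = 28e9         # carrier frequency [Hz], FR2 band
v = 15.0           # assumed relative UE/UAV speed [m/s]

f_D = v * f_c / c                 # ~1.4 kHz maximum Doppler shift
T_c = 0.423 / f_D                 # ~0.3 ms coherence time
t_update = 0.5 * T_c              # pick t_update below both T_c and the 5 ms CSI-RS minimum
print(f"f_D = {f_D:.0f} Hz, T_c = {T_c*1e3:.2f} ms, t_update = {t_update*1e3:.2f} ms")
```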

Fig. 1 Utilization of the digital twin to enhance the transmission performance in the physical channel

To construct the digital twin, information such as the map, terrain contours, building shapes, and object positions from the physical space is applied to 3D modelling so that the channel can be estimated accurately. Once the 3D model is constructed, the UAV position and the UE locations acquired by GPS in the physical space are applied to the digital twin. The UE locations can be actual values acquired by additional sensors. In our simulation, typical UE behaviours are obtained from Simulation of Urban MObility (SUMO), a traffic simulation software package that supports various modes of transportation and generates traffic flow based on embedded road information. These UE behaviours are applied to the previously generated 3D model to reflect the terrain. During experimentation, reinforcement learning is conducted using fixed UE location data, assumed to represent the real-time positions of moving vehicles generated by SUMO; these fixed locations result from sampling the vehicle movements. Using these locations, exact channel coefficients are generated by the ray-tracing procedure and used to estimate the optimized ABS position via DDPG reinforcement learning. The digital twin functionalities can be implemented as part of the GBS management software, conducting the ray-tracing and DDPG position optimization procedures to determine the ABS position. Only one UAV ABS is assumed within the target area of interest, and the wireless backhaul link to the UAV is assumed to be perfect, with no interference between the GBS downlink signal and the UAV ABS downlink signal.

Open-source tools are applied to generate the 3D model, incorporating the position and contour information of surrounding terrain and buildings. Blender, a 3D computer graphics software package, facilitates the import of various data formats into a workspace and enables the editing of objects' shapes or movements. Blender also allows users to download and utilize extra add-ons, such as BlenderGIS, to create the 3D communication environment model. The process of importing data into the Blender workspace involves the sequential execution of tasks: obtaining satellite images from Google Maps, topographical data from OpenTopography, and 3D building information from OpenStreetMap. This order of operations ensures that the building placement follows the contours of the terrain. However, the imported topographical data may deviate from Google Earth depending on the region; the elevation intensity of the terrain is therefore adjusted, referencing Google Earth, to closely align the terrain data with it. Furthermore, the building information imported from OpenStreetMap differs in aspects such as height and provides only a rough approximation of building shapes when compared with S-Map, a LiDAR-based 3D map provided by the Seoul Metropolitan Government. To enhance the accuracy, building information is supplemented by referencing S-Map, leading to the addition of buildings in accordance with its data. Mitsuba Blender is a Blender add-on for exporting the 3D communication environment model in XML format. To perform ray-tracing based on the digital twin, this add-on was employed to save the 3D model in XML format; the add-on is supported only in Blender v3.4.1. SUMO traffic generation is based on OpenStreetMap road information. OpenStreetMap is an open-source map that provides diverse information worldwide, including roads, trails, railways, and buildings; it is used by the BlenderGIS add-on to import building information and by SUMO to import road information. Sionna is an open-source communication simulation library developed by NVIDIA based on TensorFlow and Python. It enables link-level simulations for the physical layer of wireless communication systems and supports Python versions 3.6–3.9 and TensorFlow versions 2.10 and above. Installing NVIDIA GPU drivers, CUDA, and cuDNN allows for GPU-accelerated simulations.

To initiate terrain modelling, the BlenderGIS add-on is employed to import the image plane of the desired region from Google Maps and subsequently integrate the 3D topographic details provided by OpenTopography. It is crucial to verify the accuracy of the terrain information across different regions. Adjustments to the terrain contours are made using the strength parameter in Blender's Modifier Properties. Terrain refinements include adjustments to the coordinate space, such as the origin and orientation of the digital twin. After applying the terrain contours, the 3D building information from OpenStreetMap is imported into the workspace. It is important to note that, depending on the location and building size, OpenStreetMap may not include buildings in certain areas. In such cases, extra building objects are created within Blender, referencing Google Maps, to accurately represent the buildings in the specified region. After modelling the 3D communication environment through the aforementioned process, the material properties of objects are specified to facilitate ray-tracing based on Sionna. Guidance on object material assignments can be found in the documentation provided in [30]. The material of objects in the digital twin affects ray-tracing when signal reflection occurs, as the attenuation of a signal path is influenced by the material properties of the reflector.

UE modelling is performed by configuring vehicle generation parameters and the environment using SUMO. The target area's road information is downloaded from OpenStreetMap in OSM data format and transformed into NET.XML format to be compatible with SUMO. Given the potential disparities between the downloaded road information and the actual environment, adjustments are made in SUMO after importing the road information, including modification of lane count, lane direction, and shape. The SUMO program and JSON-formatted code are then utilized to configure parameters such as the number of vehicles, the vehicle coordinate sampling interval, lane-change behaviour, and traffic signal generation. SUMO is executed through Python-based scripting to generate 2-dimensional (2D) vehicle coordinate data. According to the Blender and SUMO documentation, the unit of parameters such as lane length and object coordinates in Blender and SUMO is the meter, while the sampling interval of vehicle coordinates in SUMO is given in seconds. The 3D communication environment model and the 2D vehicle coordinate data are combined to determine the altitude of the terrain in Blender; this information is used to lift the 2D coordinates into 3D vehicle coordinates, which is the final step in modelling the digital twin UEs. Figure 2 shows the visualization of the digital twin model for the Daeheung intersection near Sogang University Campus in Seoul. The origin is situated at the centre of the intersection, and an area of dimension 150 m × 150 m is created. UE locations are indicated by the blue cars in figure parts (a) and (b), which are the south and west views of the intersection, respectively, and the ABS location is shown by the red UAV. These figures illustrate the scenario of the UAV at the example position (0, 0, 50) [m]; the UAV position changes throughout the reinforcement learning process. The total number of UEs is 57, positioned within a 150 m × 150 m range on the xy-plane and spanning from −10 to 10 m in the z-direction, following the terrain's contours. In figure parts (c) and (d), the locations of the transmitter and the receivers are depicted in red and blue circles, respectively, for clearer representation.
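A minimal sketch of the vehicle-coordinate generation described above is given below, using SUMO's netconvert tool and the TraCI Python interface; the file names, sampling interval, and simulation length are illustrative assumptions rather than the actual configuration.

```python
import subprocess
import traci  # SUMO's Python API (ships with the SUMO installation)

# Convert OpenStreetMap road data to a SUMO network (file names are placeholders)
subprocess.run(["netconvert", "--osm-files", "daeheung.osm", "-o", "daeheung.net.xml"],
               check=True)

# Launch SUMO headless with a prepared configuration (routes, traffic lights, etc.)
traci.start(["sumo", "-c", "daeheung.sumocfg"])

sample_interval_s = 1.0     # vehicle coordinate sampling interval [s]
duration_s = 300.0          # total simulated time [s]
trajectories = {}           # vehicle id -> list of (t, x, y) in meters

t = 0.0
while t < duration_s:
    traci.simulationStep()  # advance by one step (step length of 1 s assumed here)
    for vid in traci.vehicle.getIDList():
        x, y = traci.vehicle.getPosition(vid)       # 2D vehicle coordinates [m]
        trajectories.setdefault(vid, []).append((t, x, y))
    t += sample_interval_s

traci.close()
# The 2D coordinates are later lifted to 3D by querying the terrain height in Blender.
```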

Fig. 2 The digital twin construction for the intersection near Sogang University Campus: a South view, b west view, c south view with Tx/Rx locations indicated by circles, d west view with Tx/Rx locations indicated by circles

2.3 Ray-tracing operation

Ray-tracing is a method of computing the signal propagation model with a modified geometrical method [37, 38]. Computing the direct rays and the reflected or diffracted rays is the main process of ray-tracing: direct rays are called line-of-sight (LoS) rays, and reflected/diffracted rays are called non-line-of-sight (NLoS) rays. Channel coefficients generated using ray-tracing based on exact geographical characteristics are known to closely match the physical channel values measured in the same experimental settings. We adopt the newly developed Sionna software library in Python for the ray-tracing operation [39], which is well matched to the 3D geometry information provided by the digital twin. The shooting-and-bouncing-rays algorithm implemented in Sionna has been verified to produce very accurate modelling of the physical channel in extensive investigations [40,41,42]. We apply precise transmitter and receiver locations within the pre-defined geometry, and exact transmission parameters such as those given in Table 1, to generate the channel coefficients used for performance evaluation. While conventional stochastic channel models define the number and distribution patterns of clusters to determine the path loss, angles of arrival and departure, and delay spread in a probabilistic fashion, the digital twin model provides deterministic channel parameters with proven accuracy at the geometric locations of interest. The channel coefficient from the mth antenna to the kth user can be expressed as

$$h_{k,m} = \mathop \sum \limits_{{n_{p} = 1}}^{{N_{p} }} a_{{n_{p} }} e^{{j2\pi f_{c} \tau_{{n_{p} }} }} \delta \left( {\tau - \tau_{{n_{p} }} } \right) \approx \mathop \sum \limits_{{n_{p} = 1}}^{{N_{p} }} a_{{n_{p} }} e^{{j2\pi f_{c} \tau_{{n_{p} }} }}$$
(4)

where \(n_{p}\) is the multi-path index, \(N_{p}\) is the number of multi-paths, \(a_{{n_{p} }}\) is the path attenuation, and \(\tau_{{n_{p} }}\) is the path delay of the \(n_{p}\)th multi-path. For most multi-carrier transmissions with sufficiently long symbol duration, the path delays \(\tau_{{n_{p} }}\), which average about 280 ns in our scenario, are negligibly small compared with the symbol duration; hence, the delta term in Eq. (4) is dropped and the channel coefficient is approximated by the sum of the complex path gains of the multi-paths. Figure 3 visualizes the ray-tracing result in the digital twin when the UAV is located at \(\left( {x, y, z} \right) = \left( {0,{ }0,{ }50} \right)\) [m].
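As an illustration of this step, the sketch below loads the exported digital twin into Sionna's ray-tracing module, places the UAV transmitter and one UE receiver, and collapses the computed path gains into a narrowband coefficient in the spirit of Eq. (4); the scene file name, array geometry, and positions are placeholders, and the exact API may vary across Sionna versions.

```python
import numpy as np
from sionna.rt import load_scene, Transmitter, Receiver, PlanarArray

# Load the XML scene exported from Blender via the Mitsuba add-on (placeholder file name)
scene = load_scene("daeheung.xml")
scene.frequency = 28e9  # carrier frequency f_c [Hz], FR2 band

# Antenna arrays: MISO downlink with M_T transmit antennas and single-antenna UEs
scene.tx_array = PlanarArray(num_rows=1, num_cols=16, vertical_spacing=0.5,
                             horizontal_spacing=0.5, pattern="iso", polarization="V")
scene.rx_array = PlanarArray(num_rows=1, num_cols=1, vertical_spacing=0.5,
                             horizontal_spacing=0.5, pattern="iso", polarization="V")

scene.add(Transmitter(name="uav", position=[0.0, 0.0, 50.0]))   # UAV ABS at (0, 0, 50) m
scene.add(Receiver(name="ue0", position=[35.0, -20.0, 1.5]))    # one sampled UE location

# Shooting-and-bouncing-rays computation of LoS and NLoS (reflected/diffracted) paths
paths = scene.compute_paths(max_depth=3, diffraction=True)
a, tau = paths.cir()            # complex path coefficients a_{n_p} and delays tau_{n_p}

# Narrowband approximation of Eq. (4): drop the delta term and sum the complex
# path coefficients (which already carry the carrier phase rotation) over the paths
a = a.numpy()                   # shape [..., num_paths, num_time_steps]
h = np.sum(a, axis=-2).squeeze()
print("Approximate channel coefficients from each Tx antenna to the UE:", h)
```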

Fig. 3 Illustration of ray-tracing for channel coefficient generation inside the digital twin

2.4 DDPG for UAV position optimization

We apply the DDPG algorithm to maximize the overall sum-rate performance of the wireless access link via UAV position optimization. DDPG is a reinforcement learning algorithm that extends the concept of DQN to handle continuous action spaces. Developed as an enhancement of the deterministic policy gradient algorithm, DDPG incorporates mechanisms such as experience replay, nonlinear function approximation, and target networks to improve the efficiency of data sampling and enhance the stability of learning [33]. DDPG enables an agent to execute continuous actions in a continuous state space and learns a policy in a model-free environment to perform a specific action at a specific state. It consists of four deep neural networks (DNNs): the policy network, the policy target network, the critic network, and the critic target network, along with one experience replay buffer. Given the state \(s\), the action \(a\), the reward \(r\), and the next state \(s^{\prime}\), the experience replay buffer stores tuples represented as \(\left( {s, a, r, s^{\prime} } \right)\).

The Markov decision process (MDP) can be used to explain the various considerations in DDPG. The MDP consists of the state space, action space, reward function, policy function \(\pi\), and state transition probability. When precise state transition probabilities are not readily available, as in typical real-world situations, MDPs can be solved through DDPG learning. The state represents the information on the environment observed after the agent interacts with it. The DDPG environment consists of the UAV ABS within the digital twin and the downlink communication system from the UAV ABS to the UEs. We set the state to the position of the UAV in the 3D coordinate system, expressed as the vector \({\varvec{s}} = \left( {s_{1} , s_{2} , s_{3} } \right)\), where each element represents the coordinate along the \(x\), \(y\), and \(z\)-axis, respectively. The action space is the set of available actions, and an action is the output of the policy function given a state as input. The agent learns the policy network to maximize the Q-value and then performs the action obtained from it; the Q-value is the expected value of the rewards accumulated throughout the learning process. The action is generated by adding noise to the output of the policy network and has dimension 3, so it can be expressed as \({\varvec{a}} = \left( {a_{1} , a_{2} , a_{3} } \right) = \pi \left( {\varvec{s}} \right) + {\varvec{n}}_{{\varvec{l}}}\), where \({\varvec{n}}_{{\varvec{l}}}\) is the added noise vector. The noise provides exploration capability to the DDPG model, thereby facilitating enhanced learning. Commonly used types of noise are the Ornstein–Uhlenbeck (OU) noise and the Gaussian noise [43]. The Ornstein–Uhlenbeck process is a random process that refines the white-noise model by filtering out high frequencies, generating noise that depends on previous states [44]. OU noise was therefore chosen instead of Gaussian noise, matching the characteristic of reinforcement learning in which previous learning influences subsequent learning.
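A minimal sketch of the OU exploration noise added to the policy output is shown below; the process parameters are illustrative and not those listed in Table 2.

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated exploration noise n_l
    added to the policy output, a = pi(s) + n_l."""

    def __init__(self, dim=3, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2, seed=0):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.rng = np.random.default_rng(seed)
        self.x = np.full(dim, mu, dtype=float)   # noise state, reused across steps

    def reset(self):
        self.x[:] = self.mu

    def sample(self):
        # dx = theta * (mu - x) dt + sigma * sqrt(dt) * dW  (Euler-Maruyama step)
        dx = self.theta * (self.mu - self.x) * self.dt \
             + self.sigma * np.sqrt(self.dt) * self.rng.standard_normal(self.x.shape)
        self.x = self.x + dx
        return self.x

# Toy usage: perturb a (zero) policy output to obtain an exploratory action
noise = OUNoise(dim=3)
a = np.zeros(3) + noise.sample()
```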

After the agent performs an action, the reward and the Q-value are updated. Maximizing the Q-value is the goal of DDPG learning because it leads to the maximum reward throughout the learning process. The agent interacts with the environment by performing actions in a given state and receiving rewards. We determine the sum-rate by performing rank-adaptive zero-forcing (ZF) beamforming using the generated channel coefficients. The average sum-rate is set as the reward, which is expressed as

$$r = \frac{1}{{N_{R} }}\mathop \sum \limits_{n = 1}^{{N_{R} }} \mathop \sum \limits_{k = 1}^{K} \log_{2} \left( {1 + \Gamma_{k} } \right)$$
(5)

where \(N_{R}\) represents the transmission rank. The policy function determines which action the agent will take based on the state, aiming to maximize the Q-value. In the DDPG model, the policy network serves the role of the policy function, and it is updated in the direction of maximizing the Q-value. The agent takes an action in the environment by adding noise to the output of the policy network and using the result as the action. The UAV position for the next state is determined as

$${\varvec{s}}^{\prime} = \left( {s_{1}^{\prime} , s_{2}^{\prime} , s_{3}^{\prime} } \right) = \left( {\alpha_{1} a_{1} + c_{1} , \alpha_{2} a_{2} + c_{2} ,\alpha_{3} a_{3} + c_{3} } \right)$$
(6)

where \(\{ \alpha_{1}\), \(\alpha_{2}\), \(\alpha_{3} \}\) is the set of scaling factors and \(\{ c_{1}\), \(c_{2}\), \(c_{3} \}\) is the set of offset values, which together determine the range of the optimized UAV position. DDPG learning progresses by continuously updating the behaviour networks and the target networks. The behaviour networks consist of the critic network and the policy network. The critic network takes both the state and the action as input and determines the Q-value given by

$$Q^{\mu } \left( {s_{{t_{i} }} , a_{{t_{i} }} } \right) = {\text{E}}_{{r_{{t_{i} }} , s_{{t_{i} }}^{\prime} \sim {\mathcal{B}}}} \left[ {r_{{t_{i} }} + \gamma Q^{\mu } \left( {s_{{t_{i} }}^{\prime} , \mu \left( {s_{{t_{i} }}^{\prime} } \right)} \right)} \right]$$
(7)

where \({\mathcal{B}} = \left\{ {\left( {s_{{t_{1} }} , a_{{t_{1} }} , r_{{t_{1} }} , s_{{t_{1} }}^{\prime} } \right), \ldots ,\left( {s_{{t_{{N_{{{\text{mb}}}} }} }} , a_{{t_{{N_{{{\text{mb}}}} }} }} , r_{{t_{{N_{{{\text{mb}}}} }} }} , s_{{t_{{N_{{{\text{mb}}}} }} }}^{\prime} } \right)} \right\}\) is a mini-batch of \(N_{{{\text{mb}}}}\) tuples randomly sampled from the experience replay buffer \({\mathcal{D}}\). The weights of the critic network are updated in the direction of minimizing the temporal-difference (TD) error

$$L\left( {\theta^{Q} } \right) = \frac{1}{{N_{{{\text{mb}}}} }}\mathop \sum \limits_{i = 1}^{{N_{{{\text{mb}}}} }} \left( {y_{{t_{i} }} - Q\left( {s_{{t_{i} }} , a_{{t_{i} }} } \right)} \right)^{2}$$
(8)

where \(y_{{t_{i} }}\) is the TD target, computed using the critic target network and the policy target network. It can be expressed as

$$y_{{t_{i} }} = r_{{t_{i} }} + \gamma Q^{\prime} \left( {s_{{t_{i} }}^{\prime} ,\mu^{\prime} \left( {s_{{t_{i} }}^{\prime} } \right)} \right)$$
(9)

where \(Q^{\prime}\) and \(\mu^{\prime}\) denote the critic target network and the policy target network, respectively. The policy network takes the state as the input, and the weights of the policy network are updated to maximize the Q-value using the loss function

$$L\left( {\theta^{\mu } } \right) = - \frac{1}{{N_{{{\text{mb}}}} }}\mathop \sum \limits_{i = 1}^{{N_{{{\text{mb}}}} }} Q\left( {s_{{t_{i} }} , \mu \left( {s_{{t_{i} }} } \right)} \right)$$
(10)

Using the loss function \(L\left( {\theta^{\mu } } \right)\), the policy network is updated by the policy gradient as

$$\nabla_{{\theta^{\mu } }} J \approx \frac{1}{{N_{{{\text{mb}}}} }}\mathop \sum \limits_{i = 1}^{{N_{{{\text{mb}}}} }} \nabla_{a} Q\left( {s_{t} ,a_{t} {|}\theta^{Q} } \right)\left. \right|_{{s_{t} = s_{{t_{i} }} , a = \mu \left( {s_{{t_{i} }} } \right)}} \nabla_{{\theta^{\mu } }} \mu \left( {s_{t} {|}\theta^{\mu } } \right)\left. \right|_{{s_{{t_{i} }} }}$$
(11)

where \(J\) is the expected value of the rewards obtained under a given policy, and \(\mu \left( {s_{t} {|}\theta^{\mu } } \right)\) represents the policy network. The weights of the target networks are softly updated as

$${\varvec{\theta}}_{t}^{{Q^{\prime}}} \leftarrow \tau {\varvec{\theta}}_{t}^{Q} + \left( {1 - \tau } \right){\varvec{\theta}}_{t}^{{Q^{\prime}}}$$

and

$${\varvec{\theta}}_{t}^{{\mu^{\prime}}} \leftarrow \tau {\varvec{\theta}}_{t}^{\mu } + \left( {1 - \tau } \right){\varvec{\theta}}_{t}^{{\mu^{\prime}}}$$
(12)

with the target soft update factor \(\tau \ll 1\). Figure 4 represents the process of updating and learning in the DDPG model, along with the inputs and outputs of the variables. By performing ray-tracing simulation simultaneously with reinforcement learning, one can monitor the characteristics and performance of the communication system in real time as the UAV position is updated. Moreover, the reinforcement learning algorithm immediately reflects the current state of the communication system and optimizes the UAV position with real-time information.

Fig. 4 UAV position optimization DDPG algorithm

The algorithm below describes the update process of the proposed DDPG model used for position optimization of the UAV.

Algorithm 1 Update process of the proposed DDPG model for UAV position optimization
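As a concrete companion to the algorithm above, the following TensorFlow sketch shows one mini-batch update corresponding to Eqs. (8)–(12); the network objects, optimizers, and batch format are assumptions for illustration and do not reproduce the exact implementation.

```python
import tensorflow as tf

gamma, tau = 0.9, 0.005   # discount factor and target soft update factor (values chosen in Sect. 3)

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt):
    """One DDPG mini-batch update following Eqs. (8)-(12)."""
    s, a, r, s_next = batch   # tensors of shape (N_mb, 3), (N_mb, 3), (N_mb,), (N_mb, 3)

    # TD target (Eq. 9), computed with the target networks
    y = r + gamma * tf.squeeze(target_critic([s_next, target_actor(s_next)]), axis=1)

    # Critic update: minimize the TD error (Eq. 8)
    with tf.GradientTape() as tape:
        q = tf.squeeze(critic([s, a]), axis=1)
        critic_loss = tf.reduce_mean(tf.square(y - q))
    grads = tape.gradient(critic_loss, critic.trainable_variables)
    critic_opt.apply_gradients(zip(grads, critic.trainable_variables))

    # Actor update: maximize Q, i.e. minimize its negative (Eqs. 10-11)
    with tf.GradientTape() as tape:
        actor_loss = -tf.reduce_mean(critic([s, actor(s)]))
    grads = tape.gradient(actor_loss, actor.trainable_variables)
    actor_opt.apply_gradients(zip(grads, actor.trainable_variables))

    # Soft update of the target networks (Eq. 12)
    for target, source in ((target_critic, critic), (target_actor, actor)):
        for tw, sw in zip(target.variables, source.variables):
            tw.assign(tau * sw + (1.0 - tau) * tw)
```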

3 Results and discussion

We consider three different intersection areas near Sogang University: the 'Daeheung', 'Sinchon', and 'Ewha' intersections. The geographical and building information for these areas is applied to the digital twin model. The digital twin for the Daeheung intersection covers a range of 150 m × 150 m, while the digital twins for the Ewha and Sinchon intersections are constructed over a range of 300 m × 300 m centred on the origin. Actual buildings, terrain, and road infrastructure in the corresponding areas are incorporated into the digital twin. The ZF precoder is generated based on the ray-tracing channel, and the users are scheduled in a rank-adaptive fashion. When applying the average sum-rate as the reward, \(K\) individual UEs are selected from the possible UE locations, and the sum-rate is evaluated accordingly. The carrier frequency values are selected from the millimeter-wave FR2 band.

The policy network of DDPG consists of two fully connected layers with ReLU activation functions. At the final layer, where the action is derived, the hyperbolic tangent activation function is applied to constrain each action value between −1 and 1. Similarly, the critic network of DDPG consists of two fully connected layers with ReLU activation functions. In the training process, we utilize 5000 training episodes with a replay buffer capacity of 100,000 and a mini-batch size of 8. Detailed simulation parameters and DDPG network configurations are listed in Tables 3 and 4.
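The following Keras sketch reflects the stated architecture, two fully connected ReLU layers with a tanh output for the actor and a scalar output for the critic; the hidden-layer widths are placeholders, since the actual configuration is given in Table 4.

```python
import tensorflow as tf
from tensorflow.keras import layers

STATE_DIM = ACTION_DIM = 3   # UAV position state and position-adjustment action
HIDDEN = 256                 # illustrative width; see Table 4 for the actual configuration

def build_actor():
    s = layers.Input(shape=(STATE_DIM,))
    x = layers.Dense(HIDDEN, activation="relu")(s)
    x = layers.Dense(HIDDEN, activation="relu")(x)
    # tanh keeps each action component in (-1, 1); Eq. (6) then scales and offsets it
    a = layers.Dense(ACTION_DIM, activation="tanh")(x)
    return tf.keras.Model(s, a)

def build_critic():
    s = layers.Input(shape=(STATE_DIM,))
    a = layers.Input(shape=(ACTION_DIM,))
    x = layers.Concatenate()([s, a])
    x = layers.Dense(HIDDEN, activation="relu")(x)
    x = layers.Dense(HIDDEN, activation="relu")(x)
    q = layers.Dense(1)(x)                     # scalar Q-value
    return tf.keras.Model([s, a], q)

# Behaviour networks and their target copies, initialized with identical weights
actor, critic = build_actor(), build_critic()
target_actor = tf.keras.models.clone_model(actor)
target_critic = tf.keras.models.clone_model(critic)
target_actor.set_weights(actor.get_weights())
target_critic.set_weights(critic.get_weights())
```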

Table 3 Simulation parameters
Table 4 DDPG network configuration

To find appropriate hyperparameter values such as the learning rate, the discount factor, and the target soft update factor, a performance comparison is made by applying different hyperparameter values to the DDPG algorithm. Figure 5 shows the comparison results used to determine the chosen values: a learning rate of 0.005, a discount factor of 0.9, and a soft update factor of 0.005 are chosen based on the experimental results (Table 5).

Fig. 5 Performance comparison of the DDPG algorithms using different hyperparameter values: a learning rate, b discount factor, and c soft update factor

Table 5 DDPG hyperparameters

Figure 6 shows the performance as the number of episodes in the iterative learning process increases for the Daeheung intersection digital twin. Due to the nature of DDPG, which utilizes noise components for continuous exploration during the learning process, performance fluctuations are observed over the iterative process. To illustrate the performance trend as learning progresses, the moving average is displayed. The dashed lines on the average sum-rate graph represent the top 30%, 10%, and 1% performance obtained from exhaustive search results. It can be observed that the DDPG algorithm achieves its maximum performance in proximity to the top 1% value. Figure 7 illustrates the trajectory of the UAV movement as the training progresses. In the early stages of learning, the (x, y) coordinates exhibit significant fluctuations; however, after approximately 100 episodes of training, the UAV position converges, with the altitude reaching around 50 m. Figure 8 represents the performance variations for the three target areas under consideration. The Daeheung intersection has a significantly smaller coverage area than the other two regions, resulting in improved sum-rate performance with stronger signal power and higher SINR values.

Fig. 6 Performance of the DDPG learning algorithm in terms of a the received power, b the SINR, and c the sum-rate. Blue line: actual value. Red line: moving average over 100 samples. Dashed line: top 1% performance. Dash-single dotted line: top 10% performance. Dash-double dotted line: top 30% performance

Fig. 7 UAV position adjustment with DDPG learning: a UAV trajectory within the digital twin, b UAV coordinate variations. Blue line: x-coordinate. Orange line: y-coordinate. Green line: z-coordinate

Fig. 8 Average sum-rate performance for three different target areas. Blue line: Ewha. Orange line: Sinchon. Green line: Daeheung

Figure 9 shows the performance comparison when two different types of noise processes are applied to the DDPG algorithm. Other than minor fluctuations over the learning process, no significant performance differences are observed between the two cases, suggesting that both the OU and Gaussian noise processes can be applied for our purpose of position optimization.

Fig. 9 Performance comparison of the DDPG algorithm using different noise processes. Red line: Gaussian noise. Green line: OU noise

Figure 10 shows the average sum-rate performance advantage of the proposed method over the non-twin scenarios. In the non-twin scenarios, exact knowledge of the transmission environment is not available to the ABS, and multi-user beamforming is performed without UAV position optimization based on the detailed geographical features of the wireless channel. Rank-adaptive ZF transmission and discrete Fourier transform (DFT) codebook-based transmission are performed at selected UAV positions, and the resulting average sum-rate is compared with that obtained when position optimization is applied using the digital twin model. As shown in the figure, the performance achieved using the DDPG and DQN algorithms with the aid of the constructed digital twin exhibits significantly higher sum-rate values, demonstrating the performance advantage of the proposed method utilizing the environment geometry information provided by the digital twin. The 3D coordinates in the figure legend indicate the different UAV positions used for transmission in the non-twin scenarios (Table 6).

Fig. 10 Performance comparison of the DDPG, DQN, and non-twin scenarios

Table 6 DQN hyperparameters

4 Conclusion

In this paper, we proposed a method to construct digital twins and to apply the DDPG reinforcement learning algorithm for UAV position optimization targeted to given geographic areas. The simulation results show that the digital twin based UAV optimization algorithm achieves the desired performance in terms of the received power, the SINR, and the UE sum-rate. The DDPG algorithm applied to multiple digital twins exhibited improving performance as the learning process progressed, suggesting that the proposed method applies well to various communication channels with different environmental setups.