1 Introduction

Meeting users' expectations for next-generation communication systems requires significant technical advancements. In particular, there is a demand for systems that can reliably support large-scale data transmission and reception, enabling key applications such as augmented/virtual reality, autonomous driving, high-precision product manufacturing, and real-time information sharing [1]. To address this demand, comprehensive networks in three-dimensional space that combine terrestrial networks with aerial/satellite/space communication using mobile aerial base stations (ABS) are under investigation [2,3,4,5], along with practical means of applying artificial intelligence (AI) algorithms [6,7,8] and digital twin technology [9,10,11].

Unmanned aerial vehicles (UAVs) serve as an economical and efficient means of implementing an ABS, as they can be deployed to areas with concentrated data demand while maintaining line-of-sight (LOS) links to user equipment (UE) for high-speed data transmission. UAVs can dynamically change their positions, providing flexible coverage of target areas. This not only enhances the quality of mobile communication services but has also proven effective in scenarios such as disasters or temporary surges in communication demand [12, 13]. A wireless communication system utilizing an ABS consists of a backhaul link between ground base stations (GBS) and the ABS, as well as an access link from the UAV to ground UEs. Due to the mobility of the UAV, both the backhaul and access links exhibit time-varying characteristics. Therefore, high-precision beamforming technology is required to ensure optimal transmission performance.

Optimizing communication systems by leveraging the high utility of UAVs together with AI algorithms has emerged as an important research topic. For UAV position prediction, a long short-term memory based recurrent neural network was designed in [13], and a new double deep Q-network model was proposed to maximize the sum-rate of an uplink system in which the UAV acts as a base station (BS) [14]. These results assume a stochastic loss channel model and rely on the assumption of perfect channel state information (CSI) [14, 15]. The use of the random waypoint mobility model and the reference point group mobility model to generate UE locations is a further drawback, as these assumptions may deviate from reality [15]. In [16], transmission links are assumed to suffer shadowed-Rician and Rayleigh fading. In [17], all channels are assumed to experience quasi-static block fading with statistical or perfect CSI, which can be applied to evaluate system performance in various environmental circumstances. Utilizing UAVs as ABSs with hybrid beamforming based on massive multiple-input multiple-output (MIMO) has been studied in [18], where bicycle UEs are generated on actual roads in Belgium; signal attenuation is calculated using ray-tracing to determine the channel, and beamforming is applied based on the ray-tracing channel. The utilization of a UAV as a node between road-side units and vehicles, aided by a deep learning-based resource allocation algorithm, is presented in [19], where environmental variables including time slots and local computation capacity are defined as digital representations.

Optimization of UAV-related parameters has also been studied intensively, with particular focus on the development and use of AI algorithms in the optimization process. Reinforcement learning algorithms such as the deep deterministic policy gradient (DDPG) and the deep Q-network (DQN) are widely used, for example to train UAVs to navigate safely to their destinations while avoiding obstacles [20] and to manage UAVs for maximum energy efficiency [21, 22]. The DQN has also been utilized to optimize the transmission power and location of a UAV ABS in order to maximize the data rate of ground UEs [23]. A hierarchical deep Q-network was developed for beam alignment between a UAV UE and a BS in the millimeter-wave band [24]; in this case, location information reduces the computational complexity of beam searching and efficiently maximizes the data rate. Optimization of the UAV BS trajectory and the phase shifts of intelligent reflecting surface elements to maximize the data rate and energy efficiency of UEs is discussed in [25], where the performance of DQN and DDPG is compared. An asymmetric long short-term memory-deep deterministic policy gradient algorithm is proposed to improve the ergodic sum-rate in satellite-aerial-ground integrated networks [26], and a multi-objective deep deterministic policy gradient algorithm is designed for trajectory optimization and beamforming design when UAVs are utilized as reconfigurable intelligent surfaces [27]. A joint optimization design for satellite-terrestrial integrated networks targeting sum-rate maximization is proposed in [28]. Kalman-filter-based methods, particle swarm optimization, and a Monte-Carlo coupling algorithm for optimizing UAV operation can be found in [29,30,31].

AI algorithms can be utilized to operate UAVs effectively, offering the advantage of automatically optimizing UAV parameters based on specific environmental conditions. In the process of acquiring the operating environment for UAVs, one can construct a digital twin for accurate simulation of the wireless channel and user mobility [9]. A versatile utilization of UAVs can be achieved by constructing a digital twin of an indoor space using the Unity software, allowing real-time remote control of UAVs [32]. 3-dimensional (3D) laser scanning can collect the point cloud data of an indoor workspace, and various 3D software tools such as 3ds Max can be used to create outline models of equipment; a highly realistic digital twin can then be created by importing the data into a rendering platform in FBX format [33]. Multiple-input single-output (MISO) downlink transmission modelling for UAV communications in such environments can be found in recent excellent works in [34] and [35]. These cases often involve performance evaluations based on ideal assumptions and virtual environments, and applying such virtual environments to real-world scenarios can be uneconomical and impractical due to disparities with the actual environment. To ensure more economical and practical operation of a UAV as the ABS, it is essential to implement a real-time analysis and control capable digital twin system. Such a system should consider the characteristics and conditions of time-varying environments, including user mobility and positions, base station locations, and phenomena such as radio scattering and refraction. Utilizing a digital twin allows virtual simulation and testing in various technological areas before actual deployment or operation, leading to savings in time and cost. In real-world scenarios where information is distributed and comprehensive monitoring is challenging, a digital twin environment enables integrated management by consolidating operational data, addressing the limitations of decentralized information.

The contributions of this paper are as follows. First, we present a systematic procedure for constructing a digital twin applicable to environment-aware channel coefficient generation for wireless communication systems. The procedure can be adopted for high-precision performance simulations and predictions in many wireless systems using the set of available software packages explained in this paper. By using the ray-tracing algorithm combined with exact building and terrain information of the environment, the resulting performance simulation provides a highly accurate position-dependent channel model, unlike the probabilistic channel modelling used in existing 3GPP specifications. Second, we demonstrate how the constructed digital twin can be utilized for performance enhancement of the wireless system by combining the twin model with AI algorithms. In particular, we apply reinforcement learning to optimize UAV positioning within the digital twin control system to maximize the sum-rate of multi-user signal transmission. This optimized position information can then be fed back to the physical environment for the desired UAV operation. In practical operational scenarios, the reinforcement learning algorithm is updated in real time to maximize the reward and produce the desired position for ABS signal transmission. Additionally, we validate our approach by comparing the resulting performance in operational scenarios with and without the digital twin. The rest of the paper is organized as follows. In Sect. 2, the system model, the construction procedure of the digital twin, the ray-tracing operation, and the DDPG algorithm are explained. Section 3 includes experimental evaluation results and related discussions. The conclusion is given in Sect. 4. The parameters and notation used in this paper are listed in Tables 1 and 2.

Table 1 Parameters and notation for the signal model, the digital twin, and ray-tracing
Table 2 Parameters and notations for DDPG

2 Methods/experimental

2.1 Signal model

A multi-user MISO downlink transmission scenario is considered, in which a BS equipped with \(M_{T}\) antenna elements transmits to \(K\) single-antenna UEs. The received signal vector \({\varvec{y}} = \left[ {y_{1} , y_{2} , \ldots , y_{K} } \right]^{{\text{T}}}\) for the \(K\) users is expressed as

$${\varvec{y}} = {\varvec{HFs}} + {\varvec{n}}$$
(1)

where \({\varvec{H}} = \left[ {{\varvec{h}}_{1}^{{\text{T}}} ,{\varvec{h}}_{2}^{{\text{T}}} , \ldots ,{\varvec{h}}_{K}^{{\text{T}}} } \right]^{{\text{T}}} \in C^{{K \times M_{T} }}\) is the channel matrix containing the kth user's channel vector \({\varvec{h}}_{k} = \left[ {h_{k,1} ,h_{k,2} , \ldots ,h_{{k,M_{T} }} } \right]\). The coefficients of each channel vector are obtained from the digital-twin-based ray-tracing simulation process. Also, \({\varvec{F}} = \left[ {{\varvec{f}}_{1} ,{\varvec{f}}_{2} , \ldots ,{\varvec{f}}_{K} } \right] \in C^{{M_{T} \times K}}\) is the beamforming matrix containing the vectors \({\varvec{f}}_{k} = \left[ {f_{k,1} ,f_{k,2} , \ldots ,f_{{k,M_{T} }} } \right]^{{\text{T}}}\) with the power constraint \({\text{Tr}}\left( {{\varvec{f}}_{k} {\varvec{f}}_{k}^{{\text{H}}} } \right) \le M_{T} /K\). The transmitted signal vector is denoted as \({\varvec{s}} = \left[ {s_{1} , s_{2} , \ldots , s_{K} } \right]^{{\text{T}}}\) with \(E\left[ {\left| {s_{k} } \right|^{2} } \right] = 1\), and \({\varvec{n}} = \left[ {n_{1} , n_{2} , \ldots , n_{K} } \right]^{{\text{T}}}\) is the additive white Gaussian noise vector with \(E\left[ {\left| {n_{k} } \right|^{2} } \right] = \sigma_{n}^{2}\). The received signal-to-interference-plus-noise ratio (SINR) of the kth user is given by

$$\Gamma_{k} = \frac{{\left| {{\varvec{h}}_{k} {\varvec{f}}_{k} } \right|^{2} }}{{\mathop \sum \nolimits_{l \ne k} \left| {{\varvec{h}}_{k} {\varvec{f}}_{l} } \right|^{2} + \sigma_{n}^{2} }}$$
(2)

and the achievable sum-rate for all UEs is determined as

$$R = \mathop \sum \limits_{k = 1}^{K} \log_{2} \left( {1 + \Gamma_{k} } \right).$$
(3)
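To make the signal model concrete, the following minimal NumPy sketch builds a zero-forcing beamformer satisfying the per-user power constraint and evaluates the SINR of Eq. (2) and the sum-rate of Eq. (3) for a random channel matrix; the antenna and user counts, the noise power, and the normalization are illustrative assumptions rather than the configuration used in the experiments.

```python
import numpy as np

def zf_sum_rate(H, sigma2=1e-3):
    """Zero-forcing beamforming, per-user SINR (Eq. 2), and sum-rate (Eq. 3).

    H      : (K, M_T) complex channel matrix, rows are the user channels h_k
    sigma2 : noise power sigma_n^2
    """
    K, M_T = H.shape
    # Unnormalized ZF beamformer: pseudo-inverse of H (columns are the f_k directions)
    F = np.linalg.pinv(H)                       # (M_T, K)
    # Scale each column so that Tr(f_k f_k^H) = ||f_k||^2 = M_T / K
    F = F / np.linalg.norm(F, axis=0, keepdims=True) * np.sqrt(M_T / K)

    G = H @ F                                   # effective channel, G[k, l] = h_k f_l
    signal = np.abs(np.diag(G)) ** 2
    interference = np.sum(np.abs(G) ** 2, axis=1) - signal
    sinr = signal / (interference + sigma2)     # Eq. (2)
    return np.sum(np.log2(1.0 + sinr))          # Eq. (3)

# Example with illustrative dimensions: 16 Tx antennas, 4 single-antenna UEs
rng = np.random.default_rng(0)
H = (rng.standard_normal((4, 16)) + 1j * rng.standard_normal((4, 16))) / np.sqrt(2)
print(f"ZF sum-rate: {zf_sum_rate(H):.2f} bit/s/Hz")
```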

2.2 Digital twin construction

The digital twin enables various accurate simulations; in our digital twin, ray-tracing simulation and DDPG reinforcement learning are conducted. The digital twin provides useful and accurate information and predictions to the physical space at a sufficiently short update period denoted by \(t_{{{\text{update}}}}\). In 5G new radio communication, the connection between the UE and the network is established through the initial access process, which provides signals for channel estimation and beam selection [36]. Once the connection is established via the initial access procedure, the channel information is continuously updated using the channel state information reference signal (CSI-RS) at time intervals of \(\tau \in\) {5, 10, 20, 40, 80, 160} ms. The proposed digital-twin-based UAV optimization process outputs the optimal position of the UAV at every \(t_{{{\text{update}}}}\) instance, as depicted in Fig. 1. Here, \(t_{{{\text{update}}}}\) is set shorter than the minimum 5G reference signalling interval, enabling enhanced communication performance. In a communication system where both the UEs and the UAV move, the signal coherence time is finite; therefore, \(t_{{{\text{update}}}}\) is also adjusted to be shorter than the overall coherence time.
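As a numeric illustration of this constraint, the short Python check below relates an assumed relative speed to the Doppler spread and the resulting coherence time using the common approximation \(T_{c} \approx 0.423/f_{D}\); the speed and carrier frequency are example values, not the experimental settings.

```python
# Illustrative check of the update-period constraint (example values only):
# coherence time T_c ~ 0.423 / f_D, with f_D = v * f_c / c the maximum Doppler shift.
c = 3e8            # speed of light [m/s]
f_c = 28e9         # carrier frequency [Hz], FR2 band
v = 15.0           # assumed relative UE/UAV speed [m/s]

f_D = v * f_c / c                 # ~1.4 kHz maximum Doppler shift
T_c = 0.423 / f_D                 # ~0.3 ms coherence time
t_update = 0.5 * T_c              # pick t_update below both T_c and the 5 ms CSI-RS minimum
print(f"f_D = {f_D:.0f} Hz, T_c = {T_c*1e3:.2f} ms, t_update = {t_update*1e3:.2f} ms")
```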

Fig. 1 Utilization of the digital twin to enhance the transmission performance in the physical channel

To construct the digital twin, information such as the map, terrain contours, building shapes, and object positions from the physical space is applied to 3D modelling so that the channel can be estimated accurately. Once the 3D model is constructed, the UAV position and the UE locations acquired by GPS in the physical space are applied to the digital twin. The UE locations can be actual values acquired by additional sensors. In our simulation, typical UE behaviours are obtained from Simulation of Urban MObility (SUMO), a traffic simulation software package that supports various modes of transportation and generates traffic flow based on embedded road information. These UE behaviours are applied to the previously generated 3D model to reflect the terrain. During experimentation, reinforcement learning is conducted using fixed UE location data, assumed to represent the real-time positions of moving vehicles generated by SUMO; these fixed locations result from sampling the vehicle movements. Using these locations, exact channel coefficients are generated by the ray-tracing procedure and used to estimate the optimized ABS position via DDPG reinforcement learning. The digital twin functionalities can be implemented as part of the GBS management software, conducting the ray-tracing and DDPG position optimization procedures to determine the ABS position. Only one UAV ABS is assumed within the target area of interest, and the wireless backhaul link to the UAV is assumed to be perfect, with no interference between the GBS downlink signal and the UAV ABS downlink signal.

Open-source tools are applied to generate the 3D model, incorporating the position and contour information of surrounding terrain and buildings. Blender, a 3D computer graphics software package, facilitates the import of various data formats into a workspace and enables the editing of objects' shapes or movements. Blender also allows users to download and utilize extra add-ons, such as BlenderGIS, to create the 3D communication environment model. The process of importing data into the Blender workspace involves the sequential execution of tasks: obtaining satellite images from Google Maps, topographical data from OpenTopography, and 3D building information from OpenStreetMap. This order of operations ensures that the building placement follows the contours of the terrain. However, the imported topographical data may deviate from Google Earth depending on the region; the elevation intensity of the terrain is therefore adjusted, referencing Google Earth, to closely align the terrain data with it. Furthermore, the building information imported from OpenStreetMap differs in aspects such as height and provides only a rough approximation of building shapes when compared with S-Map, a LiDAR-based 3D map provided by the Seoul Metropolitan Government. To enhance the accuracy, building information is supplemented by referencing S-Map, leading to the addition of buildings in accordance with its data. Mitsuba Blender is a Blender add-on for exporting the 3D communication environment model in XML format. To perform ray-tracing based on the digital twin, this add-on was employed to save the 3D model in XML format; the add-on is supported only in Blender v3.4.1. SUMO traffic generation is based on OpenStreetMap road information. OpenStreetMap is an open-source map that provides diverse information worldwide, including roads, trails, railways, and buildings; it is used by the BlenderGIS add-on to import building information and by SUMO to import road information. Sionna is an open-source communication simulation library developed by NVIDIA based on TensorFlow and Python. It enables link-level simulations for the physical layer of wireless communication systems and supports Python versions 3.6–3.9 and TensorFlow versions 2.10 and above. Installing NVIDIA GPU drivers, CUDA, and cuDNN allows for GPU-accelerated simulations.

To initiate terrain modelling, the BlenderGIS add-on is employed to import the image plane of the desired region from Google Maps and subsequently integrate the 3D topographic details provided by OpenTopography. It is crucial to verify the accuracy of the terrain information across different regions. Adjustments to the terrain contours are made using the strength parameter in Blender's Modifier Properties. Terrain refinements include adjustments to the coordinate space, such as the origin and orientation of the digital twin. After applying the terrain contours, the 3D building information from OpenStreetMap is imported into the workspace. It is important to note that, depending on the location and building size, OpenStreetMap may not include buildings in certain areas. In such cases, extra building objects are created within Blender, referencing Google Maps, to accurately represent the buildings in the specified region. After modelling the 3D communication environment through the aforementioned process, the material properties of objects are specified to facilitate ray-tracing based on Sionna. Guidance on object material assignments can be found in the documentation provided in [30]. The material of objects in the digital twin affects ray-tracing when signal reflection occurs, as the attenuation of a signal path is influenced by the material properties of the reflector.

UE modelling is performed by configuring vehicle generation parameters and the environment using SUMO. The target area's road information is downloaded from OpenStreetMap in OSM data format and transformed into NET.XML format to be compatible with SUMO. Given the potential disparities between the downloaded road information and the actual environment, adjustments are made in SUMO after importing the road information, including modification of lane count, lane direction, and shape. The SUMO program and JSON-formatted code are then utilized to configure parameters such as the number of vehicles, the vehicle coordinate sampling interval, lane-change behaviour, and traffic signal generation. SUMO is executed through Python-based scripting to generate 2-dimensional (2D) vehicle coordinate data. According to the Blender and SUMO documentation, the unit of parameters such as lane length and object coordinates in Blender and SUMO is the meter, while the sampling interval of vehicle coordinates in SUMO is given in seconds. The 3D communication environment model and the 2D vehicle coordinate data are combined to determine the altitude of the terrain in Blender; this information is used to lift the 2D coordinates into 3D vehicle coordinates, which is the final step in modelling the digital twin UEs. Figure 2 shows the visualization of the digital twin model for the Daeheung intersection near Sogang University Campus in Seoul. The origin is situated at the centre of the intersection, and an area of dimension 150 m × 150 m is created. UE locations are indicated by the blue cars in figure parts (a) and (b), which are the south and west views of the intersection, respectively, and the ABS location is shown by the red UAV. These figures illustrate the scenario of the UAV at the example position (0, 0, 50) [m]; the UAV position changes throughout the reinforcement learning process. The total number of UEs is 57, positioned within a 150 m × 150 m range on the xy-plane and spanning from −10 to 10 m in the z-direction, following the terrain's contours. In figure parts (c) and (d), the locations of the transmitter and the receivers are depicted in red and blue circles, respectively, for clearer representation.
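A minimal sketch of the vehicle-coordinate generation described above is given below, using SUMO's netconvert tool and the TraCI Python interface; the file names, sampling interval, and simulation length are illustrative assumptions rather than the actual configuration.

```python
import subprocess
import traci  # SUMO's Python API (ships with the SUMO installation)

# Convert OpenStreetMap road data to a SUMO network (file names are placeholders)
subprocess.run(["netconvert", "--osm-files", "daeheung.osm", "-o", "daeheung.net.xml"],
               check=True)

# Launch SUMO headless with a prepared configuration (routes, traffic lights, etc.)
traci.start(["sumo", "-c", "daeheung.sumocfg"])

sample_interval_s = 1.0     # vehicle coordinate sampling interval [s]
duration_s = 300.0          # total simulated time [s]
trajectories = {}           # vehicle id -> list of (t, x, y) in meters

t = 0.0
while t < duration_s:
    traci.simulationStep()  # advance by one step (step length of 1 s assumed here)
    for vid in traci.vehicle.getIDList():
        x, y = traci.vehicle.getPosition(vid)       # 2D vehicle coordinates [m]
        trajectories.setdefault(vid, []).append((t, x, y))
    t += sample_interval_s

traci.close()
# The 2D coordinates are later lifted to 3D by querying the terrain height in Blender.
```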

Fig. 2 The digital twin construction for the intersection near Sogang University Campus: a South view, b west view, c south view with Tx/Rx locations indicated by circles, d west view with Tx/Rx locations indicated by circles

2.3 Ray-tracing operation

Ray-tracing is a method of computing the signal propagation model with a modified geometrical method [37, 38]. Computing the direct rays and the reflected or diffracted rays is the main process of ray-tracing: direct rays are called line-of-sight (LoS) rays, and reflected/diffracted rays are called non-line-of-sight (NLoS) rays. Channel coefficients generated using ray-tracing based on exact geographical characteristics are known to closely match the physical channel values measured in the same experimental settings. We adopt the newly developed Sionna software library in Python for the ray-tracing operation [39], which is well matched to the 3D geometry information provided by the digital twin. The shooting-and-bouncing-rays algorithm implemented in Sionna has been verified to produce very accurate modelling of the physical channel in extensive investigations [40,41,42]. We apply precise transmitter and receiver locations within the pre-defined geometry, and exact transmission parameters such as those given in Table 1, to generate the channel coefficients used for performance evaluation. While conventional stochastic channel models define the number and distribution patterns of clusters to determine the path loss, angles of arrival and departure, and delay spread in a probabilistic fashion, the digital twin model provides deterministic channel parameters with proven accuracy at the geometric locations of interest. The channel coefficient from the mth antenna to the kth user can be expressed as

$$h_{k,m} = \mathop \sum \limits_{{n_{p} = 1}}^{{N_{p} }} a_{{n_{p} }} e^{{j2\pi f_{c} \tau_{{n_{p} }} }} \delta \left( {\tau - \tau_{{n_{p} }} } \right) \approx \mathop \sum \limits_{{n_{p} = 1}}^{{N_{p} }} a_{{n_{p} }} e^{{j2\pi f_{c} \tau_{{n_{p} }} }}$$
(4)

where \(n_{p}\) is the multi-path index, \(N_{p}\) is the number of multi-paths, \(a_{{n_{p} }}\) is the path attenuation, and \(\tau_{{n_{p} }}\) is the path delay of the \(n_{p}\)th multi-path. For most multi-carrier transmissions with sufficiently long symbol duration, the path delays \(\tau_{{n_{p} }}\), which average about 280 ns in our scenario, are negligibly small compared with the symbol duration; hence, the delta term in Eq. (4) is dropped and the channel coefficient is approximated by the sum of the complex path gains of the multi-paths. Figure 3 visualizes the ray-tracing result in the digital twin when the UAV is located at \(\left( {x, y, z} \right) = \left( {0,{ }0,{ }50} \right)\) [m].
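As an illustration of this step, the sketch below loads the exported digital twin into Sionna's ray-tracing module, places the UAV transmitter and one UE receiver, and collapses the computed path gains into a narrowband coefficient in the spirit of Eq. (4); the scene file name, array geometry, and positions are placeholders, and the exact API may vary across Sionna versions.

```python
import numpy as np
from sionna.rt import load_scene, Transmitter, Receiver, PlanarArray

# Load the XML scene exported from Blender via the Mitsuba add-on (placeholder file name)
scene = load_scene("daeheung.xml")
scene.frequency = 28e9  # carrier frequency f_c [Hz], FR2 band

# Antenna arrays: MISO downlink with M_T transmit antennas and single-antenna UEs
scene.tx_array = PlanarArray(num_rows=1, num_cols=16, vertical_spacing=0.5,
                             horizontal_spacing=0.5, pattern="iso", polarization="V")
scene.rx_array = PlanarArray(num_rows=1, num_cols=1, vertical_spacing=0.5,
                             horizontal_spacing=0.5, pattern="iso", polarization="V")

scene.add(Transmitter(name="uav", position=[0.0, 0.0, 50.0]))   # UAV ABS at (0, 0, 50) m
scene.add(Receiver(name="ue0", position=[35.0, -20.0, 1.5]))    # one sampled UE location

# Shooting-and-bouncing-rays computation of LoS and NLoS (reflected/diffracted) paths
paths = scene.compute_paths(max_depth=3, diffraction=True)
a, tau = paths.cir()            # complex path coefficients a_{n_p} and delays tau_{n_p}

# Narrowband approximation of Eq. (4): drop the delta term and sum the complex
# path coefficients (which already carry the carrier phase rotation) over the paths
a = a.numpy()                   # shape [..., num_paths, num_time_steps]
h = np.sum(a, axis=-2).squeeze()
print("Approximate channel coefficients from each Tx antenna to the UE:", h)
```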

Fig. 3 Illustration of ray-tracing for channel coefficient generation inside the digital twin

2.4 DDPG for UAV position optimization

We apply the DDPG algorithm to maximize the overall sum-rate performance of the wireless access link via UAV position optimization. DDPG is a reinforcement learning algorithm that extends the concept of DQN to handle continuous action spaces. Developed as an enhancement of the deterministic policy gradient algorithm, DDPG incorporates mechanisms such as experience replay, nonlinear function approximation, and target networks to improve the efficiency of data sampling and enhance the stability of learning [33]. DDPG enables an agent to execute continuous actions in a continuous state space and learns a policy in a model-free environment to perform a specific action at a specific state. It consists of four deep neural networks (DNNs): the policy network, the policy target network, the critic network, and the critic target network, along with one experience replay buffer. Given the state \(s\), the action \(a\), the reward \(r\), and the next state \(s^{\prime}\), the experience replay buffer stores tuples represented as \(\left( {s, a, r, s^{\prime} } \right)\).

The Markov decision process (MDP) can be used to explain the various considerations in DDPG. The MDP consists of the state space, action space, reward function, policy function \(\pi\), and state transition probability. When precise state transition probabilities are not readily available, as in typical real-world situations, MDPs can be solved through DDPG learning. The state represents the information on the environment observed after the agent interacts with it. The DDPG environment consists of the UAV ABS within the digital twin and the downlink communication system from the UAV ABS to the UEs. We set the state to the position of the UAV in the 3D coordinate system, expressed as the vector \({\varvec{s}} = \left( {s_{1} , s_{2} , s_{3} } \right)\), where each element represents the coordinate along the \(x\), \(y\), and \(z\)-axis, respectively. The action space is the set of available actions, and an action is the output of the policy function given a state as input. The agent learns the policy network to maximize the Q-value and then performs the action obtained from it; the Q-value is the expected value of the rewards accumulated throughout the learning process. The action is generated by adding noise to the output of the policy network and has dimension 3, so it can be expressed as \({\varvec{a}} = \left( {a_{1} , a_{2} , a_{3} } \right) = \pi \left( {\varvec{s}} \right) + {\varvec{n}}_{{\varvec{l}}}\), where \({\varvec{n}}_{{\varvec{l}}}\) is the added noise vector. The noise provides exploration capability to the DDPG model, thereby facilitating enhanced learning. Commonly used types of noise are the Ornstein–Uhlenbeck (OU) noise and the Gaussian noise [43]. The Ornstein–Uhlenbeck process is a random process that refines the white-noise model by filtering out high frequencies, generating noise that depends on previous states [44]. OU noise was therefore chosen instead of Gaussian noise, matching the characteristic of reinforcement learning in which previous learning influences subsequent learning.
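A minimal sketch of the OU exploration noise added to the policy output is shown below; the process parameters are illustrative and not those listed in Table 2.

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated exploration noise n_l
    added to the policy output, a = pi(s) + n_l."""

    def __init__(self, dim=3, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2, seed=0):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.rng = np.random.default_rng(seed)
        self.x = np.full(dim, mu, dtype=float)   # noise state, reused across steps

    def reset(self):
        self.x[:] = self.mu

    def sample(self):
        # dx = theta * (mu - x) dt + sigma * sqrt(dt) * dW  (Euler-Maruyama step)
        dx = self.theta * (self.mu - self.x) * self.dt \
             + self.sigma * np.sqrt(self.dt) * self.rng.standard_normal(self.x.shape)
        self.x = self.x + dx
        return self.x

# Toy usage: perturb a (zero) policy output to obtain an exploratory action
noise = OUNoise(dim=3)
a = np.zeros(3) + noise.sample()
```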

After the agent performs an action, the reward and the Q-value are updated. Maximizing the Q-value is the goal of DDPG learning because it leads to the maximum reward throughout the learning process. The agent interacts with the environment by performing actions in a given state and receiving rewards. We determine the sum-rate by performing rank-adaptive zero-forcing (ZF) beamforming using the generated channel coefficients. The average sum-rate is set as the reward, which is expressed as

$$r = \frac{1}{{N_{R} }}\mathop \sum \limits_{n = 1}^{{N_{R} }} \mathop \sum \limits_{k = 1}^{K} \log_{2} \left( {1 + \Gamma_{k} } \right)$$
(5)

where \(N_{R}\) represents the transmission rank. The policy function determines which action the agent will take based on the state, aiming to maximize the Q-value. In the DDPG model, the policy network serves the role of the policy function, and it is updated in the direction of maximizing the Q-value. The agent takes an action in the environment by adding noise to the output of the policy network and using the result as the action. The UAV position for the next state is determined as

$${\varvec{s}}^{\prime} = \left( {s_{1}^{\prime} , s_{2}^{\prime} , s_{3}^{\prime} } \right) = \left( {\alpha_{1} a_{1} + c_{1} , \alpha_{2} a_{2} + c_{2} ,\alpha_{3} a_{3} + c_{3} } \right)$$
(6)

where \(\{ \alpha_{1}\), \(\alpha_{2}\), \(\alpha_{3} \}\) is the set of scaling factors and \(\{ c_{1}\), \(c_{2}\), \(c_{3} \}\) is the set of offset values, which together determine the range of the optimized UAV position. DDPG learning progresses by continuously updating the behaviour networks and the target networks. The behaviour networks consist of the critic network and the policy network. The critic network takes both the state and the action as input and determines the Q-value given by

$$Q^{\mu } \left( {s_{{t_{i} }} , a_{{t_{i} }} } \right) = {\text{E}}_{{r_{{t_{i} }} , s_{{t_{i} }}^{\prime} \sim {\mathcal{B}}}} \left[ {r_{{t_{i} }} + \gamma Q^{\mu } \left( {s_{{t_{i} }}^{\prime} , \mu \left( {s_{{t_{i} }}^{\prime} } \right)} \right)} \right]$$
(7)

where \({\mathcal{B}} = \left\{ {\left( {s_{{t_{1} }} , a_{{t_{1} }} , r_{{t_{1} }} , s_{{t_{1} }}^{\prime} } \right), \ldots ,\left( {s_{{t_{{N_{{{\text{mb}}}} }} }} , a_{{t_{{N_{{{\text{mb}}}} }} }} , r_{{t_{{N_{{{\text{mb}}}} }} }} , s_{{t_{{N_{{{\text{mb}}}} }} }}^{\prime} } \right)} \right\}\) is a mini-batch of \(N_{{{\text{mb}}}}\) tuples randomly sampled from the experience replay buffer \({\mathcal{D}}\). The weights of the critic network are updated in the direction of minimizing the temporal-difference (TD) error

$$L\left( {\theta^{Q} } \right) = \frac{1}{{N_{{{\text{mb}}}} }}\mathop \sum \limits_{i = 1}^{{N_{{{\text{mb}}}} }} \left( {y_{{t_{i} }} - Q\left( {s_{{t_{i} }} , a_{{t_{i} }} } \right)} \right)^{2}$$
(8)

where \(y_{{t_{i} }}\) is the TD target, computed using the critic target network and the policy target network. It can be expressed as

$$y_{{t_{i} }} = r_{{t_{i} }} + \gamma Q^{\prime} \left( {s_{{t_{i} }}^{\prime} ,\mu^{\prime} \left( {s_{{t_{i} }}^{\prime} } \right)} \right)$$
(9)

where \(Q^{\prime}\) and \(\mu^{\prime}\) denote the critic target network and the policy target network, respectively. The policy network takes the state as the input, and the weights of the policy network are updated to maximize the Q-value using the loss function

$$L\left( {\theta^{\mu } } \right) = - \frac{1}{{N_{{{\text{mb}}}} }}\mathop \sum \limits_{i = 1}^{{N_{{{\text{mb}}}} }} Q\left( {s_{{t_{i} }} , \mu \left( {s_{{t_{i} }} } \right)} \right)$$
(10)

Using the loss function \(L\left( {\theta^{\mu } } \right)\), the policy network is updated by the policy gradient as

$$\nabla_{{\theta^{\mu } }} J \approx \frac{1}{{N_{{{\text{mb}}}} }}\mathop \sum \limits_{i = 1}^{{N_{{{\text{mb}}}} }} \nabla_{a} Q\left( {s_{t} ,a_{t} {|}\theta^{Q} } \right)\left. \right|_{{s_{t} = s_{{t_{i} }} , a = \mu \left( {s_{{t_{i} }} } \right)}} \nabla_{{\theta^{\mu } }} \mu \left( {s_{t} {|}\theta^{\mu } } \right)\left. \right|_{{s_{{t_{i} }} }}$$
(11)

where \(J\) is the expected value of the rewards obtained under a given policy, and \(\mu \left( {s_{t} {|}\theta^{\mu } } \right)\) represents the policy network. The weights of the target networks are softly updated as

$${\varvec{\theta}}_{t}^{{Q^{\prime}}} \leftarrow \tau {\varvec{\theta}}_{t}^{Q} + \left( {1 - \tau } \right){\varvec{\theta}}_{t}^{{Q^{\prime}}}$$

and

$${\varvec{\theta}}_{t}^{{\mu^{\prime}}} \leftarrow \tau {\varvec{\theta}}_{t}^{\mu } + \left( {1 - \tau } \right){\varvec{\theta}}_{t}^{{\mu^{\prime}}}$$
(12)

with the target soft update factor \(\tau \ll 1\). Figure 4 represents the process of updating and learning in the DDPG model, along with the inputs and outputs of the variables. By performing ray-tracing simulation simultaneously with reinforcement learning, one can monitor the characteristics and performance of the communication system in real time as the UAV position is updated. Moreover, the reinforcement learning algorithm immediately reflects the current state of the communication system and optimizes the UAV position with real-time information.

Fig. 4 UAV position optimization DDPG algorithm

The algorithm below describes the update process of the proposed DDPG model used for position optimization of the UAV.

Algorithm 1 Update process of the proposed DDPG model for UAV position optimization
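As a concrete companion to the algorithm above, the following TensorFlow sketch shows one mini-batch update corresponding to Eqs. (8)–(12); the network objects, optimizers, and batch format are assumptions for illustration and do not reproduce the exact implementation.

```python
import tensorflow as tf

gamma, tau = 0.9, 0.005   # discount factor and target soft update factor (values chosen in Sect. 3)

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt):
    """One DDPG mini-batch update following Eqs. (8)-(12)."""
    s, a, r, s_next = batch   # tensors of shape (N_mb, 3), (N_mb, 3), (N_mb,), (N_mb, 3)

    # TD target (Eq. 9), computed with the target networks
    y = r + gamma * tf.squeeze(target_critic([s_next, target_actor(s_next)]), axis=1)

    # Critic update: minimize the TD error (Eq. 8)
    with tf.GradientTape() as tape:
        q = tf.squeeze(critic([s, a]), axis=1)
        critic_loss = tf.reduce_mean(tf.square(y - q))
    grads = tape.gradient(critic_loss, critic.trainable_variables)
    critic_opt.apply_gradients(zip(grads, critic.trainable_variables))

    # Actor update: maximize Q, i.e. minimize its negative (Eqs. 10-11)
    with tf.GradientTape() as tape:
        actor_loss = -tf.reduce_mean(critic([s, actor(s)]))
    grads = tape.gradient(actor_loss, actor.trainable_variables)
    actor_opt.apply_gradients(zip(grads, actor.trainable_variables))

    # Soft update of the target networks (Eq. 12)
    for target, source in ((target_critic, critic), (target_actor, actor)):
        for tw, sw in zip(target.variables, source.variables):
            tw.assign(tau * sw + (1.0 - tau) * tw)
```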

3 Results and discussion

We consider three different intersection areas near Sogang University: the 'Daeheung', 'Sinchon', and 'Ewha' intersections. The geographical and building information for these areas is applied to the digital twin model. The digital twin for the Daeheung intersection covers a range of 150 m × 150 m, while the digital twins for the Ewha and Sinchon intersections are constructed over a range of 300 m × 300 m centred on the origin. Actual buildings, terrain, and road infrastructure in the corresponding areas are incorporated into the digital twin. The ZF precoder is generated based on the ray-tracing channel, and the users are scheduled in a rank-adaptive fashion. When applying the average sum-rate as the reward, \(K\) individual UEs are selected from the possible UE locations, and the sum-rate is evaluated accordingly. The carrier frequency values are selected from the millimeter-wave FR2 band.

The policy network of DDPG consists of two fully connected layers with ReLU activation functions. At the final layer, where the action is derived, the hyperbolic tangent activation function is applied to constrain each action value between −1 and 1. Similarly, the critic network of DDPG consists of two fully connected layers with ReLU activation functions. In the training process, we utilize 5000 training episodes with a replay buffer capacity of 100,000 and a mini-batch size of 8. Detailed simulation parameters and DDPG network configurations are listed in Tables 3 and 4.
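The following Keras sketch reflects the stated architecture, two fully connected ReLU layers with a tanh output for the actor and a scalar output for the critic; the hidden-layer widths are placeholders, since the actual configuration is given in Table 4.

```python
import tensorflow as tf
from tensorflow.keras import layers

STATE_DIM = ACTION_DIM = 3   # UAV position state and position-adjustment action
HIDDEN = 256                 # illustrative width; see Table 4 for the actual configuration

def build_actor():
    s = layers.Input(shape=(STATE_DIM,))
    x = layers.Dense(HIDDEN, activation="relu")(s)
    x = layers.Dense(HIDDEN, activation="relu")(x)
    # tanh keeps each action component in (-1, 1); Eq. (6) then scales and offsets it
    a = layers.Dense(ACTION_DIM, activation="tanh")(x)
    return tf.keras.Model(s, a)

def build_critic():
    s = layers.Input(shape=(STATE_DIM,))
    a = layers.Input(shape=(ACTION_DIM,))
    x = layers.Concatenate()([s, a])
    x = layers.Dense(HIDDEN, activation="relu")(x)
    x = layers.Dense(HIDDEN, activation="relu")(x)
    q = layers.Dense(1)(x)                     # scalar Q-value
    return tf.keras.Model([s, a], q)

# Behaviour networks and their target copies, initialized with identical weights
actor, critic = build_actor(), build_critic()
target_actor = tf.keras.models.clone_model(actor)
target_critic = tf.keras.models.clone_model(critic)
target_actor.set_weights(actor.get_weights())
target_critic.set_weights(critic.get_weights())
```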

Table 3 Simulation parameters
Table 4 DDPG network configuration

To find appropriate hyperparameter values such as the learning rate, the discount factor, and the target soft update factor, a performance comparison is made by applying different hyperparameter values to the DDPG algorithm. Figure 5 shows the comparison results used to determine the chosen values: a learning rate of 0.005, a discount factor of 0.9, and a soft update factor of 0.005 are chosen based on the experimental results (Table 5).

Fig. 5 Performance comparison of the DDPG algorithms using different hyperparameter values: a learning rate, b discount factor, and c soft update factor

Table 5 DDPG hyperparameters

Figure 6 shows the performance as the number of episodes in the iterative learning process increases for the Daeheung intersection digital twin. Due to the nature of DDPG, which utilizes noise components for continuous exploration during the learning process, performance fluctuations are observed over the iterative process. To illustrate the performance trend as learning progresses, the moving average is displayed. The dashed lines on the average sum-rate graph represent the top 30%, 10%, and 1% performance obtained from exhaustive search results. It can be observed that the DDPG algorithm achieves its maximum performance in proximity to the top 1% value. Figure 7 illustrates the trajectory of the UAV movement as the training progresses. In the early stages of learning, the (x, y) coordinates exhibit significant fluctuations; however, after approximately 100 episodes of training, the UAV position converges, with the altitude reaching around 50 m. Figure 8 represents the performance variations for the three target areas under consideration. The Daeheung intersection has a significantly smaller coverage area than the other two regions, resulting in improved sum-rate performance with stronger signal power and higher SINR values.

Fig. 6 Performance of the DDPG learning algorithm in terms of a the received power, b the SINR, and c the sum-rate. Blue line: actual value. Red line: moving average over 100 samples. Dashed line: top 1% performance. Dash-single dotted line: top 10% performance. Dash-double dotted line: top 30% performance

Fig. 7 UAV position adjustment with DDPG learning: a UAV trajectory within the digital twin, b UAV coordinate variations. Blue line: x-coordinate. Orange line: y-coordinate. Green line: z-coordinate

Fig. 8 Average sum-rate performance for three different target areas. Blue line: Ewha. Orange line: Sinchon. Green line: Daeheung

Figure 9 shows the performance comparison when two different types of noise processes are applied to the DDPG algorithm. Other than minor fluctuations over the learning process, no significant performance differences are observed between the two cases, suggesting that both the OU and Gaussian noise processes can be applied for our purpose of position optimization.

Fig. 9 Performance comparison of the DDPG algorithm using different noise processes. Red line: Gaussian noise. Green line: OU noise

Figure 10 shows the average sum-rate performance advantage of the proposed method over the non-twin scenarios. In the non-twin scenarios, exact knowledge of the transmission environment is not available to the ABS, and multi-user beamforming is performed without UAV position optimization based on the detailed geographical features of the wireless channel. Rank-adaptive ZF transmission and discrete Fourier transform (DFT) codebook-based transmission are performed at selected UAV positions, and the resulting average sum-rate is compared with that obtained when position optimization is applied using the digital twin model. As shown in the figure, the performance achieved using the DDPG and DQN algorithms with the aid of the constructed digital twin exhibits significantly higher sum-rate values, demonstrating the performance advantage of the proposed method utilizing the environment geometry information provided by the digital twin. The 3D coordinates in the figure legend indicate the different UAV positions used for transmission in the non-twin scenarios (Table 6).

Fig. 10 Performance comparison of the DDPG, DQN, and non-twin scenarios

Table 6 DQN hyperparameters

4 Conclusion

In this paper, we proposed a method to construct digital twins and to apply the DDPG reinforcement learning algorithm for UAV position optimization targeted to given geographic areas. The simulation results show that the digital twin based UAV optimization algorithm achieves the desired performance in terms of the received power, the SINR, and the UE sum-rate. The DDPG algorithm applied to multiple digital twins exhibited improving performance as the learning process progressed, suggesting that the proposed method applies well to various communication channels with different environmental setups.