1 Introduction

With the large-scale commercial deployment of fifth-generation (5G) mobile networks and the industrial internet of things (IIoT), ultra-reliable and low-latency communication (URLLC) is urgently needed in many mission-critical applications, such as the automotive industry, remote surgery, and autonomous driving [1, 2]. Since URLLC scenarios impose extremely stringent performance requirements, there is increasing interest in developing new communication technologies that increase reliability and reduce latency. For example, wireless transmission such as UAV communication is expected to replace wired connections in the IIoT to increase flexibility and reduce infrastructure cost [3]. These trends impose challenges on wireless transmission subject to latency and reliability constraints.

URLLC has been regarded as one of the three pillar use cases of 5G communications, with application scenarios including factory automation, autonomous driving, and remote surgery [4]. The essential targets established by the international standardization bodies are 99.999% reliability and 1 ms latency for URLLC systems [4]. The standard is designed to support future industrial automation and factory deployments through wireless communication. In URLLC systems, subcarriers play a critical role in achieving the required levels of reliability and latency [5]. However, it remains challenging to achieve the latency and reliability objectives stated in the URLLC standard; accordingly, 5G new radio technology has been developed to meet the communication requirements of industrial automation. Mini-slot structures have been proposed as a potential solution for supporting short-packet communications [6]. Most existing wireless communication systems are designed for long packet transmissions, for which Shannon's capacity formula is valid; however, this formula is not suitable for short-packet communication. Fortunately, a unified framework for establishing tight bounds on coding rates under short-packet assumptions is available in [7]. In the finite blocklength regime, the authors in [8] adopted the SNR distribution to derive the corresponding average rate and average error probability. Additionally, the authors in [9] jointly optimized the blocklength and power allocation to minimize the decoding error probability. However, URLLC networks require real-time data transmission for IIoT applications. UAV-assisted wireless communication can play an important role in addressing this challenge, since a UAV can make fast decisions and move in a controllable manner toward the ground users to reduce the path loss [10].

Owing to their high mobility, wide coverage, and low latency, UAV communications have emerged as a promising technology for driving URLLC services in the IIoT [11]. The utilization of UAVs in the IIoT can improve the automation capabilities and data analysis of industrial processes, leading to potential benefits such as improved reliability, cost reduction, and wide coverage. Compared with typical terrestrial wireless communications, UAV-assisted wireless communication can achieve a high sum rate thanks to the dominant line-of-sight (LoS) connection with the ground user [10]. A UAV can dynamically adjust its position to optimize performance and meet quality-of-service requirements. UAVs can function as relays or base stations (BSs) to establish connections between transmitting and receiving devices [12, 13]. To enhance URLLC communication, the authors in [14] optimized resource allocation, including blocklength allocation, power control, trajectory planning, and energy harvesting, aiming to minimize the decoding error probability. The authors in [15] jointly optimized UAV node placement, uplink power, and UAV-IoT device association in a multi-UAV IoT communication system, in which UAVs act as BSs to gather data from IoT devices. To jointly optimize the total rate and reduce the transmit power of the UAV, the authors in [16] investigated the resource optimization problem for 5G networks with UAV assistance. In future applications, the large-scale connection of numerous users necessitates the consideration of various quality-of-service requirements. The decision-making behavior of each user can significantly affect the interference levels of others, while the power control and blocklength problems are interdependent. To optimize the transmit power and blocklength while addressing these issues, deep reinforcement learning (DRL) offers an efficient approach. By leveraging stored experience and deep neural networks to learn from the environment, DRL is highly adept at non-convex problems. The authors in [17] proposed a DRL-based resource allocation scheme to guarantee high reliability and low latency for each wireless user under data rate constraints. The authors in [18] applied a DRL method to solve the subcarrier power allocation problem in device-to-device communication; the proposed algorithm is well suited to the dynamic changes of the wireless environment. A deep-learning-based method was proposed in [19] for the joint optimization of reconfigurable intelligent surfaces and the power allocation of the access point across each subcarrier. However, the joint allocation of blocklength and transmit power under the energy constraints of UAVs remains a significant challenge in UAV-assisted URLLC communication.

In this work, we consider an uplink UAV-assisted URLLC system, in which the ground users share subcarriers to transmit their information to multiple UAVs. This work aims to jointly optimize the blocklength allocation and subcarrier transmit power to maximize the communication sum rate subject to the UAVs’ energy consumption. The main contributions of this work can be summarized as follows:

  • We formulate a communication rate maximization problem in the uplink UAV-assisted URLLC communication system with blocklength, energy consumption, and transmit power constraints. The coupled objective function and constraints make intelligent decision-making challenging in this complex communication environment.

  • To solve this non-convex problem, we propose a distributed DRL-based scheme to jointly optimize the transmit power and the blocklength. In this scheme, each subcarrier acts as an agent, making decisions on blocklength and transmit power based on the reward value. A weighted segmented reward function related to the UAV energy consumption and user rates is proposed to improve the rate performance.

  • The effectiveness and convergence of the proposed blocklength allocation and power control scheme are evaluated. According to the simulation results, the proposed scheme outperforms the Q-learning-based scheme and the greedy scheme under different maximum transmit powers and decoding error probabilities.

2 Related Work

In this section, we review previous studies on UAV-assisted URLLC systems, with an emphasis on URLLC services, resource allocation in UAV-enabled wireless networks, and the application of DRL-based optimization methods.

Compared to traditional terrestrial networks, UAV-assisted URLLC systems can leverage the flexibility and maneuverability of UAVs to dynamically adjust their deployment positions and form good channel links, achieving higher reliability and lower latency. The authors in [20] formulated the sum rate maximization problem in a UAV-assisted URLLC system coexisting with ground users. For the case in which the central ground station transmits control signals to the UAV, the average packet error probability and effective throughput were studied in [21]. The authors in [22] addressed the physical layer security problem in URLLC systems, in which UAVs are regarded as a crucial component for mitigating physical layer security risks; they compared using UAVs as relays to enhance the secrecy rate with employing UAVs as jammers to alleviate the impact of eavesdropping attacks on URLLC communication. A task offloading method from devices to a UAV was proposed in [23] to fulfill low-latency demands by jointly optimizing the computing times of the devices and the UAV, the offloading bandwidths, and the location of the UAV. Moreover, the authors in [25] investigated a resource allocation method for supporting URLLC in a bidirectional relay system by leveraging the advantages of both UAVs and URLLC; they jointly optimized the time, the bandwidth, and the UAV position to maximize the transmission rate of the backward link under the URLLC constraints of the forward link. By employing the age of information (AoI) as a new system metric, the authors in [24] proposed an energy-efficient resource allocation scheme to support UAV-assisted URLLC communication systems under blocklength and imperfect channel condition constraints.

The DRL algorithm is widely used in resource allocation and combinatorial optimization problems to solve complex nonlinear challenges effectively [26]-[28], since it combines an exploratory learning approach with the high-dimensional feature extraction capability of deep neural networks. The authors in [26] focused on power control for a UAV-assisted URLLC system and incorporated a deep neural network for channel estimation. The authors in [27] employed an actor-critic multi-agent deep reinforcement learning algorithm in vehicular networks to obtain the optimal allocation of frequency, computation, and caching resources. Novel centralized and federated deep reinforcement learning frameworks were proposed in [28], aimed at optimizing the downlink URLLC transmission of new radio in unlicensed spectrum coexisting with WiFi systems by dynamically adjusting the energy detection threshold.

In this paper, to solve the complex combinatorial optimization problem under the UAV energy consumption, blocklength, and transmit power constraints, the DRL algorithm is more suitable than traditional optimization methods such as convex optimization and heuristic algorithms. Due to the rate expression and the blocklength constraint, the optimization problems in URLLC systems are typically non-convex [28]. Convex optimization methods aim to obtain the optimal solution [29]; however, it is difficult to solve non-convex problems in high-dimensional spaces, or the required computation increases the computation time and energy consumption substantially [30]. Approximate algorithms [31], such as the greedy algorithm, the local search algorithm, and relaxation algorithms, can yield approximate solutions, but when the problem is complex, the quality of the approximation may be unacceptable. Heuristic algorithms [32] also fail to adapt effectively to dynamic environments and lack adaptability to the system. Difficulty in effectively exploring and exploiting these spaces results in suboptimal solutions or inefficient computation. Thus, for high-dimensional and non-convex problems, the self-exploratory learning approach of DRL algorithms can achieve a rapid solution.

3 System Model

Fig. 1
figure 1

System model for multi-user UAV-assisted URLLC communication network

We consider an uplink UAV-assisted URLLC network consisting of a BS, \(\mathcal {M}\) single-antenna UAVs, and multiple ground users, in which the UAVs can exchange information with the BS and then collect information from the ground users, as shown in Fig. 1. Due to the limited energy of UAVs, each UAV has a specific communication area. For simplicity, let \({J}_m\) represent the ground user group of the m-th UAV, in which each user \(k_j \in {J}_m\), and all the ground users share \(\mathcal {N}\) subcarriers. Furthermore, let H denote the flight height of the UAV, chosen to increase the probability of the LoS link. Thus, the three-dimensional (3D) positions of the m-th UAV and the \(k_j\)-th user can be expressed as \((x_m^\textrm{U},y_m^\textrm{U}, H)\) and \((x_{k_j,m}^\textrm{G},y_{k_j,m}^\textrm{G},0)\), respectively. As a result, the distance between the m-th UAV and the \(k_j\)-th user is \(d_{k_j,m}^\textrm{3D} = \sqrt{(x_{m}^\textrm{U}-x_{k_j,m}^\textrm{G})^{2}+(y_{m}^\textrm{U}-y_{k_j,m}^\textrm{G})^{2}+H^{2}}\), \((\forall m \in \mathcal {M}, \forall k_j \in {J}_{m})\). The horizontal distance between the m-th UAV and the \(k_j\)-th user is \(d_{k_j,m}^\textrm{2D} = \sqrt{ (x_{m}^\textrm{U}-x_{k_j,m}^\textrm{G})^{2}+(y_{m}^\textrm{U}-y_{k_j,m}^\textrm{G})^{2}}\), \((\forall m \in \mathcal {M}, \forall k_j \in {J}_{m})\).

The probabilistic LoS channel model introduced in [34, 35] is adopted in this system to characterize the channel more accurately. The LoS probability \(\eta _{k_j,m}^\mathrm{{LoS}}\) of the link between the m-th UAV and the \(k_j\)-th user depends on the elevation angle \(\theta _{k_j,m}\), which is given by \(\theta _{k_j,m} = \arccos (\frac{d_{k_j,m}^\textrm{2D}}{d_{k_j,m}^\textrm{3D}})\), \((\forall m \in \mathcal {M}, \forall k_j \in {J}_{m})\). Therefore, the LoS probability is modeled as follows

$$\begin{aligned} \eta _{k_j,m}^\mathrm{{LoS} } (\theta _{k_j,m}) = \frac{1}{1 + A_2 \exp (- A_1[\theta _{k_j,m} - A_2])}, \end{aligned}$$
(1)

where the parameters \(A_1\) and \(A_2\) depend on the environment [35]. Based on the LoS probability, the non-line-of-sight (NLoS) probability of the link between the m-th UAV and the \(k_j\)-th user is denoted as \(\eta _{k_j,m}^\mathrm{{NLoS}}(\theta _{k_j,m}) = 1 - \eta _{k_j,m}^\mathrm{{LoS}}(\theta _{k_j,m})\), \((\forall m \in \mathcal {M}, \forall k_j \in {J}_{m})\) [35]. The pathloss of the LoS link between the m-th UAV and the \(k_j\)-th user is given as [36]

$$\begin{aligned} PL_{k_j,m}^{\textrm{LoS}} = 28 + 20 \log _{10} f_c + 22 \log _{10} {d_{k_j,m}^\textrm{3D}}, \end{aligned}$$
(2)

where \(f_c\) is the carrier frequency. Furthermore, the pathloss of the NLoS link is given by [36]

$$\begin{aligned} \begin{aligned} PL_{k_j,m}^{\textrm{NLoS}} =&-17.5 + 20 \log _{10} \frac{4 \pi f_c}{3} + 46 \log _{10} {d_{k_j,m}^\textrm{3D} }\\&- 7\log _{10} H \log _{10} {d_{k_j,m}^\textrm{3D} }. \end{aligned} \end{aligned}$$
(3)

Therefore, the channel power gain \(h_{k_j,m}, (\forall m \in \mathcal {M}, \forall k_j \in {J}_{m}) \) between the m-th UAV and \(k_j\)-th user can be written as

$$\begin{aligned} h_{k_j,m} = \eta _{k_j,m}^\mathrm{{LoS}} (\theta _{k_j,m}) PL_{k_j,m}^\textrm{LoS} + \eta _{k_j,m}^\mathrm{{NLoS}} (\theta _{k_j,m})PL_{k_j,m}^\textrm{NLoS}. \end{aligned}$$
(4)
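For concreteness, the following NumPy sketch evaluates Eqs. (1)–(4) for a single UAV-user link. It assumes the carrier frequency is given in GHz and distances in metres, treats the elevation angle in degrees, and uses placeholder values for the environment parameters \(A_1\) and \(A_2\); none of these numbers are taken from the paper, and the probability-weighted path loss in dB is returned as the quantity the paper uses as the channel gain h.

```python
# Hedged sketch of the probabilistic air-to-ground channel model, Eqs. (1)-(4).
import numpy as np

def average_path_loss_db(uav_xy, user_xy, H, fc_ghz, A1=0.16, A2=9.61):
    """Probability-weighted path loss (dB) between a UAV at height H and a ground user."""
    d2d = np.hypot(uav_xy[0] - user_xy[0], uav_xy[1] - user_xy[1])
    d3d = np.sqrt(d2d ** 2 + H ** 2)

    # Elevation angle in degrees: cos(theta) = d2D / d3D.
    theta_deg = np.degrees(np.arccos(d2d / d3d))

    # Eq. (1): LoS probability with environment-dependent parameters A1, A2.
    p_los = 1.0 / (1.0 + A2 * np.exp(-A1 * (theta_deg - A2)))

    # Eq. (2): LoS path loss in dB (fc in GHz, distances in metres assumed).
    pl_los = 28.0 + 20.0 * np.log10(fc_ghz) + 22.0 * np.log10(d3d)

    # Eq. (3): NLoS path loss in dB.
    pl_nlos = (-17.5 + 20.0 * np.log10(4.0 * np.pi * fc_ghz / 3.0)
               + (46.0 - 7.0 * np.log10(H)) * np.log10(d3d))

    # Eq. (4): average over the LoS/NLoS states.
    return p_los * pl_los + (1.0 - p_los) * pl_nlos

if __name__ == "__main__":
    print(average_path_loss_db(uav_xy=(0.0, 0.0), user_xy=(50.0, 30.0),
                               H=100.0, fc_ghz=2.0))
```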

Subsequently, let \({K}_n\) denote the set of users occupying the n-th subcarrier. According to (4), the received signal at the m-th UAV from the \(k_j\)-th user over the n-th subcarrier can be expressed as

$$\begin{aligned} y_{k_{j},m,n}=\sum \limits _{k_j \in K_n} h_{k_j,m}s_{k_j,m,n}+z_{m,n}, \end{aligned}$$
(5)

where \(s_{k_j,m,n}\), \((\forall m \in \mathcal {M}, \forall k_j \in {J}_{m})\) denotes the transmit symbol of the \(k_j\)-th user to the m-th UAV over the n-th subcarrier, and \(z_{m,n}\) denotes the corresponding noise at the m-th UAV on the n-th subcarrier. For simplicity, we assume that the transmit symbols are independent and identically distributed complex Gaussian variables, denoted by \(s_{k_j,m,n} \sim \mathbb{C}\mathbb{N}(0,1)\), and that the noise has zero mean and variance \(\sigma ^2\). Thus, the signal-to-interference-plus-noise ratio (SINR) between the m-th UAV and the \(k_j\)-th user over the n-th subcarrier, \(\gamma _{k_j,m,n}\), \((\forall m \in \mathcal {M}, \forall k_j \in {J}_{m})\), can be given by [35]

$$\begin{aligned} \gamma _{k_j,m,n} = \frac{p_{k_j,m,n} h_{k_j,m}}{\sum \limits _{i_j \in {J}_{m} \backslash k_j} p_{i_j,m,n} h_{i_j,m} +\sum \limits _{u \in \mathcal M \backslash m}\sum \limits _{i_j \in {J}_{u}} p_{i_j,u,n} h_{i_j,u}+ \sigma ^2}. \end{aligned}$$
(6)

In practice, this work focuses on the communication between the ground users and the UAVs. It has been shown that, in the limit of infinite blocklength, reliable transmission with vanishing decoding error probability can be achieved [37]. In URLLC systems, the strict latency requirements impose limitations on the data size, so accurate coding rates cannot be characterized by Shannon’s channel capacity. Therefore, in URLLC communication networks, the achievable rate \(R_{k_j,m,n}\) between the m-th UAV and the \(k_j\)-th user over the n-th subcarrier must be approximated for a given error probability and finite blocklength, as shown in [7, 38]

$$\begin{aligned} R_{k_j,m,n} = \log _2(1 +\gamma _{k_j,m,n}) - \sqrt{\frac{V(\gamma _{k_j,m,n}) }{l_{k_j,m,n}} } \frac{\tilde{Q}^{-1}(\eta _{k_j,m}) }{\ln 2}, \end{aligned}$$
(7)

where \(\eta _{k_j,m}\), \((\forall m \in \mathcal {M}, \forall k_j \in {J}_{m})\) denotes the required decoding error probability between the m-th UAV and the \(k_j\)-th user, and \(l_{k_j,m,n}\) denotes the blocklength between the m-th UAV and the \(k_j\)-th user over the n-th subcarrier. It should be noted that the approximation provided in [7, 38] is highly accurate, provided that the blocklength is greater than or equal to 100. Furthermore, \(V(\gamma _{k_j,m,n}) = 1- 1/(1 + \gamma _{k_j,m,n})^2\) represents the channel dispersion, while \(\tilde{Q}^{-1}(x)\) is the inverse Gaussian Q-function with \(\tilde{Q}(x) = \frac{1}{\sqrt{2\pi } } \int _{x}^{\infty } e^{-t^2/2}dt\).
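For illustration, the following sketch evaluates the finite-blocklength rate approximation of Eq. (7), assuming the SINR is given in linear scale; SciPy's inverse survival function of the standard normal is used as \(\tilde{Q}^{-1}\). The numbers in the example call are illustrative only.

```python
# Minimal sketch of the finite-blocklength achievable rate in Eq. (7).
import numpy as np
from scipy.stats import norm

def fbl_rate(gamma, blocklength, error_prob):
    """Approximate achievable rate (bits per channel use) at finite blocklength."""
    dispersion = 1.0 - 1.0 / (1.0 + gamma) ** 2        # channel dispersion V(gamma)
    q_inv = norm.isf(error_prob)                       # inverse Gaussian Q-function
    penalty = np.sqrt(dispersion / blocklength) * q_inv / np.log(2.0)
    return np.log2(1.0 + gamma) - penalty              # Shannon term minus the FBL penalty

if __name__ == "__main__":
    print(fbl_rate(gamma=10.0, blocklength=200, error_prob=1e-5))
```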

Since the \(k_j\)-th user can occupy multiple subcarriers for communication, let \(\mathcal N_k\) denote the set of subcarrier indices occupied by the \(k_j\)-th user. As a result, in the UAV-assisted URLLC communication system, the communication rate between the m-th UAV and the \(k_j\)-th user can be represented as

$$\begin{aligned} R_{k_j,m} = \sum \limits _{n \in \mathcal {N}_k} R_{k_j,m,n}. \end{aligned}$$
(8)

In the UAV-assisted URLLC communication system, the transmission delay and the UAV’s total energy consumption are crucial performance indicators. To evaluate these metrics, we express the transmission delay between the m-th UAV and the \(k_j\)-th user over the n-th subcarrier as \(T_{k_j,m,n}=l_{k_j,m,n}T_{s}\) [14], where \(T_{s}\) is the symbol duration, equal to \(1 / W_\textrm{sc}\), and \(W_\textrm{sc}\) is the subcarrier spacing. Based on the transmission delay, the transmit energy consumption of the m-th UAV can be described as follows

$$\begin{aligned} E_{m}^\textrm{Tr}=\sum \limits _{k_j\in J_{m}}p_{k_j,m,n}T_{k_j,m,n}. \end{aligned}$$
(9)

The UAV’s hovering energy consumption, which is determined by its own mechanical characteristics and environmental factors, is commonly modeled as [3, 39]

$$\begin{aligned} E_{m}^\textrm{H} =\Big (\frac{\delta }{8}\rho G s_r \Omega ^3 R_z^3 +(1+k_z) \frac{Y^{3/2}}{\sqrt{2\rho s_r}}\Big )T_{k_j,m,n}, \end{aligned}$$
(10)

where \(\delta \) and \(\rho \) denote the profile drag coefficient and the air density, respectively. G and \(s_r\) represent the rotor solidity and the area of the rotor disc, and \(\Omega \) denotes the angular speed of the rotor. The parameters \(R_z\) and \(k_z\) denote the rotor radius and the incremental correction factor for induced power, respectively, and Y is the weight of the UAV. Thus, the total energy consumption of the m-th UAV includes the transmit energy consumption and the hovering energy consumption, which can be expressed as

$$\begin{aligned} E_{m} = E_{m}^\textrm{Tr}+E_{m}^\textrm{H}. \end{aligned}$$
(11)
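As a hedged illustration, the sketch below evaluates the energy model of Eqs. (9)–(11) for one UAV. The rotor-related default values are placeholders chosen for demonstration, not parameters taken from the paper, and the hover duration is passed as a single argument.

```python
# Minimal sketch of the UAV energy consumption model in Eqs. (9)-(11).
def transmit_energy(powers, blocklengths, symbol_duration):
    """Eq. (9): sum of p * l * Ts over the links served by one UAV."""
    return sum(p * l * symbol_duration for p, l in zip(powers, blocklengths))

def hover_energy(duration, delta=0.012, rho=1.225, G=0.05, s_r=0.503,
                 Omega=300.0, R_z=0.4, k_z=0.1, Y=20.0):
    """Eq. (10): blade-profile power plus induced power, times the hover time."""
    blade_profile = (delta / 8.0) * rho * G * s_r * Omega ** 3 * R_z ** 3
    induced = (1.0 + k_z) * Y ** 1.5 / (2.0 * rho * s_r) ** 0.5
    return (blade_profile + induced) * duration

def total_energy(powers, blocklengths, symbol_duration, hover_time):
    """Eq. (11): transmit plus hovering energy of one UAV."""
    return (transmit_energy(powers, blocklengths, symbol_duration)
            + hover_energy(hover_time))
```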

In the UAV-assisted URLLC communication system, the objective is to maximize the sum rate by jointly optimizing the power control and blocklength allocation over the subcarriers. As a result, the optimization problem can be formulated as follows:

$$\begin{aligned} \text {(P1):}&\mathop {\max }\limits _{ \{p_{k_j,m,n}, l_{k_j,m,n}\}} \sum \limits _{m=1}^{M}\sum \limits _{k_j\in {J}_{m}}R_{k_j,m} \end{aligned}$$
(12)
$$\begin{aligned} s.t.&\sum \limits _{k_j\in {J}_{m}} \sum \limits _{n=1}^\mathcal {N} l_{k_j,m,n} \le L_{\max }, \ m\in \mathcal {M},\end{aligned}$$
(13)
$$\begin{aligned}&0 \le p_{k_j,m,n} \le P_{\max }, n \in \mathcal {N}, m\in \mathcal {M}, \end{aligned}$$
(14)
$$\begin{aligned}&R_{k_j,m} \ge R_{\min }, k_j\in {J}_m, m\in \mathcal {M},\end{aligned}$$
(15)
$$\begin{aligned}&E_{m} \le E_{\max }, m\in \mathcal {M},\end{aligned}$$
(16)
$$\begin{aligned}&l_{k_j,m,n} \ge L_{\min }, n \in \mathcal {N}, m\in \mathcal {M}, \end{aligned}$$
(17)

where \(P_{\max }\), \(R_{\min }\), and \(E_{\max }\) represent the maximum transmit power, the minimum rate, and the maximum energy consumption, respectively, and \(L_{\min }\) and \(L_{\max }\) denote the minimum and maximum blocklengths. \(R_{\min }\) is given by \(D/L_{\max }\) [21], where D is the transmit data size of the ground user, and \(L_{\max }\) is related to the maximum transmit duration \(T_{\max }\) and the system bandwidth W, i.e., \(L_{\max } = WT_{\max }\) [40, 41]. This means that the data are transmitted under the latency constraint and the transmission must be completed within the maximum blocklength \(L_{\max }\).
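As a compact illustration of constraints (13)–(17), the helper below checks whether a candidate power and blocklength allocation for a single UAV is feasible. It assumes the per-user rates and the UAV energy have already been computed from Eqs. (7), (8), and (11); the limits passed in are illustrative.

```python
# Hedged sketch of a feasibility check for problem (P1), one UAV at a time.
def is_feasible(powers, blocklengths, rates, energy,
                P_max, L_max, L_min, R_min, E_max):
    """powers/blocklengths are per (user, subcarrier) pair; rates are per user."""
    blocklength_ok = (sum(blocklengths) <= L_max                  # constraint (13)
                      and all(l >= L_min for l in blocklengths))  # constraint (17)
    power_ok = all(0.0 <= p <= P_max for p in powers)             # constraint (14)
    rate_ok = all(r >= R_min for r in rates)                      # constraint (15)
    energy_ok = energy <= E_max                                   # constraint (16)
    return blocklength_ok and power_ok and rate_ok and energy_ok
```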

However, the sum rate is affected by the dynamic environment, and the objective function is non-convex. In addition, the coupling between the transmit power and the blocklength has a direct impact on the SINR and the rate, making it computationally complex to search for the optimal multi-user power and blocklength allocation in an unknown system. Hence, it is challenging to obtain the optimal solution using standard convex optimization methods. Thus, a reinforcement learning-based scheme is adopted to solve this non-convex problem and make intelligent decisions. The DRL algorithm is suitable for the complex and dynamic URLLC environment considered in this work, and the deep neural network can model the high-dimensional space of the optimization problem [33].

4 Proposed Blocklength Allocation and Power Control Scheme

Recently, DRL has become one of the most promising machine learning methods for solving resource allocation problems and enabling intelligence in wireless communication systems, since it is capable of making decisions by selecting potential actions based on stored experience. Unlike traditional reinforcement learning, it uses a deep neural network to learn instead of storing a massive number of values. The Markov decision process (MDP) is applied to model the reinforcement learning process. An MDP can be modeled by a tuple \(<\mathcal {S}, \mathcal {A}, R, \gamma>\) with the state space \(\mathcal {S}\), action space \(\mathcal {A}\), reward R, and discount factor \(\gamma \in [0,1]\) [42]. At step t, the agent selects the action \(a_t\) by interacting with the system environment to maximize the expected discounted return \(R_t = r_t + \sum \nolimits _{t'=t+1}^{T} \gamma ^{(t' - t) } r_{t'}\).

In this paper, a novel multi-agent reinforcement learning scheme is proposed to jointly optimize the blocklength allocation and power control in the UAV-assisted URLLC system with a high-dimensional action space. To obtain an effective reinforcement learning algorithm, it is essential to define the state space of the environment and the action space of the agents, and to design a suitable reward function that satisfies the system constraints while maximizing the objective of problem (P1). The proposed method has several benefits, such as the capacity to handle complicated, high-dimensional action spaces and to support cooperative decision-making among agents, which can produce more effective and efficient solutions.

4.1 State, Action, and Reward Function

The deep Q-network (DQN) is a DRL algorithm that combines reinforcement learning with a deep neural network to approximate the Q-value function, enabling more efficient and effective decision-making in complex environments. DQN has several advantages that make it an effective tool for solving non-convex optimization problems. First, it can handle high-dimensional action spaces, which is often important in complex systems. Second, it is particularly suited to problems whose models are unknown or poorly understood, since it does not require prior knowledge of the system dynamics or the objective function. Third, DQN can efficiently explore the action space to find optimal policies, which is essential in non-convex problems where the optimal solution is often difficult to determine. Finally, DQN is able to balance exploration and exploitation to ensure that it continually learns and improves, even as the system changes over time. The detailed DQN-based design framework for the URLLC communication system is shown in Fig. 2.

Fig. 2
figure 2

The framework of proposed DQN-based scheme

In this framework, we establish a targeted action space \(\mathcal {A}\) to denote the set of available actions. By weighting the blocklength and the communication rate, we design a segmented reward function \(r^t\) with rewards and penalties to incentivize the agent to select better actions. At the same time, using the experience data stored in the replay buffer, the estimate and target neural networks are updated in mini-batches to train the agent.

Agent: In our work, each subcarrier acts as an agent in the complex system. For example, at step t, the current agent independently decides the transmit power value \(a_{k_j,m,n}^{p,t}\) and the blocklength \(a_{k_j,m,n}^{l,t}\) based on the current state \(\textbf{s}_{k_j,m,n}^t\) and reward value \(r^t\) to satisfy the blocklength constraint and maximize the sum rate performance.

Action Space: In the UAV-assisted URLLC system, the selection of the optimization parameters is crucial for meeting the stringent requirements of low latency and high reliability [25]. Therefore, the action space includes the discrete transmit power values and blocklength values, and each agent can select the action \(\textbf{a}_{k_j,m,n}^t = \{a_{k_j,m,n}^{p,t}, a_{k_j,m,n}^{l,t}\} \in \mathcal {A} = \{ \mathcal {A}^p, \mathcal {A}^l \}\) in any state to reach the next state at time slot t. For simplicity, we assume that each agent has the same action space \(\mathcal {A}^p = \{0, \frac{P_{\max }}{L_p -1}, \frac{2 P_{\max }}{L_p-1},..., P_{\max } \}\) and \(\mathcal {A}^l = \{0, \frac{L_{\max }}{L_l -1}, \frac{2 L_{\max }}{L_l-1},..., L_{\max }\}\), where \(P_{\max }\) denotes the maximum transmit power, \(L_{\max }\) denotes the total system blocklength, and \(L_p\) and \(L_l\) denote the numbers of levels in the action spaces \(\mathcal {A}^p\) and \(\mathcal {A}^l\), respectively. Each agent independently selects the action \(a_{k_j,m,n}^{p,t} \in \mathcal {A}^p, a_{k_j,m,n}^{l,t} \in \mathcal {A}^l\) to maximize the reward value.
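A minimal sketch of this discretization is shown below; the grid sizes and the maximum values used in the example are assumptions for illustration, not the simulation settings of the paper.

```python
# Build the discrete joint action space: L_p power levels in [0, P_max]
# and L_l blocklength levels in [0, L_max], as described above.
import itertools

def build_action_space(P_max, L_max, L_p, L_l):
    power_levels = [i * P_max / (L_p - 1) for i in range(L_p)]
    block_levels = [i * L_max / (L_l - 1) for i in range(L_l)]
    # Each joint action is a (transmit power, blocklength) pair.
    return list(itertools.product(power_levels, block_levels))

actions = build_action_space(P_max=0.2, L_max=400, L_p=5, L_l=5)
print(len(actions))  # L_p * L_l joint actions per agent
```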

State Space: The state of the n-th subcarrier of the \(k_j\)-th user at step t consists of the desired power and the interference powers, i.e.,

$$\begin{aligned} \textbf{s}_{k_j,m,n}^t&= [p_{k_j,m,n}^t |g_{k_j,m,n}|^2, ..., p_{i_j,m,n}^t |g_{i_j,m,n}|^2, ...],\nonumber \\&\quad \quad i_j \in J_m, \end{aligned}$$
(18)

In the initial state, i.e., \(t=0\), each agent randomly selects the subcarrier power and blocklength according to constraints (13), (14), and (17). Based on the current state \(\textbf{s}_{k_j,m,n}^t\) and the action \(\textbf{a}_{k_j,m,n}^t\), the agent obtains the next state \(\textbf{s}_{k_j,m,n}^{t+1}\).

Reward Function: To maximize the communication rate of the \(k_j\)-th user and satisfy the blocklength constraint, the difference between the actually used blocklength and the system blocklength budget is modeled as

$$\begin{aligned} \phi = \sum \limits _{k=1}^K \sum \limits _{n=1}^N l_{k_j,m,n}-L_{\max }. \end{aligned}$$
(19)

Then, we take the energy consumption and the sum rate into account to design the reward function and improve the system performance by maximizing the reward value [25]. Thus, the reward function of each agent can be modeled as

$$\begin{aligned} r_{k_j,m,n}^t = \left\{ \begin{aligned} {\lambda _1}\phi -R_{k_j,m,n}^t + {\lambda _2}E_{m},&R_{k_j,m,n}^t \le R_{\min }, \phi> 0, \\ {\lambda _1}\phi +R_{k_j,m,n}^t + {\lambda _2}E_{m},&R_{k_j,m,n}^t \ge R_{\min }, \phi > 0, \\ -R_{k_j,m,n}^t + {\lambda _2}E_{m},&R_{k_j,m,n}^t < R_{\min }, \phi \le 0, \\ R_{k_j,m,n}^t + {\lambda _2}E_{m},&\textrm{otherwise}, \end{aligned} \right. \end{aligned}$$
(20)

where \({\lambda _1, \lambda _2} \in (0, 1)\) are the weighting parameters. It is observed that the agent receives a penalty when the blocklength constraints (13) and (17) are not satisfied, and obtains a larger reward when the used blocklength becomes smaller. Thus, the agent can select the potential action that maximizes the rate performance.
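The segmented reward can be transcribed directly into a small helper, as sketched below. It reads the blocklength condition as \(\phi > 0\), i.e., the blocklength budget of constraint (13) being exceeded, and the weight defaults are placeholders rather than the values used in the simulations.

```python
# Hedged transcription of the segmented reward function in Eq. (20),
# where phi is the blocklength surplus of Eq. (19).
def segmented_reward(rate, phi, energy, R_min, lambda1=0.5, lambda2=0.5):
    over_budget = phi > 0  # assumption: positive surplus means constraint (13) is violated
    if over_budget and rate <= R_min:
        return lambda1 * phi - rate + lambda2 * energy
    if over_budget and rate >= R_min:
        return lambda1 * phi + rate + lambda2 * energy
    if not over_budget and rate < R_min:
        return -rate + lambda2 * energy
    return rate + lambda2 * energy
```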

4.2 Proposed Deep Q-network Algorithm

Due to the high dimensionality of the action space, the DQN is adopted to solve the non-convex problem; it learns the policy with a neural network rather than storing the Q-values, because the continuous power and blocklength in the URLLC communication system need to be quantized into discrete levels. We then formulate the state-action function to characterize the influence of the selected action on the performance in a specific state. The computational complexity for all agents to calculate the reward is \(O(N \cdot |s_{k,n}|) \), with N denoting the number of agents. The complexity of action selection is mainly determined by the network structure of the DQN. The neural network structure of the DQN algorithm consists of a single network with 3 hidden layers and 3K hidden nodes in each layer. For the DQN network, the number of neurons in the m-th layer is \(U_m\), and the number of layers is M. Thus, the computational complexity of the DQN networks for all agents is \(O(K(|s_{k,n}| \cdot U_{2} + \sum _{m=3}^{M}(U_{m-1}U_{m}+U_{m}U_{m+1}+U_{M-1} \cdot |a_{k,n}|)))\) [43]. Given the control policy \(\xi \) for the n-th subcarrier, the Q-function is defined as [44]

$$\begin{aligned} Q^{\xi }(\textbf{s}_n^t, \textbf{a}_n^t) = E \left[ r_n(\textbf{s}_n^t, \textbf{a}_n^t) + \sum \limits _{j=1}^{\infty } \gamma ^j r_n(\textbf{s}_n^{t+j}, \textbf{a}_n^{t+j}) \right] , \end{aligned}$$
(21)

where \(\gamma \in [0,1]\) is the discount factor. When the discount factor \(\gamma =0\), the Q-function is determined only by the current reward, i.e., the agent selects the action depending solely on the current reward \(r_n(\textbf{s}_n^t, \textbf{a}_n^t)\). The optimal action that maximizes the rate performance in (P1) is \(\textbf{a}_n^{t,*} = \arg \mathop {\max }_{\textbf{a}_n^j \in \mathcal {A}} Q^{\xi }(\textbf{s}_n^t, \textbf{a}_n^j)\), obtained by searching the Q-values over the potential actions.

To derive the optimal control policy \(\xi ^{*}\), the Q-function can be updated as follows [35]

$$\begin{aligned} Q^{t+1}(\textbf{s}_n^t, \textbf{a}_n^t) =&Q(\textbf{s}_n^t, \textbf{a}_n^t) + \nu \Big ( r(\textbf{s}_n^t, \textbf{a}_n^t)\nonumber \\&+ \gamma \mathop {\max }_{\textbf{a}_n^j\in \mathcal {A}} Q(\textbf{s}_n^{t+1}, \textbf{a}_n^j) - Q(\textbf{s}_n^t, \textbf{a}_n^t) \Big ), \end{aligned}$$
(22)

where \(\nu \) denotes the learning rate. According to (22), each subcarrier updates the Q-function and learns the control strategy based on the stored Q-values, and then selects the action that maximizes the reward. To handle action selection with limited state-action information, an \(\epsilon \)-greedy strategy is adopted to explore the environment with exploration probability \(\epsilon \), which is written as

$$\begin{aligned} \textbf{a}_n^t = \left\{ \begin{aligned}&\text {random}(\mathcal {A}),\;\text { with \; probability} \; \epsilon , \\&\mathop {\arg \max }\limits _{\textbf{a}_n^j \in \mathcal {A}} Q^{\xi }(\textbf{s}_n^t,\textbf{a}_n^j), \; \text {with \; probability} \; 1- \epsilon . \end{aligned} \right. \end{aligned}$$
(23)
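The \(\epsilon \)-greedy rule of Eq. (23) amounts to a few lines of code, sketched below; `q_values` is assumed to hold the estimated Q-values of the current state for every action in the discrete action set.

```python
# Minimal sketch of the epsilon-greedy action selection in Eq. (23).
import random

def epsilon_greedy(q_values, epsilon):
    """Return the index of the chosen action."""
    if random.random() < epsilon:                  # explore with probability epsilon
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])  # otherwise exploit
```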

According to this strategy, the subcarrier takes a random action with probability \(\epsilon \) to explore the URLLC communication environment. Since the subcarrier’s unknown state space would otherwise require a large memory capacity and lead to a slow convergence rate, the deep neural network can intelligently extract features from the current data and reduce the computational complexity by predicting the output. According to the framework in Fig. 2, the tuple consisting of the state, action, reward, and next state serves as the input of the deep neural networks, which output the Q-values \( Q(\textbf{s}_n^{t}, \textbf{a}_n^t| \theta _t)\) and \( Q(\textbf{s}_n^{t+1}, \textbf{a}_n^j| \theta _t^-)\) of the estimate and target neural networks, respectively, where \(\theta _t\) and \( \theta _t^-\) denote the parameters of the estimate and target neural networks at the t-th training step. To guarantee stability, the target neural network is set to be an exact replica of the estimate neural network every \(N_\textrm{rep}\) steps. To acquire an optimal Q-function, it is crucial to adjust the neural network parameters \(\theta _t\) based on an appropriate loss function. The loss function is defined as follows [44]

$$\begin{aligned} \mathcal { L } (\theta _{t}) = |r(\textbf{s}_n^t, \textbf{a}_n^t) \! + \! \gamma \mathop {\max }_{\textbf{a}_n^j \in \mathcal {A}} Q'(\textbf{s}_n^{t+1}, \textbf{a}_n^j|\theta _{t}^-)- Q(\textbf{s}_n^t, \textbf{a}_n^t| \theta _{t})|^2. \end{aligned}$$
(24)

Most optimizers, such as the gradient descent method, may be used to determine the optimal neural network parameters based on the loss function and the training data set. The deep neural network must be trained using the training data. The techniques of experience replay and random sampling are used to reduce the dependence among training data. The proposed DQN algorithm uses an experience replay memory of size \(N_\textrm{mem}\) to save the tuples of the reinforcement learning process and updates the data every \(N_\textrm{tr}\) steps, which keeps the training data fresh. The experience data are randomly sampled from the replay memory to form a batch, which smooths the transitions between the historical data and the fresh observations. Algorithm 1 displays the proposed DQN-based scheme for the URLLC communication system.

Algorithm 1
figure a

Multi-agent power control and blocklength allocation scheme based on deep Q-network for URLLC communication system
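The following is a compressed sketch of the training loop summarized in Algorithm 1, combining the \(\epsilon \)-greedy rule of Eq. (23), the replay buffer, and the target of the loss in Eq. (24). The `env`, `q_net`, and `target_net` objects are assumptions: the environment is expected to return NumPy state vectors and a scalar reward, and the networks follow the Keras `predict`/`train_on_batch`/`get_weights` interface. For readability a single agent is shown, whereas the paper runs one such agent per subcarrier, and the hyperparameters are illustrative rather than those of the paper's simulations.

```python
# Hedged sketch of the DQN training loop of Algorithm 1 (single agent shown).
import random
from collections import deque

import numpy as np

def train(env, q_net, target_net, actions, episodes=500, steps=50,
          gamma=0.9, epsilon=0.1, batch_size=32, n_rep=100, buffer_size=10000):
    replay = deque(maxlen=buffer_size)   # experience replay memory
    step_count = 0
    for _ in range(episodes):
        state = env.reset()
        for _ in range(steps):
            # Epsilon-greedy selection over the discrete (power, blocklength) grid.
            if random.random() < epsilon:
                a_idx = random.randrange(len(actions))
            else:
                a_idx = int(np.argmax(q_net.predict(state[None, :], verbose=0)))
            next_state, reward = env.step(actions[a_idx])
            replay.append((state, a_idx, reward, next_state))
            state = next_state
            step_count += 1

            if len(replay) >= batch_size:
                # Random mini-batch sampling from the replay memory.
                batch = random.sample(replay, batch_size)
                s = np.stack([b[0] for b in batch])
                s_next = np.stack([b[3] for b in batch])
                q = q_net.predict(s, verbose=0)
                q_next = target_net.predict(s_next, verbose=0)
                for i, (_, a, r, _) in enumerate(batch):
                    # Target of Eq. (24): r + gamma * max_a' Q'(s', a').
                    q[i, a] = r + gamma * np.max(q_next[i])
                q_net.train_on_batch(s, q)

            if step_count % n_rep == 0:
                # Copy the estimate network into the target network every N_rep steps.
                target_net.set_weights(q_net.get_weights())
```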

5 Simulation Results

In this section, simulation results validate the effectiveness and convergence of the proposed multi-agent DQN-based algorithm for power control and blocklength allocation, and we analyze the influence of different parameters on the rate performance of the UAV-assisted URLLC system. Python 3.6 and TensorFlow 1.13 are used to train the deep neural network. The simulation parameters are presented in Table 1. The K users are randomly placed in a 200-meter-diameter circle centered on the UAV.

Table 1 Simulation parameters for UAV-assisted URLLC communication system

The UAVs are randomly distributed within the BS service area with a radius of \(R_s\). Unless otherwise stated, the number of UAVs is set to \(\mathcal {M}\)=2 and the number of ground users is K=2.

We design the neural network of the DQN-based algorithm with one input layer, three hidden layers, and one output layer, and the parameters of the neural network are optimized by the gradient descent method. The training begins after \(N_\textrm{bat}\) steps to ensure an efficiently sized batch of data. Furthermore, to verify that the proposed DQN-based algorithm is suitable for solving the optimization problem in this work, it is compared with the traditional Q-learning algorithm and the greedy scheme. The Q-learning algorithm belongs to the class of value-based reinforcement learning methods; it builds a Q-table to store the Q-values of states and actions, and then selects the action that maximizes the benefit. The greedy algorithm is a well-known baseline that selects the locally optimal choice in the current state.
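A hedged Keras sketch of the Q-network described above is given below: one input layer, three hidden layers with 3K nodes each as stated in Section 4.2, and an output layer producing one Q-value per joint action. The ReLU activation and the plain SGD optimizer are assumptions consistent with, but not specified by, the text.

```python
# Illustrative Q-network definition matching the described architecture.
import tensorflow as tf

def build_q_network(state_dim, num_actions, K=2):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(3 * K, activation="relu", input_shape=(state_dim,)),
        tf.keras.layers.Dense(3 * K, activation="relu"),
        tf.keras.layers.Dense(3 * K, activation="relu"),
        tf.keras.layers.Dense(num_actions, activation="linear"),  # one Q-value per action
    ])
    # Squared TD error of Eq. (24), minimized by gradient descent.
    model.compile(optimizer="sgd", loss="mse")
    return model
```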

Fig. 3
figure 3

The sum rate under different training episodes

Figure 3 compares the rate performance of the proposed DQN-based scheme, the Q-learning scheme, and the greedy scheme over different training episodes. It is observed that the proposed DQN-based scheme significantly outperforms the benchmark schemes in terms of rate performance. The state space, encompassing the desired power and interference power, is vast due to the dynamic characteristics of wireless networks. The proposed DQN-based algorithm intelligently allocates resources based on the current state, leading to a substantial improvement in rate performance. The rate performance of the proposed scheme and the Q-learning scheme converges to the optimal value as the number of training episodes increases. However, the proposed scheme exhibits significantly faster and more stable convergence than the Q-learning-based scheme. The Q-learning algorithm maintains a large Q-table in wireless communication environments, which slows down lookups and limits its ability to cover the state space effectively. Furthermore, the proposed DQN-based scheme outperforms the Q-learning scheme by 7.43%. The proposed scheme adopts experience replay and random mini-batch sampling during the learning process, allowing the agent to adjust quickly and efficiently to the dynamic environment. The experience replay effectively enhances the training efficiency and stability of the intelligent agent, thereby reducing instability and convergence difficulties during the training episodes. These results demonstrate the ability of the proposed DQN scheme to address this optimization problem and its adaptability to the dynamic characteristics of wireless networks.

The mini-slot concept is introduced by the standardization bodies to enable URLLC applications by reducing the transmission time interval [45]. Additionally, NR Release 15 provides a scalable numerology with subcarrier spacings of 15 kHz, 30 kHz, and 60 kHz below 6 GHz, and 120 kHz or 240 kHz above 6 GHz [46]. Figure 4 illustrates the impact of the subcarrier spacing and the error probability on the sum rate performance of the proposed DQN-based scheme. The maximum sum rate increases as the subcarrier spacing increases from 30 to 60 kHz, because the proposed algorithm allocates a longer blocklength as the transmit bandwidth W increases with the subcarrier spacing. Furthermore, the rate performance decreases as the required error probability \(\eta \) decreases, which is consistent with the rate expression (7). Regarding the learning rate, the results show that with a learning rate of 0.0001 the proposed algorithm converges quickly but to a poorer value, whereas with a learning rate of 0.1 the DQN algorithm converges faster to a local optimum because the larger learning rate makes the parameters update faster. With a learning rate of 0.01, the smaller learning rate reduces the oscillation and instability during training, but the result is still not optimal for this communication system. The simulation results show that a learning rate of 0.001 efficiently guarantees the stability and convergence of the training. In general, increasing the learning rate may accelerate convergence but can lead to unstable training results; this effect is examined further in Fig. 6.

Fig. 4
figure 4

The sum rate versus the subcarrier spacing \(W_\textrm{sc} \) and the error probability \(\eta \)

Fig. 5
figure 5

The sum rate versus the maximum transmit power

Figure 5 illustrates the impact of the maximum transmit power on the sum rate of the proposed scheme and the benchmark schemes. By adjusting the maximum transmit power of the users, it is possible to effectively alter the overall system rate. The results demonstrate a stable increase in system rate as the maximum transmit power increases. This implies that increasing the maximum transmit power can enhance the data transfer rate of the system, thereby boosting network performance and efficiency. In the proposed DQN scheme, intelligent decision-making enables the system to better utilize the increased transmit power, further enhancing the overall rate. The results reveal that the proposed DQN scheme improves the total rate by 37.19% compared to the traditional greedy algorithm and by 13.05% compared to the Q-learning algorithm. This underscores the effectiveness and superiority of DQN in optimizing system performance. Thus, by appropriately adjusting the maximum transmit power and integrating intelligent algorithms, it is possible to maximize the data transfer rate and performance of the system.

Fig. 6
figure 6

The sum rate versus the learning rate \(\nu =[0.1,0.01,0.001,0.03,0.003,0.0001]\)

Figure 6 investigates the impact of the learning rate on the rate of the proposed DQN-based scheme. This figure verifies the influence of the learning rate on the convergence speed and converged value under different allocation policies. It can be seen that when the learning rate is \(\nu \)=0.0001, the proposed algorithm converges quickly but the obtained value is far from that of the optimal allocation policy. Moreover, \(\nu \)=0.1 and \(\nu \)=0.01 achieve a good convergence speed, but the final converged values are not optimal compared with \(\nu \)=0.001. When the learning rate is \(\nu \)=0.03 or \(\nu \)=0.003, the algorithm displays inferior convergence behavior in terms of both speed and efficiency. Therefore, selecting the optimal learning rate for the communication environment allows the proposed scheme to achieve stable performance and obtain the best value for the current environment.

Fig. 7
figure 7

The effect of number of users K on the average rate of the proposed scheme with fixed \(\mathcal M\)

Figure 7 illustrates the impact of the number of users on the rate performance of the proposed scheme. In the simulation environment, each user is allocated \(N_\textrm{a}=2\) subcarriers, where \(N_\textrm{a}\) represents the number of subcarriers allocated to each user, which means users share the subcarriers. The results indicate that as the ratio of the number of UAVs to the number of users increases, the average rate of each subcarrier assigned to each user also increases. Specifically, the average subcarrier rate with a ratio of \(\frac{\mathcal {M}}{K}=1/2\) shows a rate performance gain of up to \(6.28\%\) compared to the ratio of \(\frac{\mathcal {M}}{K}=1/4\). This gain is explained by the increase in interference power as the number of users covered by a single UAV grows. First, the communication interference increases with the number of users, thus reducing the reliability and rate performance of the communication. Second, the increase in the number of users within the UAV's communication range also affects the coverage quality of the system. When the number of users increases, the UAVs need more communication resources to satisfy the users' demands, which may lead to a shortage of communication resources and reduced coverage, thereby degrading the communication quality of the system.

Fig. 8
figure 8

The sum rate versus the number of subcarriers \(N_a\)

Fig. 8 investigates the sum rate versus the number of allocated subcarriers. It can be observed that the system sum rate decreases as \(N_a\) increases, because the inter-group interference is considered in the SINR expression (6), so different users occupying the same subcarrier interfere with each other. When \(N_a=2\), two subcarriers are allocated to each user, which yields a higher SINR per user compared to \(N_a=3\) or \(N_a=4\).

6 Conclusion

In this paper, a joint blocklength allocation and power control scheme based on the multi-agent DQN algorithm was proposed to maximize the rate performance of the UAV-assisted URLLC communication system. The optimization problem is difficult to solve due to its non-convexity and the UAVs’ energy constraints. Therefore, the non-convex optimization problem was decomposed into a multi-agent reinforcement learning process, in which each subcarrier acts as an agent that intelligently determines its own transmit power and blocklength. Based on the blocklength, the communication rate, and the energy consumption of the UAVs, the proposed DQN-based scheme constructs the reward function. Furthermore, this work investigated the influence of the learning rate, the error probability, and the subcarrier spacing on the system performance. The simulation results show that the proposed scheme outperforms the benchmark schemes in terms of effectiveness and convergence. In the future, exploring intelligent trajectory design is an interesting research direction in UAV-assisted URLLC communication systems.

7 Future Work

In future work, the dynamic characteristics of UAVs can be studied by considering the flying trajectory, dynamic deployment, and the wireless channel in the UAV-assisted URLLC system. Building on the proposed method, it is of interest to design intelligent resource allocation methods with high-dimensional action spaces and practical reward functions tailored to the optimization targets and constraints.

Coordinated UAVs can enhance the rate and reliability performance and reduce the latency over a wide coverage area. However, it is difficult to optimize the resources and trajectories of large-scale UAV networks. On the one hand, the complex interference consists of UAV-to-UAV links and UAV-to-user communication links, which limits the power and blocklength optimization required to satisfy the rate and reliability targets. On the other hand, large-scale UAV networks require a large amount of computation due to the increasing number of UAVs and the high-dimensional action space. Using advanced algorithms such as DRL and distributed optimization, an intelligent method can be designed to autonomously adjust the large-scale UAV network.