1 Introduction

The screw conveyor is a crucial component of the shield machine, as it plays a vital role in transporting soil out of the sealed cabin. Because the screw conveyor speed and the sealed cabin pressure are nonlinearly coupled, regulating this speed is of utmost importance in shield construction. Accurate control of the screw conveyor speed avoids the risk of an unstable excavation surface and ensures that tunneling proceeds safely and effectively. Figure 1 is a schematic diagram of earth pressure balance. At present, however, the manual adjustment control mode cannot regulate the tunneling control parameters in a timely manner, which increases the risk of unbalanced excavation surface pressure and of safety accidents. In addition, the load on the excavation surface is unevenly distributed, the construction environment is harsh, and the geological conditions are complex and change frequently, which further complicates construction and increases its risks. Consequently, studying the intelligent control of screw conveyor speed is critical to ensuring the safe tunneling of shield machines (Liu and Shao 2010).

Fig. 1 Earth pressure balance diagram

In the field of shield technology, data-driven machine learning methods have gained popularity for accurate prediction and control of the earth pressure in the sealed cabin (Chen et al. 2023; Huang et al. 2022; Mashimo 2002). Liu et al. (2022a) developed a prediction model of shield screw conveyor speed based on a convolutional neural network-gated recurrent unit (CNN-GRU) to achieve real-time control of the sealed cabin earth pressure during construction. To further improve control accuracy, a dynamic fuzzy neural network (D-FNN) control system was established (Liu et al. 2021b). Qin et al. (2021) developed a CNN-LSTM cutterhead torque prediction model. To improve accuracy further, cutterhead torque data were decomposed using multiscale wavelet packet decomposition (MWPD) and intrinsic discrete variational mode decomposition (IDVMD), a GRU was used for multi-step prediction of the cutterhead torque, and the sealed cabin earth pressure control was further optimized (Qin et al. 2022). Liu et al. (2021a) established a data-driven particle swarm optimization-least squares support vector machine (PSO-LSSVM) prediction model of the sealed cabin earth pressure; to further improve predictive accuracy, a sealed cabin pressure prediction model based on hybrid deep learning was proposed (Liu et al. 2022b). Under the premise of safe tunneling, Yu et al. (2022) modeled the tunneling parameters with support vector regression and solved the model with the PSO algorithm to improve tunneling efficiency. Wu et al. (2024) established a cutterhead performance prediction model based on a deep residual network. Wang et al. (2024) established an improved machine learning model of soil parameters to maximize tunneling efficiency. Using a separate-calculation method, Li et al. (2024) obtained a mechanism model of the factors affecting screw conveyor torque, which improved screw conveyor performance. Lv et al. (2024) established a screw conveyor performance model based on a double-layer progressive architecture, which improved the working efficiency of the screw conveyor. Shi et al. (2024) and Qin et al. (2024) established cutterhead torque prediction models based on a decomposition algorithm, a hybrid transfer model, and a residual convolution network. Hu et al. (2024) established a segment control model based on machine learning to ensure tunnel safety. Based on a hybrid deep learning algorithm, Jin et al. (2024) established a shield machine propulsion speed prediction model to accurately control the sealed cabin pressure. Furthermore, several data-driven optimization control methods (Yang et al. 2022; Ye et al. 2022, 2023) were introduced to address the surface subsidence caused by pressure imbalance, significantly enhancing tunneling safety. However, the aforementioned data-driven deep learning methods require a large amount of labeled data for network training, which is time- and cost-intensive. Because the training data must remain current, the trained models generalize poorly and lack real-time performance. Additionally, the uncertainty of soil geological conditions limits the reliability of the sample data, which makes the training results unstable. Moreover, when dealing with complex shield engineering, deep learning methods cannot dynamically adjust the tunneling parameters to adapt to rapid environmental changes, because they are unable to interact with the environment. These limitations indicate that the use of deep learning for intelligent control is constrained.

Deep reinforcement learning (DRL) can overcome these limitations of deep learning: it has excellent self-learning and environment-interaction abilities. DRL has been widely applied to intelligent control and has achieved notable success in several shield machine applications. Zhang et al. (2020) proposed a hybrid model integrating DQN-PSO and an extreme learning machine (ELM) to accurately predict the surface response. Soranzo et al. (2023) established a Deep Q-learning (DQN) model that incorporates soil parameters and selected tunneling parameters to forecast the support pressure; this method also shows strong geological adaptability and can quickly adapt to changes in geological conditions. Elbaz et al. (2023) proposed a prediction model based on DQN-PSO and ELM that predicts the thrust and cutterhead torque of the shield machine from the tunneling and geological parameters, further improving the performance of the drive system. Xu et al. (2023) integrated Iterative Deepening and Soft Actor-Critic to establish a shield machine attitude correction model and found the best correction strategy by adaptively interacting with the environment, attaining intelligent correction of the shield attitude. These studies demonstrate that deep reinforcement learning holds promise in shield technology owing to its remarkable perception of, and interaction with, the environment, which can address the challenges of intelligent control of shield machines. On this basis, this paper proposes a novel method that applies the DDPG deep reinforcement learning algorithm to the intelligent control of the shield machine's screw conveyor speed.

This paper makes the following contributions: (1) It proposes an intelligent control method for screw conveyor speed based on deep reinforcement learning, which effectively improves geological adaptability. (2) The approach enables the shield machine to continuously interact with the geological environment, adjust the screw conveyor speed dynamically, and achieve precise control of the sealed cabin earth pressure, thereby reducing manual intervention and errors, improving the level of automation, and optimizing the tunneling process. (3) The method promotes the cross-integration of artificial intelligence, mechanical engineering, civil engineering, and other disciplines, and advances the application and development of machine learning in intelligent equipment and engineering. (4) The method controls the sealed cabin pressure well under various geological conditions and can complete a variety of soil transition tasks; it has strong soil adaptability and responds well to dynamic changes in soil conditions.

This paper is structured into six parts. Section 1 introduces the research topic. Section 2 covers the basic theory. Section 3 introduces the intelligent control scheme. Section 4 provides a detailed account of the model's training process. In Sect. 5, the method is evaluated and verified from multiple aspects to demonstrate its effectiveness. Finally, Sect. 6 concludes and summarizes the article.

2 Basic theory

Reinforcement learning differs from classical machine learning methods in that it learns an optimal strategy for controlling an objective. It relies on experience gathered in a given environment and conducts iterative training in which its outputs are rewarded or penalized. In contrast to the traditional input-output approach, reinforcement learning training is based on a reward-penalty mechanism that evaluates the environmental changes resulting from a set of actions and adjusts the model by updating its parameters.

2.1 Reinforcement learning

Reinforcement learning is a learning framework that consists of two fundamental elements: the agent and the environment. In this framework, the agent interacts with the environment to gather information and continuously improves its decisions to maximize the cumulative reward. At each time step, the agent selects an action and is rewarded or penalized based on the resulting state; it then uses this feedback to modify its next action in subsequent interactions with the environment. Figure 2 illustrates the control flow of the reinforcement learning process. At each time step, the agent obtains information about the current state, denoted \({S}_{t}\), from the environment. Based on this state, the agent selects an action, denoted \({a}_{t}\), and interacts with the environment. The environment provides feedback in the form of a reward or punishment signal, which indicates whether the action taken was beneficial. This feedback enables the agent to evaluate its actions and learn to make better decisions in the future. Finally, the environment transitions to a new state, \({S}_{t+1}\), which becomes the current state for the next time step. This iterative process of selecting actions based on states and receiving feedback continues until the agent achieves its learning objective.
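
To make the loop in Fig. 2 concrete, the following minimal Python sketch shows the state-action-reward cycle described above. The `Environment` and `Agent` interfaces, their method names, and the episode length are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch of the reinforcement learning interaction loop described above.
# `Environment` and `Agent` are hypothetical placeholders, not the paper's code.

class Environment:
    def reset(self):
        """Return the initial state S_0."""
        raise NotImplementedError

    def step(self, action):
        """Apply the action; return (next_state, reward, done)."""
        raise NotImplementedError


class Agent:
    def act(self, state):
        """Select an action a_t for the current state S_t."""
        raise NotImplementedError

    def learn(self, state, action, reward, next_state):
        """Use the feedback to improve future decisions."""
        raise NotImplementedError


def run_episode(env: Environment, agent: Agent, max_steps: int = 1000) -> float:
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)                       # choose a_t from S_t
        next_state, reward, done = env.step(action)     # environment feedback
        agent.learn(state, action, reward, next_state)  # adjust the policy
        total_reward += reward
        state = next_state                              # S_{t+1} becomes the current state
        if done:
            break
    return total_reward
```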

Fig. 2 Reinforcement learning control scheme

2.2 Deep deterministic policy gradient

Deep Deterministic Policy Gradient (DDPG) is a reinforcement learning algorithm designed for continuous control tasks. Unlike traditional policy gradient methods, DDPG uses a deterministic approach in which the Actor network directly outputs the action value instead of a probability distribution. This improves the search efficiency of the algorithm when exploring the space of optimal solutions. By using a deterministic strategy, DDPG provides a more stable and robust learning experience, enabling it to solve complex continuous control tasks more effectively than other traditional reinforcement learning algorithms.

The DDPG algorithm is shown in Fig. 3. The agent observes the state information \(S\) of the environment and uses an Actor-Critic architecture built on deep neural networks to select an action \(A\) and obtain the evaluation value \(Q\). The Target Critic network is used to evaluate the next action, yielding the evaluation \(Q^{\prime}\), which together with the reward \(R\) forms the target expected value. The deviation between the target value and the evaluation value is calculated as \({\text{TD}} - {\text{error}} = R + \gamma Q^{\prime} - Q\). To enhance stability and training efficiency, the DDPG algorithm also employs experience replay and target network techniques. Experience replay stores the experiences acquired through the interaction between the agent and the environment and randomly samples a number of experiences for network training, which removes the correlation between sample data and prevents it from affecting network training. Furthermore, the DDPG algorithm uses the Target Critic network to estimate the target value and delays its parameter updates to reduce instability and oscillation during training.
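
As a small worked example, with values assumed purely for illustration and not taken from the paper, setting \(R=10\), \(\gamma =0.99\), \(Q^{\prime}=50\), and \(Q=55\) gives

$${\text{TD}} - {\text{error}} = R + \gamma Q^{\prime} - Q = 10 + 0.99\times 50 - 55 = 4.5$$

so the Critic's estimate would be nudged upward toward the target expected value.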

Fig. 3 Flowchart of DDPG algorithm

3 Intelligent control scheme design

This paper introduces an intelligent control methodology for the shield screw conveyor based on the DDPG algorithm. As depicted in Fig. 4, the control scheme comprises three components: (1) the target pressure is calculated according to current soil mechanics theory, taking into account the influence of the current soil parameters; (2) a DRL training environment is constructed; and (3) an intelligent control model for the screw conveyor is built on the basis of the DDPG algorithm. The intelligent control model observes the environment and obtains a set of empirical data \(\left[{S}_{t},{a}_{t},{r}_{t},{S}_{t+1}\right]\). These data are sent to the Actor and Critic networks for processing, where the Actor network performs supervised learning and generates a new strategic action. The whole process is iterated continuously: when the output pressure of the shield machine executing the intelligent strategy does not equal the target pressure, the model is retrained; otherwise, the intelligent control model outputs the best screw conveyor speed.

Fig. 4 Intelligent control scheme of shield screw conveyor

3.1 Calculate the target pressure value

The general principle for setting the target pressure during shield tunnel construction is mainly based on factors such as the underground water pressure, the active earth pressure, the passive earth pressure, and the reserve pressure (Zhu 2017).

The groundwater pressure acts in front of the cutterhead and behind the shield tail, as shown in Eqs. (1) and (2).

$$\sigma_{a} = q \times \gamma \times h$$
(1)
$${\sigma }_{b}={q}_{0}\times \gamma \times {h}_{0}$$
(2)

where \({\sigma }_{a}\) is the water pressure in front of the cutterhead; \(q\) represents the soil permeability; \(h\) is the distance between the groundwater surface and the top of the cutterhead; \({\sigma }_{b}\) is the water pressure behind the shield tail; \({q}_{0}\) represents the permeability coefficient of the grout and its water-cement ratio; \({h}_{0}\) is the difference in water level between the grouting point and the top of the cutterhead; and \(\gamma\) represents the unit weight of water.

Based on the principles of advanced soil mechanics and the current soil parameter information, the active earth pressure and passive earth pressure can be calculated using Eqs. (3) and (4), respectively (Liu 2014).

$${\sigma }_{c}=\vartheta z{K}_{c}-2c\sqrt{{K}_{c}}$$
(3)
$${\sigma }_{d}=\vartheta z{K}_{d}+2c\sqrt{{K}_{d}}$$
(4)
$${K}_{c}={\tan}^{2}\left(\frac{\pi }{4}-\frac{\phi }{2}\right)$$
(5)
$${K}_{d}={\tan}^{2}\left(\frac{\pi }{4}+\frac{\phi }{2}\right)$$
(6)

where \({\sigma }_{c}\) represents the active earth pressure; \(\vartheta\) is the effective unit weight of the soil; \(z\) is the depth of the soil; \(c\) is the cohesion of the overburden soil; \({K}_{c}\) represents the coefficient of active earth pressure; \({\sigma }_{d}\) is the passive earth pressure; \({K}_{d}\) represents the coefficient of passive earth pressure; and \(\phi\) is the internal friction angle.

Based on this, and according to the theory of soil mechanics, the setting interval of the target pressure is:

$${P}_{max}={\left|{\sigma }_{b}-{\sigma }_{a}\right|}_{max}+{\sigma }_{d}+{\sigma }_{i}$$
(7)
$${P}_{min}={\left|{\sigma }_{b}-{\sigma }_{a}\right|}_{max}+{\sigma }_{c}$$
(8)

Here, \({\sigma }_{c}\) represents the active earth pressure; \({\sigma }_{d}\) is the passive earth pressure; and \({\sigma }_{i}\) is the reserve pressure, which is generally taken as 10–20 kPa.
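
As an illustration of Eqs. (1)-(8), the following Python sketch computes the target pressure interval. The function name, argument names, default reserve pressure, and assumed units are illustrative only; the paper does not provide code for this step.

```python
import math


def target_pressure_interval(q, gamma_w, h, q0, h0,
                             theta, z, c, phi_deg, sigma_i=15.0):
    """Sketch of the target pressure interval from Eqs. (1)-(8).
    Assumed units: pressures in kPa, lengths in m, unit weights in kN/m^3,
    internal friction angle in degrees; sigma_i is the reserve pressure (10-20 kPa)."""
    phi = math.radians(phi_deg)

    # Eqs. (1)-(2): water pressure in front of the cutterhead and behind the shield tail
    sigma_a = q * gamma_w * h
    sigma_b = q0 * gamma_w * h0

    # Eqs. (5)-(6): active and passive earth pressure coefficients
    K_c = math.tan(math.pi / 4 - phi / 2) ** 2
    K_d = math.tan(math.pi / 4 + phi / 2) ** 2

    # Eqs. (3)-(4): active and passive earth pressures (Rankine)
    sigma_c = theta * z * K_c - 2 * c * math.sqrt(K_c)
    sigma_d = theta * z * K_d + 2 * c * math.sqrt(K_d)

    # Eqs. (7)-(8): |sigma_b - sigma_a|_max is taken here as the absolute difference
    water_term = abs(sigma_b - sigma_a)
    p_max = water_term + sigma_d + sigma_i
    p_min = water_term + sigma_c
    return p_min, p_max
```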

3.2 Establishment of deep reinforcement learning environment model

Considering the significant nonlinear coupling between the screw conveyor speed and the earth pressure inside the sealed cabin, a mechanism model relating the two is used to create the training environment for DRL. DRL training is performed against this environment model, and the strategy and value estimates are adjusted by evaluating the rewards of taking different actions. Based on the equivalence between the amount of soil excavated and the amount discharged, the correlation between the sealed cabin pressure and the screw conveyor speed is derived.

The rate at which soil enters the sealed cabin can be expressed as:

$${Q}_{0}=\pi {R}^{2}V$$
(9)

The rate at which soil is discharged by the screw conveyor can be expressed as:

$${Q}_{1}=\eta \pi AT{n}_{0}$$
(10)

Among them, \({Q}_{0}\) represents the soil flow rate entering the sealed cabin; \(R\) is the radius of the cutterhead; \(V\) denotes the advance speed; \({Q}_{1}\) is the soil discharge rate per unit time; \(\eta\) represents the discharge efficiency; \(A\) is the effective cross-sectional area; \(T\) is the blade pitch of the screw conveyor; and \({n}_{0}\) represents the screw conveyor speed.

On the basis of the principle that the amounts of soil entering and exiting are equal, the continuity equation for sediment flow within the sealed cabin can be derived as follows:

$${Q}_{0}={Q}_{1}+{C}_{ep}\left(P-{P}_{0}\right)+\frac{{V}_{e}}{{\beta }_{e}}\frac{dP}{dt}$$
(11)

Taking into account that \({C}_{ep}{P}_{0}\approx 0\), the sealed cabin pressure mechanism model can be derived from Eqs. (9), (10), and (11) as follows:

$$\pi {R}^{2}V-\eta \pi AT{n}_{0}={C}_{ep}P+\frac{{V}_{e}}{{\beta }_{e}}\frac{{\text{d}}P}{{\text{d}}t}$$
(12)

where \({C}_{ep}\) is the leakage coefficient outside the sealed cabin; \(P\) is the sealed cabin pressure; \({P}_{0}\) represents the soil mass pressure outside the sealed cabin due to leakage; \({V}_{e}\) is the sealed cabin volume; and \({\beta }_{e}\) represents the effective compression coefficient.

The model in Eq. (12) mainly involves two tunnelling parameters: the shield advance speed and the screw conveyor speed. The derived mechanism model is built in MATLAB-Simulink and used as the deep reinforcement learning training environment; it simulates the changes in the sealed cabin pressure of the tunnelling machine as the screw conveyor speed is varied.
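
The environment model itself is built in MATLAB-Simulink in this paper. Purely for illustration, the following Python sketch integrates Eq. (12) with a simple explicit Euler scheme to show how the sealed cabin pressure would respond to a given screw conveyor speed. The function name, the time step, and any parameter values passed in are assumptions.

```python
import math


def simulate_cabin_pressure(n0, V, R, eta, A, T,
                            C_ep, V_e, beta_e,
                            P_init=0.0, dt=0.1, t_end=100.0):
    """Euler-integration sketch of the sealed cabin pressure model in Eq. (12):
        pi*R^2*V - eta*pi*A*T*n0 = C_ep*P + (V_e/beta_e)*dP/dt
    The integration scheme and all numerical settings are illustrative; the paper
    implements this model in MATLAB-Simulink."""
    P = P_init
    history = []
    steps = int(t_end / dt)
    for k in range(steps):
        inflow = math.pi * R ** 2 * V           # Eq. (9): soil entering the cabin
        outflow = eta * math.pi * A * T * n0    # Eq. (10): soil discharged by the screw conveyor
        dP_dt = (beta_e / V_e) * (inflow - outflow - C_ep * P)
        P += dP_dt * dt
        history.append((k * dt, P))
    return history
```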

3.3 Intelligent control strategy based on environmental interaction

The DDPG agent model adjusts the screw conveyor speed and interacts with the environment to continuously collect training data. These data are analyzed to determine the underlying relationship between states and reward values, and the model trains itself to find the optimal strategy and value estimation function. After multiple rounds of iterative training, the best deterministic control strategy is obtained, enabling intelligent control of the screw conveyor and completion of complex tunnelling tasks. The DDPG agent model uses four deep neural networks and is based on the Actor-Critic framework; the specific structure is shown in Figs. 5 and 6.

Fig. 5 Policy network structure

Fig. 6 Value network structure

The Actor network, also referred to as the policy network, outputs a deterministic policy that maps states to actions. As shown in Fig. 5, the policy network takes the current sealed cabin pressure \({P}_{t}\), the target pressure \(P\), and the absolute error \({P}_{e}\) between the output and target pressures as its input state set. This input is processed through two hidden layers, each with three neuron nodes, before the output layer determines the policy, i.e., the screw conveyor speed \({n}_{0}\).

The Critic network, also referred to as the value network, estimates the \(Q\) value of state-action pairs under the current policy, representing the predicted return. The value network takes the current sealed cabin pressure \({P}_{t}\), the target pressure \(P\), the absolute error \({P}_{e}\), and the policy action, i.e., the screw conveyor speed \({n}_{0}\), as its input set. The value network includes three hidden layers with 50, 25, and 50 neuron nodes, respectively, and outputs the evaluation \(Q\) of the current state-action pair.
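
The paper builds and trains these networks with MATLAB's Reinforcement Learning Designer; the sketch below only illustrates the described layer sizes in PyTorch. The choice of activation functions, the output scaling bound `n_max`, and the class names are assumptions.

```python
import torch
import torch.nn as nn


class Actor(nn.Module):
    """Policy network: state [P_t, P, P_e] -> screw conveyor speed n0.
    Two hidden layers with three neurons each, as described above."""
    def __init__(self, n_max: float = 5.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, 3), nn.ReLU(),
            nn.Linear(3, 3), nn.ReLU(),
            nn.Linear(3, 1), nn.Tanh(),   # bounded output
        )
        self.n_max = n_max                # assumed speed bound (rpm)

    def forward(self, state):
        return self.n_max * self.net(state)


class Critic(nn.Module):
    """Value network: (state, action) -> Q value.
    Three hidden layers with 50, 25, and 50 neurons, as described above."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, 50), nn.ReLU(),
            nn.Linear(50, 25), nn.ReLU(),
            nn.Linear(25, 50), nn.ReLU(),
            nn.Linear(50, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```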

4 Establishment of an intelligent control model

The DDPG agent model is built using machine learning and deep learning methods and is trained and optimized to achieve more precise and efficient control of the task. The model training process consists of four main parts: (1) defining the state variables, i.e., the environmental states that the DDPG agent can observe; (2) defining the action variables, i.e., the actions available to the DDPG agent; (3) defining the reward function, i.e., the system's feedback to the DDPG agent's actions in a given state; and (4) model training.

4.1 Defining the state space variables

The definition of the state variables is vital in reinforcement learning, as the DDPG agent transforms the environmental information it receives into a set of state representations. The real-time sealed cabin pressure, the target pressure, and the absolute pressure error reflect the current and target states of the shield machine during construction. By monitoring the real-time sealed cabin pressure, the DDPG agent can sense changes in the environment in time and adjust the control strategy of the shield machine. The target pressure is the ideal pressure value set by the shield machine control system; it guides the operation of the shield machine and is an important guarantee of safe tunneling. The absolute pressure error helps the DDPG agent understand the current control performance and judge the effectiveness of the current control strategy, so that appropriate adjustments can be made to reduce the pressure error and achieve a more stable tunneling process. By monitoring and updating these state parameters, the dynamic changes of the environment can be better understood, and the control strategy of the shield machine can be adjusted to meet different geological conditions and construction requirements. Equation (13) defines the state variables.

$${S}_{t}=\left[{P}_{t},P,{P}_{e}\right]$$
(13)

where \({P}_{t}\) represents the real-time sealed cabin pressure; \(P\) represents the target pressure; and \({P}_{e}\) represents the absolute error between the real-time sealed cabin pressure and the target pressure. The DDPG agent can continuously optimize and adjust its actions by trying different actions, observing the resulting changes in the state values, and comparing them with previous state values, thereby achieving better control effects. Specifically, DDPG updates the agent's behavior policy through continuous policy optimization to maximize the cumulative return of the reward function.

4.2 Defining the action space variables

The selection of the action is crucial for the training effectiveness of the model and the performance of the DDPG agent. An ideal action space should cover all appropriate actions, enabling the DDPG agent to learn the optimal policy quickly and accurately. The screw conveyor speed is one of the key parameters for controlling the pressure in the sealed cabin, and it controls the sealed cabin pressure most directly and quickly. By adjusting the screw conveyor speed, the discharge speed of the slag in the sealed cabin of the shield machine can be controlled, which directly affects the trend of the sealed cabin pressure. Taking the screw conveyor speed as the action space allows the agent to optimize the control strategy flexibly and effectively and thus directly control the change in the sealed cabin pressure, keeping it within a safe and controllable range. Equation (14) defines the action space.

$${a}_{t}={n}_{0}$$
(14)

where \({n}_{0}\) represents the screw conveyor speed. The DDPG agent transitions between environmental states by changing the screw conveyor speed. Through continuous self-learning under the supervision of the value network and the reward function, the policy network seeks the action that maximizes the cumulative reward value.

4.3 Defining the reward function

The design of the reward function is crucial for training a high-performing intelligent agent. It represents the feedback signal that the agent receives from the environment after performing an action. The reward function reflects the agent's degree of task completion and guides the agent in the desired direction. Each action step comes with corresponding reward feedback, which makes it easier to steer the agent toward the desired behavior and accelerates training. The complex working environment and pressure imbalance in the shield machine can cause surface deformation, leading to casualties in severe cases. Therefore, the main objective is to regulate the sealed cabin pressure. By optimizing the pressure control strategy during shield tunneling, a dynamic balance can be maintained between the sealed cabin pressure and the excavation surface pressure, achieving the most desirable working state. A reward function is therefore designed based on this ideal working state:

$$Reward=\begin{cases}10 & 0\le {P}_{e}\le 0.01\\ -1 & {P}_{e}\ge 0.01\\ -50 & {P}_{t}\le 0\parallel {P}_{t}\ge 1\end{cases}$$
(15)

Here, \({P}_{e}\) represents the absolute error between the target pressure and the real-time sealed cabin pressure, and \({P}_{t}\) represents the real-time sealed cabin pressure. In the sealed cabin pressure control task, the absolute pressure error is to be kept within the target band of (0, 0.01) MPa, and a corresponding reward and punishment mechanism is set up: a positive reward is given when the sealed cabin pressure is stable within the target band, and a negative reward is given when the pressure error exceeds it. Additionally, a hard boundary is set on the sealed cabin pressure throughout the training process, and a significant penalty is incurred when it is reached. This ensures the stability and safety of the entire control process and encourages the agent model to quickly learn the optimal solution while exploring, improving control accuracy and safety.
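
A minimal Python sketch of Eq. (15) might look as follows; giving the safety boundary priority over the other branches and the handling of the boundary value 0.01 MPa are interpretations rather than details stated in the paper.

```python
def reward(P_t: float, P_target: float) -> float:
    """Sketch of the reward function in Eq. (15). Pressures in MPa."""
    P_e = abs(P_t - P_target)
    if P_t <= 0.0 or P_t >= 1.0:   # hard safety boundary: large penalty
        return -50.0
    if P_e <= 0.01:                # pressure within the target band: positive reward
        return 10.0
    return -1.0                    # pressure error outside the band: small penalty
```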

4.4 Model training

The DDPG agent receives a state variable \({S}_{t}\) from the environment, comprising the sealed cabin earth pressure, the target pressure, and the absolute pressure error, and inputs it into the Actor network. The Actor network selects the corresponding action \({a}_{t}\) based on the current state \({S}_{t}\) and applies it to the environment. The environment outputs the corresponding reward \({r}_{t}\) and the next state \({S}_{t+1}\). At the same time, the resulting quadruple \(\left({S}_{t},{a}_{t},{r}_{t},{S}_{t+1}\right)\) is added to the experience replay buffer. The number of data groups in the replay buffer is then checked: if it reaches the preset threshold, training is performed; otherwise, sampling continues. The specific training process is shown in Fig. 7.
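
For illustration, the experience replay step described above could be sketched in Python as follows; the buffer capacity and the interface names are assumptions.

```python
import random
from collections import deque


class ReplayBuffer:
    """Sketch of the experience replay buffer used in the training loop."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        # store the quadruple (S_t, a_t, r_t, S_{t+1})
        self.buffer.append((state, action, reward, next_state))

    def ready(self, threshold: int) -> bool:
        # training starts only once enough experience has been collected
        return len(self.buffer) >= threshold

    def sample(self, batch_size: int):
        # random sampling breaks the temporal correlation between samples
        return random.sample(self.buffer, batch_size)
```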

Fig. 7 Training process

N pieces of data are sampled from the experience replay buffer and transmitted to both the Actor and Critic networks. The Critic network evaluates the current action \({a}_{t}\) based on the sampled data and calculates the corresponding evaluation value \(Q\left({S}_{i},{a}_{i},{\theta }_{Q}\right)\). The target networks are updated slowly via the soft updates in Eqs. (16) and (17). The next state \({S}_{t+1}\) is input into the target Actor network to obtain the corresponding action \({a}_{t+1}\), and the Target Critic network, whose parameters change more slowly and are therefore more stable, computes the evaluation value of the next state-action pair, which forms the target expected value in Eq. (18).

$$\theta_{{\mu^{\prime}}} = \tau \theta_{\mu } + \left( {1 - \tau } \right)\theta_{{\mu^{\prime}}}$$
(16)
$$\theta_{{Q^{\prime}}} = \tau \theta_{Q} + \left( {1 - \tau } \right)\theta_{{Q^{\prime}}}$$
(17)
$$y_{i} = r_{i} + \gamma Q^{\prime}\left( {s_{i + 1} ,a_{i + 1} ,\theta_{{Q^{\prime}}} } \right)$$
(18)

In these formulas, \(\tau\) is the soft update coefficient (target smoothing factor); \({\theta }_{\mu }\) represents the Actor network parameters; \({\theta }_{Q}\) represents the Critic network parameters; \(\theta_{{\mu^{\prime}}}\) represents the target policy network parameters; \(\theta_{{Q^{\prime}}}\) represents the Target Critic network parameters; \(\gamma\) is the discount rate; \({r}_{i}\) represents the reward value; \({y}_{i}\) is the target expected value; and \(Q^{\prime}\left( {s_{i + 1} ,a_{i + 1} ,\theta_{{Q^{\prime}}} } \right)\) is the evaluation value of the next state-action pair.

The Critic network's objective is built from the TD error \({y}_{i}-Q\left({s}_{i},{a}_{i},{\theta }_{Q}\right)\): gradient descent is used to minimize the mean squared difference between the target expected value and the evaluation value, which completes the update of the Critic network parameters, as shown in Eq. (19). In addition, under the guidance of the Critic network, the Actor network updates its parameters by gradient descent on the objective function in Eq. (20), thereby maximizing the expected cumulative reward.

$$J(w)=\frac{1}{n}\sum_{i=1}^{n}{\left({y}_{i}-Q\left({s}_{i},{a}_{i},{\theta }_{Q}\right)\right)}^{2}$$
(19)
$$D\left( w \right) = - \frac{1}{m}\mathop \sum \limits_{i = 1}^{m} Q\left( {s_{i} ,a_{i} ,\theta_{Q} } \right)$$
(20)

where \(Q\left({s}_{i},{a}_{i},{\theta }_{Q}\right)\) is the evaluation value of the action; \(J(w)\) is the mean squared difference between the target expected value and the evaluation value; and \(D(w)\) is the Actor objective, the negative mean evaluation value, so that minimizing it maximizes the cumulative expected return.
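
Purely as an illustration of Eqs. (16)-(20), a single DDPG update step could be sketched in PyTorch as follows; the paper's implementation uses MATLAB's Reinforcement Learning Designer, and the batch layout, optimizer handling, and the values of \(\gamma\) and \(\tau\) shown here are assumptions.

```python
import torch


def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99, tau=0.005):
    """Sketch of one DDPG update step following Eqs. (16)-(20)."""
    state, action, reward, next_state = batch     # tensors of shape [N, ...]

    # Eq. (18): target expected value y_i from the target networks
    with torch.no_grad():
        next_action = target_actor(next_state)
        y = reward + gamma * target_critic(next_state, next_action)

    # Eq. (19): minimize the mean squared TD error of the Critic
    critic_loss = ((y - critic(state, action)) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Eq. (20): minimize D(w) = -mean Q, i.e. maximize the expected return
    actor_loss = -critic(state, actor(state)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Eqs. (16)-(17): soft update of the target networks
    for net, target in ((actor, target_actor), (critic, target_critic)):
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.copy_(tau * p.data + (1 - tau) * p_t.data)
```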

The pseudo-code of the DDPG agent training process for the shield screw conveyor is shown in Algorithm 1.

Algorithm 1 DDPG agent training process of the shield screw conveyor

5 Simulation results analysis

The case study selected for this research is a section of Beijing Metro Line 10 with a tunnel buried 12.6 m deep and a water table at a depth of 7.1 m. The stratum is a sand-soft rock stratum. The geological profile of the tunnel is depicted in Fig. 8. The pressure environment model of the proposed control approach is built with Simulink in MATLAB, and the Reinforcement Learning Designer toolbox is used to train the control model. Because shield tunneling is slow, the propulsion speed is set to 40 mm/min for the simulation experiments. The method is evaluated with respect to the rationality of the intelligent control strategy, the adaptability of the intelligent control model, and the accuracy and superiority of the sealed cabin pressure control effect.

Fig. 8 Tunnel geological section

5.1 Hyperparameter selection

To find the best combination of learning rates for the policy (Actor) network and the evaluation (Critic) network, the learning rates of the Actor and Critic networks are each set to 0.1, 0.01, 0.001, and 0.0001 and tested in combination. Whether the current learning rate combination is the best is judged by observing and analyzing the cumulative reward value of the intelligent control model over time during the control process. The test results are shown in Fig. 9; the legend in the result diagram is given as (Actor learning rate, Critic learning rate).
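
The grid test described above could be sketched as follows; `train_and_evaluate` is a hypothetical callback standing in for a full training run (performed in the paper with the Reinforcement Learning Designer), and reducing each run to a single cumulative-reward score is a simplification.

```python
from itertools import product
from typing import Callable, Dict, Tuple


def grid_search(train_and_evaluate: Callable[[float, float], float]
                ) -> Dict[Tuple[float, float], float]:
    """Sketch of the Actor/Critic learning-rate grid test described above.
    `train_and_evaluate(actor_lr, critic_lr)` is assumed to train the DDPG agent
    and return a cumulative-reward score for that combination."""
    candidate_rates = [0.1, 0.01, 0.001, 0.0001]
    return {(a_lr, c_lr): train_and_evaluate(a_lr, c_lr)
            for a_lr, c_lr in product(candidate_rates, candidate_rates)}
```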

Fig. 9 Comparison results of different learning rate combinations of Actor-Critic network

From Fig. 9 (1, 2, 4, 5, 7), it can be seen that when the learning rate combination is (0.01, 0.01), (0.0001, 0.1), (0.001, 0.001), (0.0001, 0.0001), or (0.0001, 0.01), the reward value of the intelligent control model tends toward negative infinity over time and the control effect is unstable. From Fig. 9 (3, 8, 9), it can be seen that when the learning rate combination is (0.1, 0.001), (0.1, 0.1), or (0.01, 0.001), the reward value fluctuates greatly during the control process and the control effect is unstable. Figure 9 (6) clearly shows that when the learning rate combination is (0.0001, 0.001), i.e., the learning rate of the Actor network is 0.0001 and that of the Critic network is 0.001, the reward value increases and converges over time, the cumulative reward value approaches its maximum, and the control effect is relatively stable.

In summary, the optimal learning rate combination for the Actor-Critic network is (0.0001, 0.001), for which the cumulative reward is the highest and the control effect is the best. The structural hyperparameter settings used in this paper are listed in Table 1.

Table 1 Selection of hyperparameters

5.2 Rationality of control strategy

The rationality of the intelligent control strategy is verified by comparing the control effect of the proposed intelligent control strategy and the actual strategy on the sealed cabin pressure. The target pressure is 0.15 MPa, and the strategy distribution comparison is shown in Fig. 10.

Fig. 10 Strategy distribution comparison diagram

As shown in Fig. 10, the screw conveyor speed under the actual working condition is mainly concentrated between 2.8 and 3.2 rpm. According to expert experience, the best screw conveyor speed under the current working condition is 3 rpm. The screw conveyor speed of the intelligent control strategy is concentrated at 3.1 rpm, an error of 0.1 rpm relative to the expert-experience optimum. Compared with the actual working condition, the screw conveyor speed of the intelligent control strategy is more stable, fluctuates over a smaller range, and is closer to the expert-experience value of the best screw conveyor speed.

The sealed cabin pressure obtained with the intelligent control strategy is compared with the sealed cabin pressure under the actual working condition in Fig. 11. The sealed cabin pressure curve obtained when the shield machine executes the intelligent control strategy is smoother than the pressure curve of the actual working condition, the control effect is more stable, and the error relative to the target pressure is smaller.

Fig. 11 Sealed cabin pressure comparison diagram

Combining the above analyses, it is evident that the intelligent control strategy controls the sealed cabin pressure more stably, which demonstrates the rationality of the intelligent control strategy.

5.3 Self-adaptability of intelligent control model

To test the adaptability of the method to geological conditions, the control effect of the intelligent control model on the sealed cabin pressure is tested under different geological conditions. The tests simulate the soil conditions of clay, silt, sand, pebble soil, mudstone, sandstone, soft rock, fully weathered rock, and highly weathered rock. The target pressure values calculated for the corresponding soils are: clay, 0.15 MPa; silt, 0.17 MPa; sand, 0.18 MPa; pebble soil, 0.2 MPa; mudstone, 0.23 MPa; sandstone, 0.25 MPa; soft rock, 0.28 MPa; fully weathered rock, 0.3 MPa; and highly weathered rock, 0.32 MPa. For clarity of presentation, each control curve in the figure is labeled with the target pressure value of the corresponding soil condition. The test results are shown in Fig. 12.

Fig. 12 Control effect under different soil conditions

Figure 12 shows that the intelligent control model controls the sealed cabin pressure well under all nine geological conditions. When the soil is clay, silt, sand, or pebble soil, the sealed cabin pressure gradually converges to the corresponding target pressure after 45 s and finally fits the target value. When the soil is mudstone, sandstone, soft rock, fully weathered rock, or highly weathered rock, the soil is harder, and the sealed cabin pressure gradually converges to the corresponding target pressure after 60 s before finally fitting it. Thus, under all nine geological conditions, the sealed cabin pressure begins to converge to the corresponding target pressure within 60 s and reaches the fitted state at about 80 s. This time is within one data sampling period, which fully meets the requirements of data collection in normal tunneling construction. Therefore, the intelligent control model controls the sealed cabin pressure well under different geological conditions.

In addition, the geological adaptability of the intelligent model is further verified by testing its control effect on the sealed cabin pressure when the soil changes. The tests simulate transitions from clay, silt, sand, pebble soil, mudstone, sandstone, soft rock, fully weathered rock, and highly weathered rock to each of the remaining eight soils in turn, for a total of 72 cases. The test results are shown in Fig. 13.

Fig. 13 Control effect when soil quality changes

As the control effect diagrams for soil changes show, the soil change in all of the above cases occurs at 200 s, and the sealed cabin target pressure changes accordingly. Taking Fig. 13a as an example, the target pressure changes from 0.32 MPa to 0.3 MPa, 0.28 MPa, 0.25 MPa, 0.23 MPa, 0.2 MPa, 0.18 MPa, 0.17 MPa, and 0.15 MPa, i.e., the soil transitions from highly weathered rock to fully weathered rock, soft rock, sandstone, mudstone, pebble soil, sand, silt, and clay, respectively. The sealed cabin pressure fluctuates from 200 to 250 s, then tends toward the target pressure of the corresponding soil after 250 s and gradually fits it. Figure 13 also shows that when the hardness of the two soils before and after the transition is similar, the adjustment time of the intelligent model is relatively short, basically between 30 and 50 s. When the strengths of the two soils differ greatly, the adjustment takes relatively longer, about 60–70 s, but this time is still within one data sampling period and fully meets the requirements of data collection in normal tunneling construction. These results show that the model adjusts in a timely and accurate manner when the geology changes, indicating that it has strong geological adaptability.

Fig. 14 Comparison of output pressure value and target pressure value

In summary, the intelligent control model can accurately control the sealed cabin pressure under different soil conditions and can complete soil transition tasks well while maintaining accurate control of the sealed cabin pressure. Therefore, the intelligent control model has strong soil adaptability and responds well to dynamic changes in soil conditions.

5.4 Efficiency of the control method

The sealed cabin pressure control effect obtained with the intelligent control strategy of the shield machine is tested. The target pressure is 0.15 MPa, and the comparison between the output pressure and the target pressure is shown in Fig. 14.

Figure 14 shows that the sealed cabin pressure output by the shield machine executing the intelligent control strategy fluctuates slightly around the target pressure in the first 60 s and coincides with the target pressure after 60 s, until the two curves fit completely. Figure 15 shows that the error between this output pressure and the target pressure fluctuates strongly in the first 20 s, fluctuates slightly between -0.05 and 0.03 MPa from 20 to 60 s, gradually stabilizes after 60 s, and finally settles at 0 MPa.

Fig. 15 Error diagram of output pressure and target pressure

Based on these results, it can be concluded that the intelligent control model provides excellent control of the sealed cabin pressure: the final error relative to the target pressure is 0 MPa, ensuring the safe operation of the shield machine.

5.5 The superiority of control effect

To validate the effectiveness of the intelligent control, the sealed cabin pressure control performance is compared with that of LSTM- and MPC-based methods. The target pressure calculated for the current soil is 0.15 MPa. The comparison results are depicted in Fig. 16.

Fig. 16 Comparison results of sealed cabin pressure

Figure 16 shows that although the LSTM regulation of the sealed cabin pressure is superior to manual adjustment based on experience, its pressure still fluctuates strongly. The control effect of MPC is more stable than that of the LSTM model, but its error relative to the target pressure is larger. The intelligent control strategy achieves the best control of the sealed cabin pressure: its output is much closer to the target pressure, and the pressure error is reduced to 0 MPa, giving the best possible control performance.

As demonstrated above, the intelligent control method proposed in this paper surpasses both the MPC and LSTM methods in control accuracy and stability during shield tunneling. It provides precise control of the sealed cabin pressure, thereby reducing the risk of earth pressure fluctuation and ensuring safe and efficient construction of the shield machine. Overall, the proposed method holds promising potential for improving the construction and operation of future tunnels.

6 Conclusion

In this study, a deep reinforcement learning method is applied to the intelligent control of the screw conveyor speed. The method constructs an Actor-Critic model in which the Actor network learns the strategy and the Critic network evaluates it, thereby realizing intelligent control of the screw conveyor speed and greatly reducing manual intervention. The proposed method dynamically adjusts the screw conveyor speed and accurately controls the sealed cabin pressure, with a pressure error of 0 MPa, during real-time interaction with the geological environment. Tests show that: (1) the strategy actions given by this method are consistent with the actual strategy actions; (2) the method accurately controls the sealed cabin pressure under different soil conditions, adapts smoothly to soil change processes while maintaining control accuracy, and has strong soil adaptability; and (3) the method is superior to other control methods in real-time adjustment ability, self-learning ability, and soil adaptability, with clear advantages for complex geology and soil transition tasks. The intelligent control method reduces the influence of manual intervention and human error, improves the degree of automation of the tunneling process, and provides effective technical support for optimizing the shield tunneling process. It should be noted that the environment in this paper is built from the sealed cabin pressure mechanism model, which may not accurately reproduce the real sealed cabin earth pressure environment; repeated field tests are therefore needed to improve the feasibility and practicability of the reinforcement learning scheme. In addition, the simulation experiments were conducted with a fixed tunneling speed; intelligent coordinated control of multiple parameters should be considered in the future. Combining other models with the reinforcement learning control scheme could also improve training speed and generalization ability, and future work could focus on the design and optimization of intuitive, easy-to-use human-computer interaction interfaces.