1 Introduction

Autonomous driving, as a major development direction, is drawing increasing attention worldwide. Model predictive control (MPC) is able to handle optimization problems with multiple objectives [1]. It iteratively solves an optimization problem over a finite horizon to provide online optimal solutions subject to constraints [2]. By constructing such an optimization problem, an MPC algorithm can keep the vehicle collision-free and improve its performance in dynamic environments [3]. According to Google Scholar, more than 7000 MPC-based papers have been published in the area of autonomous vehicles in each of the past three years.

MPC-based studies on autonomous vehicles mainly focus on planning a trajectory or tracking the planned one. A reasonable design of performance indices enables autonomous vehicles to maintain an appropriate relative distance and velocity with respect to surrounding vehicles, effectively alleviating traffic congestion and reducing traffic accidents [4]. Jeong et al. [5] designed an MPC with fixed weights to improve trajectory and speed tracking performance by distributing control forces to multiple actuators. Ammour et al. [6] studied expressway trajectory planning for autonomous vehicles based on a fixed-weight MPC so that a vehicle can improve safety by overtaking and lane changing. Wu et al. [7] designed a non-local MPC-based controller that noticeably attenuates oscillations and provides good ride comfort. Zhou et al. [8] developed an MPC based on a car-following (CF) model with fixed weights, whose objective function uses the historical state data of the preceding vehicle to respond smoothly to merging traffic. Sun et al. [9] proposed a hybrid MPC method for autonomous speed regulation that improves ride comfort by approximating the vehicle longitudinal dynamics as a two-mode discrete-time hybrid logic dynamical system. Energy-optimal speed trajectories, as one of the critical evaluation indices of vehicle performance, are also widely studied. Dollar et al. [10] applied a fixed-weight MPC to longitudinal motion regulation in mixed traffic to improve fuel economy, with an objective function built on the instantaneous motion information of multiple preceding vehicles. Sun et al. [11] adopted a target-switching MPC for global velocity tracking and adaptation, which greatly reduced the computation time for speed planning and saved 22.0% energy compared to human driving. As these studies show, most MPC controllers adopt fixed weights, which can achieve good performance in specific scenarios. However, real-world driving scenarios are uncertain, dynamic, and time-varying, so fixed-weight controllers do not perform well in real-world driving [12].

An excellent driving algorithm should be able to dynamically balance multiple performance indicators and adapt to different scenarios [13]. Some research has addressed this problem [14,15,16], for example weight-adjustment or adaptive methods based on fuzzy control or personal driving data. Chang et al. [17] performed real-time optimization of an MPC weight matrix via fuzzy control, which improved tracking accuracy, ride comfort, and stability. Pang et al. [18] proposed an MPC that adaptively adjusts the weights of the cost function with a fuzzy inference system, significantly improving MPC performance and control accuracy. Shivram et al. [19] used a fuzzy-logic-based MPC to improve the CF accuracy and ride comfort of autonomous vehicles. Tian et al. [20] proposed a coordinated tracking control strategy that adjusts the MPC weights with a fuzzy rule, improving path tracking accuracy and stability at high speed on large-curvature roads. Liu et al. [21] presented a shared steering mechanism based on MPC in which the weight changes adaptively according to the result of a risk assessment and a predefined strategy to ensure driving safety. Liang et al. [22] proposed an adaptive multi-MPC scheme that introduces a rule-based weight adaptation mechanism to handle various driving conditions, especially extreme cases. Rokonuzzaman et al. [23] proposed a longitudinal MPC controller for autonomous driving that adaptively adjusts its weights with a data-driven approach, balancing vehicle dynamics aspects such as speed, acceleration, and jerk. These weight-adjustment methods offer fault tolerance, reliability, and traceability and can improve vehicle performance in several specific scenarios, but they generally cannot cope well with dynamic scenarios.

Reinforcement learning (RL), as a self-learning method, encourages vehicles to explore different scenarios under a reasonably designed reward function, accumulating experience and improving performance through trial and error [24]. Existing studies have extensively discussed reward function design and vehicle performance optimization for diverse application scenarios [25,26,27], and RL is well suited to dealing with scene changes and performance optimization [28]. However, RL also has shortcomings: it can be difficult to converge, and even when convergence is achieved, the trained behavior is not always satisfactory [29]. In addition, a vehicle is a safety-critical system that cannot explore at random; exploration must always be grounded in safety [30]. RL alone is therefore not enough to meet the requirements.

In this paper, we combine MPC with RL to automatically balance different performance indices in dynamic scenarios. In this way, the hard constraints of the MPC algorithm ensure safety, while RL extracts the complex features of dynamic scenarios as the basis for adaptive correction. The proposed combined strategy can therefore adaptively adjust the controller parameters in different scenarios. The main contributions of this paper are summarized as follows.

  1. This study proposes an RL-based weight-adjustment strategy for MPC, which adjusts the weights according to the state of the surrounding environment to achieve a trade-off among the safety, comfort, and energy saving of autonomous vehicles.

  2. It summarizes the correlation between the performance indices of speed control and the adaptive rules for the MPC parameters, so as to comprehensively improve the overall performance of autonomous vehicles.

The remainder of this paper is organized as follows: Section 2 introduces the proposed combined RL and MPC strategy for autonomous vehicles. Section 3 defines the scenario risk assessment and vehicle performance analysis. Section 4 introduces the MPC and RL algorithms. Finally, evaluation results of the proposed strategy for autonomous speed control are presented in Section 5.

2 Structure of the Combined Strategy

This section introduces the architecture of the combined RL and MPC strategy for autonomous vehicles and defines four scenarios according to the risk level of the environment. As shown in Figure 1, the RL algorithm takes the environment state of the current scene as input and outputs the variable MPC weights. The MPC then calculates the optimal acceleration for the lower-layer controller. Furthermore, a risk threshold model is proposed for scenario risk assessment to guide the RL reward function, and the MPC constraints are considered as well.

Figure 1  Combined RL and MPC strategy

As shown in Figure 2, the scenarios encountered by autonomous vehicles are complex and diverse. Different driving scenes have different characteristics, the road conditions change dynamically, and even different vehicles on the same road section drive differently. Four scenarios are shown on the right side of Figure 2, where the color gradation represents the change of scene risk caused by the change of relative distance: yellow, green, orange, and red represent the crisis scenario, the safety scenario, the low-risk scenario, and the high-risk scenario, respectively. In the crisis scenario, the vehicle should reduce the ego-predecessor distance to maintain the CF task; in the safety scenario, the vehicle should keep its speed constant or accelerate slowly to reduce unnecessary jitter and pursue better energy efficiency and ride comfort; in the low-risk scenario, more emphasis should be placed on safety because of the risk of collision; in the high-risk scenario, the vehicle must slow down immediately to increase the inter-vehicle distance and ensure that a collision does not occur. Scenario adaptability in this paper means that the vehicle autonomously adjusts its behavior according to the scenario risk level so as to maximize the vehicle performance mentioned above.

Figure 2  Dynamic scene and classification

3 Scenario Risk Assessment and Performance Analysis

To improve the adaptive adjustment ability of the multi-weight MPC, it is necessary to analyze the risk degree and the priority requirements of the vehicle in different scenarios. This paper proposes a risk threshold model (RTM), which evaluates the scene risk by analyzing the characteristics of the environment and classifies scenes into four risk levels: the crisis scenario, the high-risk scenario, the low-risk scenario, and the safety scenario.

The inputs of the RTM are the relative distance and relative velocity of the two vehicles in the CF maneuver, and the output is the risk level of the scene, as shown in Figure 3.

Figure 3  Risk assessment mechanism of the RTM

Based on the adjustment mechanism, some reference values are shown in Table 1.

Table 1 Parameters and symbols of RTM

When the relative distance exceeds the maximum following distance defined for the CF scenario, the vehicle does not meet the prerequisite of the CF task and is therefore in the crisis scenario. When the relative distance lies between the safe stopping distance and the maximum following distance, the autonomous vehicle can always stop safely and is in the safety scenario. When the relative distance lies between the dangerous stopping distance and the safe stopping distance, the outcome depends on the relative velocity: if the relative velocity is below the dangerous threshold, the situation splits into the low-risk and the safety scenario, whereas if the relative velocity is above the safe threshold, the vehicle can stop in time and is in the safety scenario. When the relative distance is smaller than the dangerous stopping distance and the relative velocity is below the safe threshold, a collision will occur, so the vehicle is in the high-risk scenario; when the relative velocity is above the safe threshold, the preceding vehicle keeps pulling away and the vehicle is generally in the safety scenario. However, since the state of the preceding vehicle is uncertain and the stopping distance of the ego vehicle depends on the relative velocity, this case can be further subdivided into the low-risk and the safety scenario.
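To make the RTM branching concrete, the following minimal Python sketch classifies a scene from the relative distance and relative velocity. The threshold names and numerical values (`d_follow_max`, `d_safe_stop`, `d_danger_stop`, `v_safe_rel`) and the exact tie-breaking between the low-risk and safety cases are illustrative assumptions, not the calibrated values of Table 1.

```python
def classify_scene(rel_dist, rel_vel,
                   d_follow_max=60.0,   # maximum following distance (assumed value)
                   d_safe_stop=30.0,    # safe stopping distance (assumed value)
                   d_danger_stop=15.0,  # dangerous stopping distance (assumed value)
                   v_safe_rel=0.0):     # safe relative-velocity threshold (assumed)
    """Illustrative risk-threshold-model (RTM) classifier.

    rel_vel > v_safe_rel is taken to mean the preceding vehicle is pulling away.
    Returns one of: 'crisis', 'safety', 'low-risk', 'high-risk'.
    """
    if rel_dist > d_follow_max:
        return "crisis"        # CF prerequisite violated: the gap is too large
    if rel_dist >= d_safe_stop:
        return "safety"        # the ego vehicle can always stop safely
    if rel_dist >= d_danger_stop:
        # between the dangerous and safe stopping distances: depends on relative velocity
        return "safety" if rel_vel > v_safe_rel else "low-risk"
    # closer than the dangerous stopping distance
    return "safety" if rel_vel > v_safe_rel else "high-risk"


print(classify_scene(rel_dist=20.0, rel_vel=-2.0))  # 'low-risk' under these assumed thresholds
```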

During driving, vehicles seek a balance among different performance objectives, such as CF performance, safety, fuel economy, and ride comfort, in dynamic scenarios. However, these objectives impose conflicting constraints. Exploring the correlation between performance and constraints not only helps to optimize vehicle performance and improve scene adaptability, but also lays a foundation for subsequent research on performance improvement and scene extension.

4 Combined RL and MPC Strategy

This section introduces the objective function and constraints of the MPC controller and the RL algorithm used for adaptive adjustment of the weights in the objective function.

4.1 Model Predictive Controller

As the main algorithm for longitudinal following control, the MPC is constructed from the velocity and position information of the ego and preceding vehicles and outputs an acceleration that satisfies the constraints [31]. The longitudinal motion planning problem in this study is based on the following longitudinal kinematic model:

$$\dot{X} = v_{X} ,\quad \dot{v}_{X} = a_{X} ,$$
(1)

where \(X\), \(v_{X}\), and \(a_{X}\) represent the longitudinal displacement, velocity, and acceleration, respectively. The longitudinal displacement and velocity form the state variable \(x\) and the output variable \(y\), and the longitudinal acceleration is the control variable \(u\):

$$x = \left[ {\begin{array}{*{20}c} X \\ {v_{X} } \\ \end{array} } \right],\quad y = \left[ {\begin{array}{*{20}c} X \\ {v_{X} } \\ \end{array} } \right],\quad u = a_{X} .$$
(2)
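As a concrete illustration (not taken from the paper), the double-integrator model of Eqs. (1)–(2) can be discretized with a zero-order hold on the acceleration; the sample time below is an assumed value.

```python
import numpy as np

dt = 0.1  # prediction sample time (assumed value)

# Discrete-time double integrator: x_{k+1} = A x_k + B u_k, with x = [X, v_X], u = a_X
A = np.array([[1.0, dt],
              [0.0, 1.0]])
B = np.array([0.5 * dt**2, dt])
C = np.eye(2)  # the output y = [X, v_X] equals the state

def step(x, u):
    """Propagate the longitudinal kinematic state one sample ahead."""
    return A @ x + B * u

x0 = np.array([0.0, 10.0])  # position 0 m, speed 10 m/s
print(step(x0, 1.0))        # state after 0.1 s under 1 m/s^2 acceleration
```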

Then, the MPC longitudinal motion planning problem can be described as follows:

$$\begin{gathered} J_{t} \left( {x(0),u_{t - 1} ,\Delta u,\varepsilon } \right) = \sum\limits_{i = 1}^{{N_{p} }} {\left\| {y_{{t + it_{p} \left| t \right.}} - y_{{{\text{ref}},t + it_{p} \left| t \right.}} } \right\|_{Q}^{2} } \\ + \sum\limits_{j = 0}^{{N_{c} - 1}} {\left\| {u_{{t + jt_{c} \left| t \right.}} } \right\|_{{R_{u} }}^{2} } + \sum\limits_{i = 1}^{{N_{c} - 1}} {\left\| {\Delta u_{{t + it_{c} \left| t \right.}} } \right\|_{{R_{du} }}^{2} } + \rho \varepsilon^{2} , \\ \end{gathered}$$
(3)
$$\mathop {\min }\limits_{\Delta u,\varepsilon } J_{t} \left( {x(0),u_{t - 1} ,\Delta u,\varepsilon } \right),$$
$$\begin{gathered} {\text{s}}.{\text{t}}.\;\; u_{\min } \le u(k) \le u_{\max } ,\quad k = 0,1, \cdots ,N_{c} - 1, \\ \Delta u_{\min } \le \Delta u(k) \le \Delta u_{\max } ,\quad k = 0,1, \cdots ,N_{c} - 1, \\ x_{\min } - \varepsilon 1_{n \times 1} \le x(k) \le x_{\max } + \varepsilon 1_{n \times 1} ,\quad k = 0,1, \cdots ,N_{p} , \\ y_{\min } - \varepsilon 1_{p \times 1} \le y(k) \le y_{\max } + \varepsilon 1_{p \times 1} ,\quad k = 0,1, \cdots ,N_{p} , \\ 0 \le \varepsilon (k) \le \varepsilon_{\max } , \\ \end{gathered}$$

The symbols of the relevant parameters are shown in Table 2.

Table 2 Parameters and symbols of MPC problems

The desired longitudinal position \(y_{{{\text{ref}},t + it_{p} \left| t \right.}}\) is jointly determined by the position of the preceding vehicle \(X_{{{\text{f}},t + it_{p} \left| t \right.}}\) and the safe stopping distance \(D_{{{\text{safe}},t}}\), and is affected by the velocity \(V_{{{\text{f}},t + it_{p} \left| t \right.}}\) and acceleration \(a_{{{\text{f}},t + it_{p} |t}}\) of the preceding vehicle.

In the objective function, the first term reflects the longitudinal CF safety requirement and the ability to track the desired values. The second term reflects the fuel economy requirement, that is, the ability to suppress excessive longitudinal acceleration. The third term reflects the ride comfort requirement, namely the ability to limit excessive longitudinal jerk. The fourth term prevents the optimization problem from becoming infeasible due to prediction-model error.

In this paper, the weight of the first term in the MPC objective function is trained by reinforcement learning, so as to dynamically adjust the expected longitudinal CF distance and improve the scene adaptability of the vehicle.
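As a sketch of how such a variable-weight problem can be assembled in practice, the following CVXPY formulation mirrors the softened problem of Eq. (3) with the tracking weight `Q` supplied externally (here, by the RL agent). The horizon lengths, bounds, fixed weights, and the softened speed limit are illustrative assumptions rather than the values used in the paper.

```python
import cvxpy as cp
import numpy as np

def solve_mpc(x0, y_ref, Q, Np=20, Nc=5, dt=0.1,
              Ru=1.0, Rdu=5.0, rho=1e3,
              u_bounds=(-4.0, 2.0), du_bounds=(-3.0, 3.0), v_max=30.0):
    """Soft-constrained longitudinal MPC with an RL-supplied tracking weight Q.

    x0    : current state [X, v_X]
    y_ref : array of shape (Np, 2) with the desired [position, velocity]
    Q     : CF tracking weight supplied online by the RL agent
    All numerical values (horizons, bounds, weights) are illustrative assumptions.
    """
    A = np.array([[1.0, dt], [0.0, 1.0]])   # discretized double integrator
    B = np.array([0.5 * dt**2, dt])

    x = cp.Variable((Np + 1, 2))
    u = cp.Variable(Nc)
    eps = cp.Variable(nonneg=True)           # slack keeps the problem feasible

    cost = rho * cp.square(eps)
    constraints = [x[0] == x0]
    for k in range(Np):
        uk = u[min(k, Nc - 1)]               # hold the last move beyond the control horizon
        constraints += [x[k + 1] == A @ x[k] + B * uk,
                        x[k + 1, 1] <= v_max + eps]          # softened speed bound
        cost += Q * cp.sum_squares(x[k + 1] - y_ref[k])      # CF tracking term
    cost += Ru * cp.sum_squares(u)                           # acceleration (fuel economy) term
    cost += Rdu * cp.sum_squares(u[1:] - u[:-1])             # jerk (comfort) term
    constraints += [u >= u_bounds[0], u <= u_bounds[1],
                    u[1:] - u[:-1] >= du_bounds[0] * dt,
                    u[1:] - u[:-1] <= du_bounds[1] * dt]

    cp.Problem(cp.Minimize(cost), constraints).solve()
    return u.value[0]                        # first acceleration command
```

At each control step, the RL policy would provide a fresh `Q`, the problem would be re-solved with the current state and reference, and only the first acceleration command would be sent to the lower-layer controller.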

4.2 Reinforcement Learning Algorithm

4.2.1 Action Space

The action \(a_{t} \in {\text{A}}\) consists of the CF weight coefficient \(Q\), i.e., \(a_{t} = [Q]\), \(Q \in \left[ {Q_{\min } ,Q_{\max } } \right]\), where \(Q_{\min }\) and \(Q_{\max }\) are the minimum and maximum values of the CF weight, respectively.

4.2.2 State Space

The state \(s_{t} \in S\) consists of the relative velocity and relative distance between the ego vehicle and the preceding vehicle and the longitudinal velocity of the ego vehicle, \(s_{t} = \left[ {R_{v} ,R_{d} ,v_{\text{ego}}} \right]^{{\text{T}}}\).
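For concreteness, the action and state spaces can be declared as continuous boxes in a gym-style interface; the use of `gymnasium` and the numerical bounds below are assumptions for illustration (the paper's environment is built in Carla).

```python
import numpy as np
from gymnasium import spaces  # gym-style space definitions (assumed interface)

Q_MIN, Q_MAX = 0.1, 50.0  # assumed bounds of the CF weight coefficient Q

# Action a_t = [Q]: the single tracking weight handed to the MPC
action_space = spaces.Box(low=np.array([Q_MIN]), high=np.array([Q_MAX]), dtype=np.float32)

# State s_t = [R_v, R_d, v_ego]: relative velocity, relative distance, ego speed
observation_space = spaces.Box(
    low=np.array([-30.0, 0.0, 0.0]),
    high=np.array([30.0, 100.0, 40.0]),
    dtype=np.float32,
)
```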

4.2.3 Reward Space

Four parts are used to define the reward as follows:

$$r(s,a) = r_{{\text{collision }}} + r_{U} + r_{\Delta U} + r_{D} .$$
(4)

Collision reward \(r_{{{\text{collision}}}}\): once a collision happens, the vehicle receives a negative reward \(r_{c}\), where \(r_{c}\) is a constant set to −10.

$$r_{{\text{collision }}} = \left\{ {\begin{array}{*{20}c} 0, & {\text{no collision,}} \\ {r_{c} }, & {{\text{collision}}{. }} \\ \end{array} } \right.$$
(5)

Acceleration reward \(r_{U}\): if the vehicle satisfies the acceleration constraint of the MPC problem, it receives a reward \(r_{U}\), which consists of four cases.

$$r_{U} = \left\{ {\begin{array}{*{20}l} {r_{u1} + k_{1} \left| u \right|,} \hfill & {\text{safety scenario,}} \hfill & {u \in \left[ {u_{\min } ,u_{\max } } \right],} \hfill \\ {r_{u2} + k_{2} u,} \hfill & {\text{high-risk scenario,}} \hfill & {u \in \left[ {u_{\min } ,u_{\max } } \right],} \hfill \\ {r_{u3} ,} \hfill & {\text{low-risk scenario,}} \hfill & {u \in \left[ {u_{\min } ,u_{\max } } \right],} \hfill \\ {r_{u4} ,} \hfill & {} \hfill & {u \notin \left[ {u_{\min } ,u_{\max } } \right].} \hfill \\ \end{array} } \right.$$
(6)
  1. When in the safety scenario, the reward encourages the vehicle to travel at a lower acceleration, where \(r_{u1}\) and \(k_{1}\) are constants set to 30 and −10;

  2. When in the high-risk scenario, the reward encourages the vehicle to decelerate strongly to escape the dangerous situation and penalizes acceleration, where \(r_{u2}\) and \(k_{2}\) are constants set to −10 and 1;

  3. When in the low-risk scenario, the vehicle receives a reward \(r_{u3}\), where \(r_{u3}\) is a constant set to 0;

  4. Once the vehicle does not meet the acceleration constraint of the MPC problem, it receives a negative reward \(r_{u4}\) to penalize the constraint violation, where \(r_{u4}\) is a constant set to −100.

Jerk reward \(r_{{{\Delta U}}}\): if the vehicle does not meet the jerk constraint of the MPC problem, it receives a negative reward \(r_{{{\Delta }u}}\), where \(r_{{{\Delta }u}}\) is a constant set to −30.

$$r_{\Delta U} = \left\{ {\begin{array}{*{20}l} {0,} \hfill & {\Delta u \in \left[ {\Delta u_{\min } ,\Delta u_{\max } } \right],} \hfill \\ {r_{\Delta u} ,} \hfill & {\Delta u \notin \left[ {\Delta u_{\min } ,\Delta u_{\max } } \right].} \hfill \\ \end{array} } \right.$$
(7)

CF reward \(r_{D}\): when the vehicle is in the crisis scenario, it receives a negative reward \(r_{d}\) so that the autonomous vehicle maintains the basic CF task, where \(r_{d}\) is a constant set to −41.

$$r_{D} = \left\{ {\begin{array}{*{20}c} {0,} & {\text{not in the crisis scenario,}} \\ {r_{d} ,} & {{\text{crisis scenario}}{. }} \\ \end{array} } \right.$$
(8)
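The four reward terms of Eqs. (4)–(8) can be combined into a single function, sketched below under the assumption that the RTM classifier of Section 3 provides the scene label; the acceleration and jerk bounds are illustrative, while the constants follow the values quoted above.

```python
def reward(collision, u, du, scene,
           u_bounds=(-4.0, 2.0), du_bounds=(-3.0, 3.0)):
    """Composite reward r = r_collision + r_U + r_dU + r_D (Eqs. (4)-(8)).

    scene is one of 'safety', 'low-risk', 'high-risk', 'crisis'
    (the output of the RTM classifier). The bounds are illustrative assumptions.
    """
    r_c, r_u1, k_1 = -10.0, 30.0, -10.0
    r_u2, k_2, r_u3, r_u4 = -10.0, 1.0, 0.0, -100.0
    r_du, r_d = -30.0, -41.0

    r_collision = r_c if collision else 0.0

    if not (u_bounds[0] <= u <= u_bounds[1]):
        r_U = r_u4                              # acceleration constraint violated
    elif scene == "safety":
        r_U = r_u1 + k_1 * abs(u)               # favour low |acceleration|
    elif scene == "high-risk":
        r_U = r_u2 + k_2 * u                    # favour strong deceleration
    else:
        r_U = r_u3                              # low-risk (and remaining) cases

    r_dU = 0.0 if du_bounds[0] <= du <= du_bounds[1] else r_du
    r_D = r_d if scene == "crisis" else 0.0
    return r_collision + r_U + r_dU + r_D
```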

4.2.4 State Transition Probability

The vehicle in state \(s_{t}\) takes action \(a_{t}\), and the state transitions to \(s_{t + 1}\). The state transition probability is denoted as follows:

$$P_{s}^{a} = P\left[ {s = s_{t + 1} |s = s_{t} ,a = a_{t} } \right].$$
(9)

4.2.5 Value Function

In this paper, we adopt the soft actor-critic (SAC) algorithm, which optimizes a stochastic policy in an off-policy manner to obtain an optimal value function [32].

First, we introduce the definition of entropy. Let x be a random variable with probability mass or density function P. The entropy H of x is computed from its distribution P according to:

$$H(P) = \mathop E\limits_{x \sim P} [ - \log P(x)].$$
(10)

In entropy-regularized RL, the agent receives an additional bonus at each time step proportional to the entropy of the policy at that time step. This changes the RL problem to:

$$\pi^{ * } = \arg \mathop {\max }\limits_{\pi } \mathop E\limits_{\tau \sim \pi } \left[ {\mathop \sum \limits_{t = 0}^{\infty } \gamma^{t} \left( {R\left( {s_{t} ,a_{t} ,s_{t + 1} } \right) + \alpha H\left( {\pi \left( { \cdot |s_{t} } \right)} \right)} \right)} \right].$$
(11)

where \(\alpha\) is the entropy regularization coefficient, which controls the exploration-exploitation trade-off: the larger \(\alpha\), the more exploration is encouraged. \(\gamma\) is the discount factor, \(\gamma \in [0,1]\).

The value function is as follows:

$$V^{\pi } (s) = \mathop E\limits_{\tau \sim \pi } \left[ {\mathop \sum \limits_{t = 0}^{\infty } \gamma^{t} \left( {R\left( {s_{t} ,a_{t} ,s_{t + 1} } \right) + \alpha H\left( {\pi \left( { \cdot |s_{t} } \right)} \right)} \right)|s_{0} = s} \right].$$
(12)

The \(Q\) function corresponds to:

$$\begin{gathered} Q^{\pi } (s,a) = \\ \mathop E\limits_{\tau \sim \pi } \left[ {\mathop \sum \limits_{t = 0}^{\infty } \gamma^{t} R\left( {s_{t} ,a_{t} ,s_{t + 1} } \right) + \alpha \mathop \sum \limits_{t = 1}^{\infty } \gamma^{t} H\left( {\pi \left( { \cdot |s_{t} } \right)} \right)|s_{0} = s,a_{0} = a} \right]. \\ \end{gathered}$$
(13)

Combining the above definitions, we get:

$$V^{\pi } (s) = \mathop E\limits_{a \sim \pi } \left[ {Q^{\pi } (s,a)} \right] + \alpha H(\pi ( \cdot |s)),$$
(14)

and the Bellman equation for \(Q^{\pi } (s,a)\) is:

$$\begin{aligned} Q^{\pi } (s,a) & = \mathop E\limits_{{\begin{array}{*{20}c} {s^{\prime} \sim P} \\ {a^{\prime} \sim \pi } \\ \end{array} }} \left[ {R\left( {s,a,s^{\prime}} \right) + \gamma \left( {Q^{\pi } \left( {s^{\prime},a^{\prime}} \right) + \alpha H\left( {\pi \left( { \cdot |s^{\prime}} \right)} \right)} \right)} \right] \\ & = \mathop E\limits_{s^{\prime} \sim P} \left[ {R\left( {s,a,s^{\prime}} \right) + \gamma V^{\pi } \left( {s^{\prime}} \right)} \right], \\ \end{aligned}$$
(15)

which we rewrite using the definition of entropy:

$$\begin{aligned} Q^{\pi } (s,a) & = \mathop E\limits_{{\begin{array}{*{20}c} {s^{\prime} \sim P} \\ {a^{\prime} \sim \pi } \\ \end{array} }} \left[ {R\left( {s,a,s^{\prime}} \right) + \gamma \left( {Q^{\pi } \left( {s^{\prime},a^{\prime}} \right) + \alpha H\left( {\pi \left( { \cdot |s^{\prime}} \right)} \right)} \right)} \right] \\ & = \mathop E\limits_{{\begin{array}{*{20}c} {s^{\prime} \sim P} \\ {a^{\prime} \sim \pi } \\ \end{array} }} \left[ {R\left( {s,a,s^{\prime}} \right) + \gamma \left( {Q^{\pi } \left( {s^{\prime},a^{\prime}} \right) - \alpha \log \pi \left( {a^{\prime}|s^{\prime}} \right)} \right)} \right]. \\ \end{aligned}$$
(16)

The right-hand side is an expectation over the next state (from the replay buffer) and the next action (from the current policy). Since it is an expectation, we can approximate it with samples:

$$\begin{gathered} Q^{\pi } (s,a) \approx r + \gamma \left( {Q^{\pi } \left( {s^{\prime},\tilde{a}^{^{\prime}} } \right) - \alpha \log \pi \left( {\tilde{a}^{^{\prime}} |s^{\prime}} \right)} \right),\quad \\ \tilde{a}^{^{\prime}} \sim \pi \left( { \cdot |s^{\prime}} \right). \\ \end{gathered}$$
(17)

The loss function of Q-network in SAC is:

$$L\left( {\phi_{i} ,{\mathcal{D}}} \right) = \mathop E\limits_{{\left( {s,a,r,s^{\prime},d} \right) \sim {\mathcal{D}}}} \left[ {\left( {Q_{{\phi_{i} }} (s,a) - y\left( {r,s^{\prime},d} \right)} \right)^{2} } \right],$$
(18)

where the target y is given by:

$$\begin{gathered} y\left( {r,s^{\prime},d} \right) = r + \gamma (1 - d)\left( {\mathop {\min }\limits_{j = 1,2} Q_{{\phi_{{\text{targ}},j} }} \left( {s^{\prime},\tilde{a}^{\prime } } \right) - \alpha \log \pi_{\theta } \left( {\tilde{a}^{\prime } |s^{\prime}} \right)} \right), \\ \tilde{a}^{\prime} \sim \pi_{\theta } \left( { \cdot |s^{\prime}} \right). \\ \end{gathered}$$
(19)
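A minimal PyTorch-style sketch of the clipped-double-Q target in Eq. (19); the interfaces of `policy`, `q1_targ`, and `q2_targ` (and the values of `gamma` and `alpha`) are assumptions for illustration.

```python
import torch

@torch.no_grad()
def sac_q_target(r, s_next, done, q1_targ, q2_targ, policy, gamma=0.99, alpha=0.2):
    """Compute y(r, s', d) from Eq. (19).

    policy(s) is assumed to return (a_tilde, log_pi) sampled from the current
    squashed Gaussian policy; q1_targ and q2_targ are the target Q-networks.
    """
    a_next, logp_next = policy(s_next)
    q_min = torch.min(q1_targ(s_next, a_next), q2_targ(s_next, a_next))
    return r + gamma * (1.0 - done) * (q_min - alpha * logp_next)
```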

The policy should act to maximize the sum of expected future return and expected future entropy in each state, that is, it should maximize \(V^{\pi }\), which we expand as:

$$\begin{gathered} V^{\pi } (s) = \mathop E\limits_{a \sim \pi } \left[ {Q^{\pi } (s,a)} \right] + \alpha H(\pi ( \cdot |s)) \\ = \mathop E\limits_{a \sim \pi } \left[ {Q^{\pi } (s,a) - \alpha \log \pi (a|s)} \right]. \\ \end{gathered}$$
(20)

We optimize the policy using the reparameterization trick, in which a sample from \(\pi_{\theta } ( \cdot |s)\) is computed as a deterministic function of the policy parameters, the state, and independent noise. Specifically, we use a squashed Gaussian policy, so that samples are obtained according to:

$$\tilde{a}_{\theta } (s,\xi ) = \tanh \left( {\mu_{\theta } (s) + \sigma_{\theta } (s) \odot \xi } \right),\quad \xi \sim {\mathcal{N}}(0,I).$$
(21)

The reparameterization trick allows us to rewrite the expectation over actions as an expectation over noise:

$$\begin{aligned} & \mathop E\limits_{{a \sim \pi_{\theta } }} \left[ {Q^{{\pi_{\theta } }} (s,a) - \alpha \log \pi_{\theta } (a|s)} \right] \\ &= \mathop E\limits_{{\xi \sim {\mathcal{N}}}} \left[ {Q^{{\pi_{\theta } }} \left( {s,\tilde{a}_{\theta } (s,\xi )} \right) - \alpha \log \pi_{\theta } \left( {\tilde{a}_{\theta } (s,\xi )|s} \right)} \right]. \\ \end{aligned}$$
(22)

To obtain the policy loss, the final step is to substitute \(Q^{{\pi_{\theta } }}\) with one of the function approximators; the policy is then optimized according to Eq. (23):

$$\mathop {\max }\limits_{\theta } \mathop E\limits_{{\begin{array}{*{20}c} {s \sim {\mathcal{D}}} \\ {\xi \sim {\mathcal{N}}} \\ \end{array} }} \left[ {\mathop {\min }\limits_{j = 1,2} Q_{{\phi_{j} }} \left( {s,\tilde{a}_{\theta } (s,\xi )} \right) - \alpha \log \pi_{\theta } \left( {\tilde{a}_{\theta } (s,\xi )\left| s \right.} \right)} \right].$$
(23)
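Correspondingly, a minimal sketch of the policy update: the loss below is the negative of the objective in Eq. (23), averaged over a batch of states, assuming `policy(s)` returns a reparameterized (squashed Gaussian) action sample together with its log-probability as in Eq. (21).

```python
import torch

def sac_policy_loss(s, policy, q1, q2, alpha=0.2):
    """Negative of the objective in Eq. (23), averaged over a batch of states s.

    policy(s) reparameterizes the action as tanh(mu(s) + sigma(s) * xi),
    xi ~ N(0, I), and returns the sample together with its log-probability.
    """
    a_tilde, logp = policy(s)                     # Eq. (21): squashed Gaussian sample
    q_min = torch.min(q1(s, a_tilde), q2(s, a_tilde))
    return (alpha * logp - q_min).mean()          # minimizing this maximizes Eq. (23)
```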

5 Simulation Results

This section briefly introduces the configuration of the simulation environment and the design of the training process, followed by the demonstration and analysis of the results. An RL environment is built in the open-source Carla simulator to implement and test the proposed strategy. To verify the dynamic scene adaptability of the proposed combined controller, the variable-weight MPC controller is compared with a fixed-weight MPC controller under three different operating conditions. The scenarios involved in the training process cover the four risk levels defined above. The initial conditions and parameters used in the simulation are shown in Table 3.

Table 3 Main simulation parameters

The RL algorithm is trained with random seeds for 120000 iterations per evaluation, and Figure 4 shows the cumulative average reward \(\overline{r}\) per step during the evaluation period. The maximum possible reward per step is 30, and the reward decreases when the agent deviates from the expected acceleration and jerk. The results show that after 50000 training steps the vehicle learns to perform better, and as training continues the average reward \(\overline{r}\) keeps increasing until around 100000 steps. During training, an episode ends when (1) a collision occurs, (2) the relative distance is greater than the maximum CF distance, (3) the CF task ends, or (4) the maximum of 750 time steps per episode is reached.
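The four episode-termination conditions can be folded into one check, as in the sketch below; the maximum CF distance and step limit follow the values quoted in the text, while the function interface is an illustrative assumption.

```python
def episode_done(collision, rel_dist, cf_task_finished, step,
                 d_fellow_max=60.0, max_steps=750):
    """End the training episode under any of the four conditions in the text."""
    return (collision
            or rel_dist > d_fellow_max
            or cf_task_finished
            or step >= max_steps)
```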

Figure 4  Learning curve

Then, the test conditions are defined. The initial speed of the ego vehicle is set to 0 and its initial position is fixed. The initial speed of the preceding vehicle is set randomly, and it successively experiences three driving phases of acceleration, steady state, and deceleration; the specific process is shown in Table 4. The \(D_{{{\text{fellow}}}}\) in this paper is 60 m. To test the proposed CF algorithm extensively, three working conditions with initial relative distances of 50 m, 60 m, and 70 m are set. The results are shown in Figures 5, 6, 7, 8, 9, 10 and 11, where the black curves show the following results of the fixed-weight MPC controller and the red curves show the results of the variable-weight MPC controller with dynamic scene adaptability.

Table 4 Driving process of the preceding vehicle
Figure 5  Relative distance

It is verified that, through the adaptive adjustment, this strategy keeps the driving process essentially within the safety scenario regardless of whether the initial relative distance is greater than \(D_{{{\text{fellow}}}}\). Even during the emergency deceleration of the preceding vehicle, the ego vehicle falls into the dangerous following scenario only briefly and then immediately escapes it, as shown in Figure 5. The adjustment curve of the weight \(Q\) is shown in Figure 6; its increase or decrease is adjusted by judging the state of the environment. When the vehicle maintains the safe following distance, the weight \(Q\) decreases to improve comfort and energy saving. As the possibility of falling into the following crisis scenario increases when the preceding vehicle accelerates, the value of \(Q\) gradually increases to keep the following distance within the desired range. Moreover, the larger the initial relative distance, the more likely the CF task is to fail, so the growth trend of \(Q\) is positively correlated with the initial relative distance, as shown in Figure 7. In the low-risk and high-risk scenarios, the vehicle with this strategy rapidly increases its distance to the preceding vehicle by adjusting the weight \(Q\), effectively avoiding collisions and demonstrating good safety.

Figure 6  Weight coefficient \(Q\)

Figure 7  Variation of the weight coefficient \(Q\) under different initial distances

Compared with the vehicle using the conventional MPC controller, the vehicle under this strategy maintains a smaller relative velocity to the preceding vehicle most of the time. Since relative velocity is one of the causes of driver sight-distance jitter, reducing it is necessary. In addition, the reduction of relative velocity has little effect on the vehicle velocity and always satisfies the controller constraints. That is, the strategy improves comfort while ensuring traffic efficiency, as shown in Figures 8 and 9.

Figure 8  Relative velocity

Figure 9  Vehicle velocity

In addition, as shown in Figures 10 and 11, the acceleration and jerk curves show that the adaptive adjustment strategy tends to follow the preceding vehicle with smaller acceleration in the safety scenario. This gives the vehicle the opportunity to pursue higher ride comfort and fuel economy while maintaining safety, and it is in line with human habits and expectations, demonstrating its intelligence. When the vehicle falls into a high-risk scenario due to sudden changes in the environment, the strategy performs better in terms of state prediction and response speed: the vehicle reacts faster and decelerates more strongly while still satisfying the controller constraints, that is, it has good emergency handling capability and avoids collisions in a timely and effective manner. This indicates that the strategy adapts well to dynamic scenarios, which helps to improve vehicle safety.

Figure 10  Vehicle acceleration

Figure 11  Vehicle jerk

The training results show that the variable-weight controller guided by the proposed combined strategy is able to adjust the vehicle performance priorities in dynamic scenarios. Compared with the traditional fixed-weight controller, the proposed controller adapts and performs better: in the crisis scenario it responds quickly and accelerates to keep up with the preceding vehicle; the probability of following in the safety scenario is higher, and ride comfort and fuel economy are effectively improved through weight adjustment; in the dangerous (low-risk and high-risk) scenarios, the emergency response ability is better and collisions are avoided by decelerating in time.

6 Conclusions

  (1) The results of this paper verify the effectiveness of combining RL with a traditional MPC controller to construct an adaptive controller for autonomous vehicles driving in dynamic scenes.

  (2) The effectiveness of the proposed method was verified in the Carla environment with a high-fidelity vehicle model as the controlled object. The results show that, compared with the traditional controller, the autonomous vehicle using this method brakes quickly in dangerous scenarios for the sake of safety and has more stable acceleration, better ride comfort, and better fuel economy. Accordingly, the ability of the autonomous vehicle to trade off safety, comfort, and energy efficiency is significantly improved.

  (3) A risk threshold model is developed to classify scenes based on feature information and to guide the design of the RL reward function, which helps to accelerate the convergence of RL and increases the probability of finding a better solution.

  (4) The study finds that the adjustment of the MPC weight coefficients has a direct impact on vehicle performance, and the adjustment of the tracking weight \(Q\) and its effect are described. To enhance the controller's scene adaptability, the design of the adjustment rules must fully consider an appropriate scene risk assessment method and reasonable safe following and stopping distances that meet the mechanical requirements.

There are still problems to be further studied in the combined RL and MPC strategy for autonomous driving. In the future, we plan to extend the adaptive adjustment method to vehicle lane changing in complex environments.