1 Introduction

Owing to the current limitations of fully autonomous driving technology, keeping human drivers in the vehicle control loop remains the mainstream practice [1]. Against this background, a new technical architecture called human-machine collaborative driving has emerged. Through close collaboration in vehicle motion planning and control, the integrated system can benefit from the hybrid intelligence of the human and the machine [2].

To achieve such coordination, traditional methods for driving authority allocation include constructed functions, model predictive control, fuzzy systems, and game theory [3]. Recently, with the development of machine learning and neural networks, several researchers have attempted to apply such techniques to the co-driving problem. For example, [4] proposed a shared steering control framework based on various RL methods to achieve flexible and efficient path following. In [5], a lane-change decision-making strategy was developed with deep Q-learning, in which the driving risk was evaluated beforehand by probabilistic models. Nevertheless, few of the existing works pay attention to the driver's control freedom or the realization of individual driver preferences, which limits the contribution of human intelligence to the hybrid system.

A human-centered collaborative driving paradigm adheres to the minimal intervention principle [6], which means the machine partner intervenes only when necessary. Otherwise, the human driver is allowed to act freely, for example by choosing the desired path and speed, under the premise of safety. This improves driving flexibility in the face of ambiguous environments and facilitates user acceptance of the assistance system. Several human-centered shared control schemes have been elaborated in the robotics domain, such as [7] and [8], but work targeting ground vehicles remains rare. To this end, the main contributions of this paper are summarized as follows:

  • A novel human-centered collaborative driving scheme is proposed, which is the first effort to achieve human-machine coordination in the integrated decision-making and control stages with reinforcement learning (RL).

  • Two extended RL agents adapted for the collaborative driving task are devised and validated in a challenging obstacle avoidance scenario, which provides directions for structural optimization and training acceleration.

2 Driver-Vehicle System Modeling

Vehicle Modeling. The lateral dynamic characteristics of a vehicle can be represented by a 2-DOF bicycle model:

$$\begin{aligned} \dot{\beta }=\frac{2}{Mu}\left[ C_f\delta -(C_f+C_r)\beta +\frac{-aC_f+bC_r}{u}r\right] -r \end{aligned}$$
(1a)
$$\begin{aligned} \dot{r}=\frac{2}{I_z}\left[ aC_f\delta -(aC_f-bC_r)\beta -\frac{a^2C_f+b^2C_r}{u}r\right] \end{aligned}$$
(1b)

where M is the vehicle mass; \(I_z\) is the yaw inertia; a and b are the distances from the center of gravity to the front and rear axles; \(C_f\) and \(C_r\) are the cornering stiffnesses of the front and rear tires; \(\beta \) is the sideslip angle; r is the yaw rate; u is the longitudinal speed; and the control input is the front steering angle \(\delta \). Note that the vehicle model is only used for environmental simulation during training and validation, since the RL-based method is model-free. Therefore, model uncertainty will not degrade the performance of the controller.
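As a concrete illustration, the following Python sketch integrates Eq. (1) with a forward-Euler step; the parameter values are placeholders for simulation only, not the ones used in this paper.

```python
import numpy as np

# Illustrative vehicle parameters (assumed values, not from the paper)
M, Iz = 1500.0, 2500.0        # mass [kg], yaw inertia [kg*m^2]
a, b = 1.2, 1.4               # CG-to-front/rear-axle distances [m]
Cf, Cr = 60000.0, 55000.0     # front/rear cornering stiffness [N/rad]

def lateral_dynamics(beta, r, u, delta):
    """Right-hand side of Eq. (1a)-(1b)."""
    beta_dot = (2.0 / (M * u)) * (Cf * delta - (Cf + Cr) * beta
                                  + (-a * Cf + b * Cr) / u * r) - r
    r_dot = (2.0 / Iz) * (a * Cf * delta - (a * Cf - b * Cr) * beta
                          - (a ** 2 * Cf + b ** 2 * Cr) / u * r)
    return beta_dot, r_dot

def euler_step(beta, r, u, delta, dt=0.01):
    """Advance the 2-DOF lateral states by one simulation step."""
    beta_dot, r_dot = lateral_dynamics(beta, r, u, delta)
    return beta + beta_dot * dt, r + r_dot * dt
```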

Driver Modeling. The optimal preview model is utilized to describe the driver steering control behavior during the path-following process. At an instant \(t_0\), the optimal steering angle can be obtained by:

$$\begin{aligned} \delta _d^*=\frac{2(a+b)}{d^2}\left[ f(t_0+\frac{d}{u})-y(t_0)-\frac{\dot{y}(t_0)d}{u}\right] \end{aligned}$$
(2)

where d is the preview distance; f is the reference path; y and \(\dot{y}\) are the lateral displacement and lateral velocity, respectively. Likewise, the driver model is not required as a priori knowledge for the controller design; it only plays a role in the interaction with the RL agent.
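A minimal sketch of the preview law in Eq. (2) is given below; the reference path f is assumed to be a callable that returns the desired lateral position at a given time, and the axle distances are reused from the illustrative vehicle parameters above.

```python
def preview_steering(f, t0, y, y_dot, u, d, a=1.2, b=1.4):
    """Driver's optimal steering angle at time t0 according to Eq. (2).
    f: reference path (callable), y/y_dot: lateral displacement/velocity,
    u: longitudinal speed, d: preview distance, a + b: wheelbase."""
    lateral_error = f(t0 + d / u) - y - y_dot * d / u
    return 2.0 * (a + b) / d ** 2 * lateral_error
```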

3 Reinforcement Learning Approach

The collaborative driving control can be considered as a Markov decision process (MDP), denoted by a tuple \((\mathcal {S},\mathcal {A},P,R,\gamma )\) composed of states \(\mathcal {S}\), actions \(\mathcal {A}\), transitions P, rewards R and discount factor \(\gamma \in [0,1]\). An optimal policy \(\pi ^*\) that maximizes the expected discounted return can be found through a training process of interaction with the external environment. The policy is usually represented by a parametric neural network.

Observation. The agent observation in this paper consists of the driver action \(a_H=[\delta _d, \dot{\delta }_d]\) and the environment states \(o_E\). The environment states contain the positions and states of both the ego vehicle and its surroundings, including information on lanes, road boundaries and obstacles.
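For illustration, one plausible way to assemble this observation vector is sketched below; the exact ordering and contents of \(o_E\) are assumptions based on the description above.

```python
import numpy as np

def build_observation(delta_d, delta_d_dot, ego_state, lane_info, obstacle_info):
    """Stack the driver action a_H and the environment states o_E
    into a single flat observation vector for the agent."""
    a_H = np.array([delta_d, delta_d_dot])
    o_E = np.concatenate([ego_state, lane_info, obstacle_info])
    return np.concatenate([a_H, o_E])
```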

Action. Two alternative action spaces are considered here. Given the end-to-end capability of neural networks, the agent output can be defined as either the steering angular velocity or the target lateral displacement of the ego vehicle. For the latter, a low-level Stanley controller [9] is employed to compute the executable front-wheel steering angle.
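A Stanley-style tracker in the spirit of [9] can be sketched as follows; the gain and saturation limit are illustrative assumptions, and the target lateral displacement is assumed to be encoded as a cross-track error with respect to the target line.

```python
import numpy as np

def stanley_steering(heading_error, crosstrack_error, u, k=1.0, delta_max=0.6):
    """Convert the tracking errors w.r.t. the target lateral displacement
    into an executable front-wheel steering angle."""
    delta = heading_error + np.arctan2(k * crosstrack_error, u)
    return float(np.clip(delta, -delta_max, delta_max))
```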

Reward. The step reward is comprised of human reward \(r_H\) and environmental reward \(r_E\), which is given by:

$$\begin{aligned} r_H=k_1e^{-\sigma _1(\delta -\delta _d)^2} \end{aligned}$$
(3a)
$$\begin{aligned} r_E=k_2e^{-\sigma _2{d_c}^2}-k_3e^{-\sigma _3{d_o}^2} \end{aligned}$$
(3b)

where \(d_c\) is the offset of the ego vehicle from the lane centerline; \(d_o\) is the distance to the nearest obstacle; \(k_1\), \(k_2\), \(k_3\) are weighting coefficients; and \(\sigma _1\), \(\sigma _2\), \(\sigma _3\) are adjustable exponential shaping coefficients.
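The step reward of Eq. (3) can then be computed as in the sketch below; the weighting and shaping coefficients are placeholders rather than the tuned values of this paper.

```python
import numpy as np

K1, K2, K3 = 1.0, 1.0, 1.0               # weighting coefficients (assumed)
SIGMA1, SIGMA2, SIGMA3 = 10.0, 0.5, 0.5  # shaping coefficients (assumed)

def step_reward(delta, delta_d, d_c, d_o):
    """Human reward r_H (Eq. 3a) plus environmental reward r_E (Eq. 3b)."""
    r_H = K1 * np.exp(-SIGMA1 * (delta - delta_d) ** 2)
    r_E = K2 * np.exp(-SIGMA2 * d_c ** 2) - K3 * np.exp(-SIGMA3 * d_o ** 2)
    return r_H + r_E
```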

Policy Gradient. To obtain the optimal policy, twin delayed deep deterministic policy gradient (TD3) is adopted in this paper. TD3 establishes two Q-function networks \(Q_{\theta _1}\), \(Q_{\theta _2}\) as the critic and a deterministic policy network \(\pi _{\theta }\) as the actor, which is updated by the policy gradient:

$$\begin{aligned} \nabla _{\theta }J(\pi )=\frac{1}{N}\sum \nabla _a Q_{\theta _1}(s,a)\mid _{a=\pi _{\theta }(s)}\nabla _{\theta }\pi _{\theta }(s) \end{aligned}$$
(4)

where J is the return function and N is the number of transitions in a mini-batch. Figure 1 shows the overall framework of the proposed collaborative driving scheme.

Fig. 1. An overview of the collaborative driving control loop using RL.
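For reference, a hedged PyTorch sketch of the actor update corresponding to Eq. (4) is given below; the `actor`, `critic1`, and `replay_buffer` objects are assumed to follow a standard TD3 implementation, and autograd evaluates the chain rule in Eq. (4) implicitly.

```python
import torch

def update_actor(actor, critic1, actor_optimizer, replay_buffer, batch_size=256):
    """One deterministic policy gradient step (Eq. 4): ascend Q_theta1
    by minimizing its negative mean over a sampled mini-batch."""
    states, _, _, _, _ = replay_buffer.sample(batch_size)  # assumed buffer API
    actor_loss = -critic1(states, actor(states)).mean()
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()
    return actor_loss.item()
```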

Besides vanilla TD3 (where the agent output is the direct steering action), two extended versions are also developed and investigated in this paper: TD3-SC and TD3-SF. Both use the target lateral displacement as the action space, followed by a low-level Stanley tracker. The difference is that the TD3-SC agent is trained with variable episode lengths, meaning an episode ends when a collision occurs, while the TD3-SF agent is trained with fixed episode lengths (a collision does not abort the episode). The performance of the three agents is compared and discussed in the next section.
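The only training-loop difference between the two extended agents can be outlined as follows, assuming a simplified environment API that reports collisions; this is an illustrative sketch rather than the actual implementation.

```python
def run_episode(env, agent, max_steps, fixed_length):
    """fixed_length=False -> TD3-SC (collision aborts the episode);
    fixed_length=True  -> TD3-SF (episode always runs max_steps)."""
    obs = env.reset()
    for t in range(max_steps):
        action = agent.act(obs)
        next_obs, reward, collided = env.step(action)  # assumed env interface
        done = (t == max_steps - 1) or (collided and not fixed_length)
        agent.store(obs, action, reward, next_obs, done)
        obs = next_obs
        if done:
            break
```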

4 Validation

Fig. 2. The testing environment of the collaborative driving scheme.

The collaborative driving agents are trained and validated in the scenario shown in Fig. 2. As illustrated by the diagram, there are two feasible paths for the vehicle to bypass the obstacles; these serve as reference trajectories for human drivers. In addition, a straight-line path, representing the case in which the driver takes no action against the oncoming obstacle, is also included. The simulated driver randomly chooses one of the three reference paths at the start of each episode. It is expected that, with the assistance of the collaborative control, the vehicle travels along the desired path consistent with the driver's intention when there is no risk of collision, and actively steers to avoid the obstacle in case of danger (Fig. 4).
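The per-episode path selection described above can be expressed with a one-line sampler; the path identifiers are hypothetical labels for the two bypass paths and the straight-line path.

```python
import random

REFERENCE_PATHS = ["bypass_left", "bypass_right", "straight"]  # hypothetical labels

def sample_reference_path():
    """Randomly pick the driver's reference path at the start of an episode."""
    return random.choice(REFERENCE_PATHS)
```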

Table 1. Comparison of the training time for the three agents.
Fig. 3. Trajectory plots for the driver's various reference choices.

Fig. 4. Action plots for the driver's various reference choices.

As depicted in Fig. 3, for paths ① and ②, all three agents can follow the reference and make correct steering decisions that align with the driver's intention. However, the trajectories of vanilla TD3 deviate more than the others, owing to arbitrary variations and jitter in its steering angle output, which is plotted in Fig. 4. For path ③, all agents successfully bypass the obstacle, but some choose to turn left while others turn right. This is interpretable: because of the symmetry of the field, the agents stochastically develop their own avoidance strategies. In addition, Table 1 lists the total training time for each of the three agents to converge. TD3-SF clearly achieves the highest training efficiency as well as the best path-following accuracy among the three agents.

5 Conclusion

In this paper, a novel human-centered collaborative steering strategy based on RL is proposed and validated in an obstacle-avoidance driving scenario. The results show that the RL-based controller can effectively decode the driver's intention from the steering behavior and correct risky actions to enhance driving safety. In the comparison of the three agents, changing the agent output to a target lateral displacement improves the stability of the steering control, while the fixed-step training discipline greatly increases the convergence speed. Future work includes exploring more advanced RL algorithms such as soft actor-critic (SAC) and conducting additional driver-in-the-loop experiments with real human drivers.