
1 Introduction and Related Work

The ability to control a vehicle in extreme driving conditions, such as the powerslide, where large vehicle sideslip angles, large traction forces and large negative steering angles occur, is of great interest to the automotive industry. Since the powerslide is an unstable motion, [4] addresses the observability and controllability characteristics for both rear-wheel drive (RWD) and all-wheel drive (AWD) vehicles. Depending on the drive concept, the powerslide can be stabilised in different ways. In recent years, Reinforcement Learning (RL), a data-driven control approach, has become increasingly popular and is employed in this work to stabilise the powerslide. In [1] and [9], RL is used to control the steering wheel angle and the drive train in simulation, while in [2], RL-based controllers are successfully tested on radio-controlled (RC) model cars in the real world. The development of battery electric vehicles (BEVs) and the possibility of using individual electric motors on each axle open up new control strategies. While previous works considered autonomous drifting by controlling both the vehicle’s drive train and the steering system, [3] proposes a linear controller that controls the front and rear axle torques in the presence of a human driver. This controller shows great performance in simulation; however, when tested in the real world, it requires additional application effort to handle specific road conditions.

In this work, a novel RL-based controller is proposed for an AWD BEV that controls the individually driven front and rear axles with a human driver in the loop. Moreover, the robustness of the proposed controller with respect to steering disturbances from the driver and sudden changes in road friction is analysed. Further, the controller is integrated into a test vehicle and its control performance is demonstrated in a real-world test case.

The remainder of this paper is structured as follows. In Sect. 2, the general problem setup, the vehicle model and the RL problem are introduced, while in Sect. 3, the test configuration and the results in simulation and in the real world are presented. Section 4 gives a brief summary and an outlook.

2 Problem Formulation

This section introduces the vehicle and driver model used in the simulation environment, and the general methodology of RL.

Vehicle and Driver Model. The nonlinear two-wheel vehicle model in Fig. 1 at time step \(t \in \mathbb {N}_{0}\)

$$\begin{aligned} \boldsymbol{x}_{t+1} = {\textbf {f}}_{\text {car}}\left( \boldsymbol{x}_t, \boldsymbol{u}_t \right) , \end{aligned}$$
(1)

is considered, with vehicle state \(\boldsymbol{x} \in \mathcal {X}\) and input \(\boldsymbol{u} = \left[ \delta , T_{\text {front}} ,T_{\text {rear}} \right] ^T\in \mathcal {U}\), where \(T_{\text {front}}\) and \(T_{\text {rear}}\) denote the front and rear axle torques, respectively. In the presence of a human driver, the steering angle \(\delta \) is defined by the nonlinear driver model [6]

$$\begin{aligned} \delta = {\textbf {f}}_{\text {driver}}\left( \beta , v, e, \varDelta \psi \right) \end{aligned}$$
(2)

based on vehicle sideslip angle \(\beta \), velocity v, lateral deviation e and orientation \(\varDelta \psi \) relative to the desired path.

Fig. 1. Two-wheel vehicle model with independently driven front and rear axles.
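To make the structure of (1) concrete, the following is a minimal sketch of a discrete-time nonlinear single-track model with a reduced state \([\beta , v, \dot{\psi }]\), a simple friction-circle tire model and forward-Euler integration. All parameters and the tire model are illustrative assumptions, not values from the paper, which additionally accounts for the wheel speeds.

```python
import numpy as np

# Illustrative parameters (not from the paper)
M, IZ = 1800.0, 2500.0          # mass in kg, yaw inertia in kg m^2
LF, LR = 1.4, 1.4               # distances CoG to front / rear axle in m
R_WHEEL = 0.35                  # effective wheel radius in m
MU = 0.35                       # road friction coefficient
FZF = FZR = 0.5 * M * 9.81      # static axle loads in N
C_ALPHA = 80000.0               # cornering stiffness in N/rad
DT = 0.01                       # integration step in s

def lateral_force(alpha, fx, fz):
    """Saturating lateral tire force, limited by the friction circle."""
    fy_max = np.sqrt(max((MU * fz) ** 2 - fx ** 2, 1.0))
    return -fy_max * np.tanh(C_ALPHA * alpha / fy_max)

def f_car(x, u):
    """One Euler step of the model: x = [beta, v, yaw_rate], u = [delta, T_front, T_rear]."""
    beta, v, r = x
    delta, t_front, t_rear = u
    fxf, fxr = t_front / R_WHEEL, t_rear / R_WHEEL      # axle drive forces
    alpha_f = np.arctan2(v * np.sin(beta) + LF * r, v * np.cos(beta)) - delta
    alpha_r = np.arctan2(v * np.sin(beta) - LR * r, v * np.cos(beta))
    fyf, fyr = lateral_force(alpha_f, fxf, FZF), lateral_force(alpha_r, fxr, FZR)
    # Forces in the vehicle body frame
    fx_body = fxf * np.cos(delta) - fyf * np.sin(delta) + fxr
    fy_body = fxf * np.sin(delta) + fyf * np.cos(delta) + fyr
    # Dynamics of velocity, sideslip angle and yaw rate
    v_dot = (fx_body * np.cos(beta) + fy_body * np.sin(beta)) / M
    beta_dot = (fy_body * np.cos(beta) - fx_body * np.sin(beta)) / (M * v) - r
    r_dot = (LF * (fxf * np.sin(delta) + fyf * np.cos(delta)) - LR * fyr) / IZ
    return np.array([beta + DT * beta_dot, v + DT * v_dot, r + DT * r_dot])
```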

Control Goal. While the driver (model) in (2) focuses only on path-tracking by applying \(\delta \), the RL-based controller’s task is to initiate and stabilise the powerslide by applying \(T_{\text {front}}\) and \(T_{\text {rear}}\). Although the control tasks are separated, they are expected to interfere with each other.

Reinforcement Learning. In RL, an agent learns a policy based on the interaction with its environment. The agent acts on the environment using control \(\boldsymbol{u}_t \in \mathcal {\hat{U}} = \mathcal {U} \backslash \lbrace \delta \rbrace \) sampled from policy \(\boldsymbol{\pi }_{\theta }(\boldsymbol{u} \vert \boldsymbol{x})\) with policy parameters \(\theta \) based on the current environment state \(\boldsymbol{x}_t\). The agent observes the next environment state \(\boldsymbol{x}_{t+1}\) and receives a reward \(r_{t+1}\) defined by the reward function R associated with the tuple \([\boldsymbol{x}_t, \boldsymbol{u}_t, \boldsymbol{x}_{t+1}]\). A Markov Decision Process (MDP) described by \(\langle \mathcal {X},\mathcal {\hat{U}}, R, {\textbf {f}}_{\text {car}}, \mathcal {X}_0 \rangle \) is assumed, where \(\mathcal {X}_0 \subseteq \mathcal {X}\) denotes the initial state distribution. Starting from an initial state \(\boldsymbol{x}_0\in \mathcal {X}_0\), the MDP forms a trajectory \(\tau \) of states, actions and rewards. The central objective is to find an optimal control policy \(\pi ^*\) that maximises the expected sum of discounted rewards

$$\begin{aligned} \pi ^* = \arg \max _{\pi _{\theta }} \underset{\tau \sim \pi }{\mathbb {E}}\left[ \sum \nolimits _{t=0}^{\infty }\gamma ^tr_t\right] \end{aligned}$$
(3)

with discount factor \(\gamma \in [0,1]\) balancing the present impact of future rewards. To find policy parameters \(\theta \), optimisation problem (3) can be solved using policy gradient methods, e.g. Proximal Policy Optimization (PPO) [7, 8].
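For illustration, the clipped surrogate objective that PPO optimises in place of (3) can be written as a short function; this is a generic sketch of the PPO loss, not code from the paper.

```python
import torch

def ppo_clip_loss(log_prob_new, log_prob_old, advantages, clip_eps=0.2):
    """Negative PPO clipped surrogate objective (to be minimised)."""
    ratio = torch.exp(log_prob_new - log_prob_old)  # pi_theta(u|x) / pi_theta_old(u|x)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```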

Observation Space and Action Space. At each time step, only a subset of the entire environment state is visible to the agent. The observation vector \(\textbf{o} = [\beta \,\, \dot{\psi } \,\, v \,\, \omega _{\text {front}} \,\, \omega _{\text {rear}} \,\, \delta \,\, \beta _{\text {target}} \,\, \beta _{\text {ref}}]^T\) comprises the vehicle sideslip angle \(\beta \), the yaw rate \(\dot{\psi }\), the velocity v, the angular speeds of the front and rear axles, \(\omega _{\text {front}}\) and \(\omega _{\text {rear}}\), and the steering angle \(\delta \). Moreover, the agent receives information about the target steady-state vehicle sideslip angle \(\beta _{\text {target}}\) and the predefined vehicle sideslip angle reference trajectory \(\beta _{\text {ref}}\), which specifies how to reach \(\beta _{\text {target}}\). The vehicle sideslip angle reference is a ramp function with a slope of \(-9^{\circ }/\text {s}\), derived from expert knowledge, converging to \(\beta _{\text {target}}\). While the first six entries in \(\textbf{o}\) reflect sensor information available in the car, the last two entries are required to fulfil the control task. To stabilise the powerslide, the agent can individually control both the front axle torque \(T_{\text {front}}\) and the rear axle torque \(T_{\text {rear}}\); however, only positive torque values are feasible, which excludes braking.
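A minimal sketch of how these spaces could be declared, assuming Gymnasium-style Box spaces; the bounds and the maximum axle torque \(T_{\max }\) are placeholders, since the paper does not report them.

```python
import numpy as np
from gymnasium import spaces

# o = [beta, yaw_rate, v, omega_front, omega_rear, delta, beta_target, beta_ref]
observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(8,), dtype=np.float32)

# The policy outputs two normalised actions for [T_front, T_rear].
action_space = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)

T_MAX = 1500.0  # maximum axle torque in Nm, assumed for illustration

def scale_action(a):
    """Map the clipped policy output in [-1, 1] to non-negative axle torques."""
    a = np.clip(a, -1.0, 1.0)
    return 0.5 * (a + 1.0) * T_MAX  # [T_front, T_rear] >= 0, no braking
```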

Reward Function. During training, the agent tries to learn a policy that maximises the reward function. To simultaneously encourage the agent to stabilise the powerslide and to stay on the circular path,

$$\begin{aligned} R\left( \beta ,e\right) = \sum \nolimits _{i=1}^{2}w_i R_i = w_{\text {slip}}R_{\text {slip}}\left( \beta \right) + w_{\text {path}}R_{\text {path}}\left( e\right) , w_i \in [0,1] \end{aligned}$$
(4)

is chosen as a weighted sum of the vehicle sideslip angle reference tracking reward \(R_{\text {slip}}\) and the path-following reward \(R_{\text {path}}\). The reward terms \(R_i = \exp \left( - c_i \varDelta _i^2\right) \), with the deviation from the respective control target \(\varDelta _i\) and shaping parameter \(c_i\), ensure a positive learning signal.
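A minimal sketch of (4), assuming illustrative values for the weights \(w_i\) and the shaping parameters \(c_i\), which are not reported in the paper:

```python
import numpy as np

W_SLIP, W_PATH = 0.7, 0.3   # assumed weights w_i in [0, 1]
C_SLIP, C_PATH = 5.0, 0.5   # assumed shaping parameters c_i

def reward(beta, beta_ref, e):
    """R(beta, e) = w_slip * R_slip(beta) + w_path * R_path(e), each term in (0, 1]."""
    r_slip = np.exp(-C_SLIP * (beta - beta_ref) ** 2)  # sideslip reference tracking
    r_path = np.exp(-C_PATH * e ** 2)                  # lateral deviation from the path
    return W_SLIP * r_slip + W_PATH * r_path
```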

3 Experiments

The agent is trained in simulation and evaluated both in simulation and in the real world.

Training and Network Architecture. During training, two multilayer perceptron (MLP) networks are trained, one for the policy and one for the value function. Both networks share the same architecture, namely three hidden layers, and use Exponential Linear Unit (ELU) activation functions. Before the observations are passed into the policy network, they are normalised to the range [–1, 1]. The output of the policy network is clipped to the range [–1, 1] and scaled to the admissible torques. To improve exploration during training, generalised state-dependent exploration (gSDE) is used. The learning rate is set to \(2\times 10^{-4}\) and the discount factor to 0.9999. Adam is used to optimise the networks [5]. To accelerate training, 40 environments are run in parallel. A training episode is terminated after 40 s; it is terminated prematurely if the vehicle leaves the track or if the deviation between the current vehicle sideslip angle and the reference exceeds a certain threshold. A rollout buffer size of 409600 and a batch size of 5120 are used.
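This setup could be reproduced approximately with Stable-Baselines3, which provides PPO with gSDE; the paper does not name its implementation, and the hidden layer widths, the training budget and the environment class PowerslideEnv are assumptions for illustration.

```python
import torch
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import SubprocVecEnv, VecNormalize

def make_env():
    return PowerslideEnv()  # hypothetical environment implementing the MDP of Sect. 2

env = VecNormalize(SubprocVecEnv([make_env] * 40),    # 40 parallel environments
                   norm_obs=True, norm_reward=False)  # observation normalisation

model = PPO(
    "MlpPolicy", env,
    learning_rate=2e-4,
    gamma=0.9999,
    n_steps=10240,       # 40 envs x 10240 steps = rollout buffer of 409600
    batch_size=5120,
    use_sde=True,        # generalised state-dependent exploration (gSDE)
    policy_kwargs=dict(activation_fn=torch.nn.ELU,
                       net_arch=dict(pi=[256, 256, 256], vf=[256, 256, 256])),
)
model.learn(total_timesteps=40_000_000)  # training budget assumed
```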

3.1 Testing

The trained controller is tested in both simulation and the real world.

Fig. 2. Simulation of the powerslide with a target vehicle sideslip angle of \(-30^{\circ }\) on a circular path with radius \(R=60\,\text {m}\) and friction coefficient \(\mu = 0.21\).

Simulation. The controller is trained on different configurations of the environment, where each configuration is randomly initialised. In the first example, the controller is evaluated with a target vehicle sideslip angle of \(-30^{\circ }\) on a circular path with radius \(R = 60\,\text {m}\) and friction coefficient \(\mu = 0.21\), see Fig. 2. The vehicle starts with an initial velocity of \(20\,\text {km/h}\) and transitions into a stable powerslide motion following the vehicle sideslip angle reference \(\beta _{\text {ref}}\). Figure 2 shows that the controller successfully learnt to transition the vehicle from regular steady-state cornering into the powerslide and to stabilise the powerslide motion. To initiate the powerslide and to increase the vehicle sideslip angle, a high rear axle torque compared to the front axle torque is applied. Once the target vehicle sideslip angle is reached, the front and rear axle torques converge to a fixed drive torque distribution of \(\gamma = T_{\text {rear}}/(T_{\text {rear}}+T_{\text {front}})=0.84\).

In the second example, the focus is on the controller’s robustness to steering and road friction disturbances. In the first scenario, a disturbance of the steering angle \(\delta \) is considered, while in the second scenario, the road friction \(\mu \) is instantaneously increased (\(\mu _{\uparrow }\)) and decreased (\(\mu _{\downarrow }\)). The steering angle disturbance is represented by a shifted cosine function over a single period of \(0.5\,\text {s}\) with an amplitude of \(1.5^{\circ }\), applied towards the inside (\(\delta _{\uparrow }\)) and the outside (\(\delta _{\downarrow }\)) of the turn. For these evaluations, the environment configuration is adapted to radius \(R=18.5\,\text {m}\) and friction coefficient \(\mu = 0.35\), corresponding to the real-world setting. Figure 3 shows the vehicle sideslip angle trajectories resulting from the steering angle disturbance and the change of the friction value. In both scenarios, the vehicle motion is successfully stabilised by the controller.

Fig. 3. Simulation of the powerslide with a target vehicle sideslip angle of \(-30^{\circ }\) on a circular path with radius \(R=18.5\,\text {m}\) and friction coefficient \(\mu = 0.35\). Robustness is analysed by applying a disturbance to the steering angle (upper plots) and a change of the road friction (lower plots) at time \(t=20\,\text {s}\).
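One plausible reading of the steering disturbance is a raised-cosine pulse that starts and ends at zero and peaks at \(1.5^{\circ }\); the exact parameterisation is not given in the paper, so the following is an illustrative assumption.

```python
import numpy as np

T_PULSE = 0.5                 # single period of the pulse in s
AMPLITUDE = np.deg2rad(1.5)   # peak steering disturbance in rad
T_START = 20.0                # disturbance onset in s

def steering_disturbance(t, sign=1.0):
    """Additive steering disturbance; sign selects inside (+) or outside (-) of the turn."""
    if T_START <= t <= T_START + T_PULSE:
        return sign * 0.5 * AMPLITUDE * (1.0 - np.cos(2.0 * np.pi * (t - T_START) / T_PULSE))
    return 0.0
```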

Real World. The controller is deployed on a conventional, consumer-grade computer, which is connected to the vehicle’s embedded hardware. The test vehicle is a series production electric sports car. Vehicle sensor data and computed controls are exchanged between the computer and the vehicle via the XCP protocol using prototype hardware. The controller runs cyclically with a sampling frequency of 100 Hz. For the control task, only built-in sensor signals of the vehicle are used, except for the vehicle sideslip angle, which is provided by an additional inertial measurement unit (IMU) mounted in the car. The measurements are collected on a watered circuit with \(18.5\,\text {m}\) radius and an estimated friction coefficient of 0.35. In the experiment, the vehicle starts in regular steady-state cornering with an initial speed of \(20\,\text {km/h}\) and, after time \(t=5\,\text {s}\), transitions to the powerslide with a target vehicle sideslip angle of \(-30^{\circ }\), see Fig. 4. The controller stabilises the powerslide motion; however, oscillations of the vehicle sideslip angle are present. These could be due to driver influence, latencies or sensor noise in the control loop.

Fig. 4. Measurements of the powerslide with a target vehicle sideslip angle of \(-30^{\circ }\) on a circular path with \(18.5\,\text {m}\) radius and an estimated friction coefficient of 0.35. Vehicle sideslip angles \(\beta _1\) and \(\beta _2\) were recorded with different drivers.
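A hedged sketch of the 100 Hz cyclic execution on the host computer; read_vehicle_signals and send_axle_torques are hypothetical placeholders for the XCP-based signal exchange, and the policy is assumed to expose a deterministic predict method.

```python
import time

CYCLE_TIME = 0.01  # 100 Hz sampling frequency

def control_loop(policy, normalise_obs, scale_action):
    next_deadline = time.perf_counter()
    while True:
        o = read_vehicle_signals()          # hypothetical: beta (IMU), yaw rate, v, wheel speeds, delta
        a, _ = policy.predict(normalise_obs(o), deterministic=True)
        t_front, t_rear = scale_action(a)   # map [-1, 1] outputs to axle torques
        send_axle_torques(t_front, t_rear)  # hypothetical XCP write to the embedded hardware
        next_deadline += CYCLE_TIME
        time.sleep(max(0.0, next_deadline - time.perf_counter()))
```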

4 Summary and Outlook

In this paper, an RL-based controller is developed to stabilise the powerslide of a vehicle with a human driver in charge of steering only. This is achieved by controlling the front and rear axle torques of an AWD BEV. The control performance of the powerslide controller on a circular path is demonstrated both in simulation and in the real world. For the latter case, the controller is tested in a series production electric sports car. The experiments clearly show that the proposed controller reacts appropriately to steering disturbances and instantaneous changes in the friction coefficient, revealing its robustness. Moreover, the tests prove that the controller, which was trained exclusively in simulation, is also capable of stabilising the powerslide motion in a real-world application. This indicates the capability of RL controllers to bridge the simulation-to-reality gap, since the controller has to deal with unmodelled real-world phenomena.

Future work should investigate the performance of RL-based controllers across a broader range of drivers, friction coefficients, and vehicle platforms.