1 Introduction

1.1 Hybrid Commercial Vehicles

Hybrid commercial vehicles with a hydrogen Fuel Cell System (FCS) are powered by a combination of a fuel cell system and an electric propulsion system. We consider hybrid vehicles in which both power sources can drive the vehicle independently and simultaneously, and in which the battery of the electric motor is charged through recuperation. Most of the power requirement is met by the FCS. Such hybrid systems are called series-parallel hybrid systems. Along with the FCS, electric motor and battery pack, this type of vehicle contains a power split device that manages the distribution of power between the FCS and the electric motor. The power split device runs on a logic that splits the total required power at a given instant between the FCS and the electric motor. It also charges the battery from recuperation when the total required power becomes negative. We use deep reinforcement learning to derive the logic that splits the required power between these two power sources.

1.2 Reinforcement Learning

Reinforcement Learning (RL) is a machine learning technique where an agent learns to make decisions in an environment to maximize returns based on specific objectives. The environment is a bounded system that changes its state in response to the agent’s actions. An agent, an external entity, observes the system’s state and influences it to achieve and maintain a favorable state. Therefore, the agent’s objective is to learn the action-state dynamics and control the system with actions that result in the desired system state.

Rewards are defined to meet the system’s output objectives and are calculated based on the observed state of the system. During the learning phase, the agent observes the system’s state, performs actions to change it, and evaluates the new state by calculating the return. The return consists of the immediate reward and the discounted future reward at the new state. Over time, the agent adjusts its actions based on the returns to transition from the current state to a favorable state.

Artificial neural networks, which can learn complex non-linear relationships between variables, are used to represent the agent in deep reinforcement learning. These networks serve as universal function approximators, enabling the agent to learn and adapt effectively. The objective of the agent is to learn a policy \(\pi(a|s,\theta)\), a mapping from state \(s\) to a probability distribution over actions \(a\), parameterized by the weights \(\theta\) of a neural network, that maximizes the expected return \(J(\theta) = \mathbb{E}_{\pi}[G_t]\). The return is \(G_t = \sum_{k=0}^{\infty}\gamma^k R_{t+k+1}\), where \(\gamma\) is the discount factor for future rewards and \(R_{t+k+1}\) is the reward at time step \(t+k+1\). The policy is improved over iterations by updating \(\theta\) along the gradient of the expected return, \(\nabla_{\theta}J(\theta) = \mathbb{E}_{\pi}\bigl[\nabla_{\theta}\log\pi(a|s,\theta)\,Q^{\pi}(s,a)\bigr]\).
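To make these quantities concrete, the short sketch below (a minimal illustration in PyTorch, not part of the training pipeline described later) computes the discounted return \(G_t\) for one recorded episode and forms the REINFORCE-style loss \(-\log\pi(a|s,\theta)\,G_t\), in which the sampled return stands in for \(Q^{\pi}(s,a)\) in the gradient expression above; the toy network dimensions and the episode data are placeholders.

```python
import torch

def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for every step of one episode."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    return torch.tensor(returns)

# Toy policy network: maps a 4-dimensional state to logits over 2 discrete actions.
policy = torch.nn.Sequential(torch.nn.Linear(4, 32), torch.nn.Tanh(), torch.nn.Linear(32, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# One recorded episode (random placeholders for states, actions and rewards).
states = torch.randn(5, 4)
actions = torch.randint(0, 2, (5,))
rewards = [0.1, 0.0, -0.2, 0.3, 1.0]

log_probs = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
loss = -(log_probs * discounted_returns(rewards)).mean()  # REINFORCE objective

optimizer.zero_grad()
loss.backward()   # gradient estimate of -grad_theta J(theta)
optimizer.step()
```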

2 Related Work

Ferrara et al. [2] used quadratic optimization for optimal power split in hybrid commercial vehicles. Liessner et al. [6] addressed the power split problem using RL with the Deep Deterministic Policy Gradient (DDPG) [7] algorithm. DDPG was a breakthrough for off-policy RL in continuous action spaces, but it is sensitive to hyperparameters [3]. Manio et al. [8] used Q-learning to address the problem, explicitly including SoC conservation in the reward function. Hu et al. [4] use deep Q-learning with a reward function discretized on the value of SoC; Q-learning, however, suffers from overestimation bias. We use the Soft Actor Critic (SAC) [3] algorithm to learn the optimal power split. SAC is robust to hyperparameters and converges quickly in high-dimensional control problems; it reduces hyperparameter sensitivity by incorporating a policy entropy term in the policy update, thereby encouraging exploration.

3 Training Architecture

The objective of this work is to develop a model that optimally splits the total required power between the electric motor and the FCS. An optimal power split minimizes hydrogen fuel consumption, prevents battery drain, and charges the battery when the wheels do not require power (e.g. during downhill travel). The model considers look-ahead information on the next downhill segment and the total battery charge during that segment. The architecture consists of two main components: the vehicle model and the reinforcement learning (RL) module. The RL module receives observations from the vehicle model at each time instance and outputs the power split ratio between the FCS and the electric motor. Based on this ratio, the vehicle model calculates the power from the FCS and the electric motor at every time point of the trip (Fig. 1).

Fig. 1. Training architecture. The agent is a deep reinforcement learning algorithm that takes observations from the vehicle model as inputs and outputs the action. The environment sends a step/reset command along with the action to the vehicle model. The vehicle model integrates the next step's power requirement and the current power split observations based on the action and sends them back to the environment. The environment checks for termination/truncation of the episode based on the observations from the vehicle model. The environment then calculates the reward, which is used to update the agent's network. Additionally, the environment normalizes the observations and feeds them back into the network.

3.1 Vehicle

The vehicle model consists of two sub-modules: vehicle configuration and road data. The vehicle configuration comprises vehicle mass, FCS configuration, battery configuration, auxiliary power and vehicle dynamics, including acceleration, deceleration, velocity, traction power and driving resistance. The road data consist of road slope, curvature, speed limits, etc., captured using on-vehicle sensors while driving along 8 different routes in Austria. The driving mode is a variable in the range (0,1) representing economy to aggressive driving. From the road data, vehicle dynamics and driving mode, the required power for the vehicle at each instant of time is calculated. When the RL module sends the power split ratio as an action through the environment module, the vehicle model calculates the power from the FCS as (total power * power split ratio) and the power from the electric motor as (total power * (1 - power split ratio)). The SoC expenditure from the battery for the electric motor power is calculated according to the battery configuration and dynamics. The remaining SoC is then the difference between the current SoC and this SoC expenditure. The FCS fuel consumption and efficiency are then determined from the FCS model. When the total power becomes negative, the battery is charged and the FCS power is kept at zero.
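To illustrate the per-step arithmetic, the following simplified sketch (not the actual vehicle model; the battery capacity, charging efficiency and SoC clamping are assumptions made for this example) computes the FCS and electric motor power from the split ratio and updates the SoC, charging the battery when the total required power is negative.

```python
def power_split_step(p_total_kw, split_ratio, soc, dt_s=1.0,
                     battery_capacity_kwh=100.0, charge_efficiency=0.95):
    """One time step of the simplified power split logic.

    p_total_kw  : total required power at the wheels (negative while coasting downhill)
    split_ratio : action from the RL module, fraction of power taken from the FCS, in [0, 1]
    soc         : current battery state of charge, in [0, 1]
    """
    if p_total_kw >= 0.0:
        p_fcs = p_total_kw * split_ratio
        p_motor = p_total_kw * (1.0 - split_ratio)
        # Energy drawn from the battery over this step, expressed as a fraction of capacity.
        soc = soc - (p_motor * dt_s / 3600.0) / battery_capacity_kwh
    else:
        # Recuperation: the FCS is off and the negative wheel power charges the battery.
        p_fcs = 0.0
        p_motor = p_total_kw
        soc = soc + (-p_total_kw * dt_s / 3600.0) * charge_efficiency / battery_capacity_kwh
    return p_fcs, p_motor, min(max(soc, 0.0), 1.0)

# Example: 120 kW requested at the wheels, split 50/50, starting from SoC 0.5.
print(power_split_step(120.0, 0.5, 0.5))
```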

3.2 Deep RL Module

This is an episodic task that involves finding the optimal power split at every time point of the truck's trip.

Observation Space: The observation space consists of the total required power at the wheels at the current time instant, the current battery SoC, the altitude and curvature of the road, the desired velocity, the number of time steps to the episode end, the number of time steps until the next downhill descent, and the total SoC charge in the next descent. The total required power is a continuous value in kW and is derived from the vehicle and route data. The battery SoC is also a continuous value ranging from 0 to 1.

Action Space: The action space is the power split ratio between the FCS and the electric motor, which is a continuous value in the range [0,1].
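For illustration, the observation and action spaces could be declared with the Gymnasium API roughly as follows; this is a hedged sketch, not the original implementation, and the class name, observation ordering and bounds are assumptions.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class PowerSplitEnv(gym.Env):
    """Hypothetical Gymnasium wrapper around the vehicle model (illustrative only)."""

    def __init__(self):
        # Eight observations: total required power, battery SoC, altitude, curvature,
        # desired velocity, steps to episode end, steps to next descent,
        # total SoC charge in the next descent. Bounds are placeholders, not vehicle data.
        low = np.array([-1000.0, 0.0, 0.0, -1.0, 0.0, 0.0, 0.0, 0.0], dtype=np.float32)
        high = np.array([1000.0, 1.0, 5000.0, 1.0, 40.0, 1e5, 1e5, 1.0], dtype=np.float32)
        self.observation_space = spaces.Box(low=low, high=high, dtype=np.float32)
        # Single continuous action: the FCS/motor power split ratio in [0, 1].
        self.action_space = spaces.Box(low=0.0, high=1.0, shape=(1,), dtype=np.float32)

env = PowerSplitEnv()
print(env.observation_space.shape, env.action_space)  # (8,) Box(0.0, 1.0, (1,), float32)
```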

Algorithm: Given the episodic task setting with a continuous action space, the Soft Actor Critic (SAC) [3] algorithm is well suited to the problem. SAC is an off-policy actor-critic deep reinforcement learning algorithm designed for continuous action spaces. Off-policy algorithms are sample efficient because they reuse past experience gathered in a replay buffer for learning. This suits our scenario, where the route and vehicle dynamics are constant and environment exploration is bounded.

Reward: Table 1 shows the observations used for reward calculation. A reward of 100 is given when the agent navigates successfully to the end of the episode without draining the SoC. A small reward is given for every step towards the episode end, which promotes saving SoC, since the episode is terminated when the SoC reaches 0. A negative reward is given for the cumulative H2 consumption during the travel, with the objective of reducing H2 consumption. A penalty is given when the SoC drops below 10%.

Table 1. Reward components, positive/negative (+/-) contribution and their weights.
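A sketch of how these components might be combined into a single reward is given below; the weights, thresholds and the function signature are illustrative assumptions, not the values listed in Table 1.

```python
def compute_reward(soc, cumulative_h2_kg, reached_episode_end, soc_drained,
                   w_step=0.1, w_h2=1.0, w_low_soc=10.0):
    """Illustrative reward combining the components described above.

    The weights w_step, w_h2 and w_low_soc are placeholders, not the values from Table 1.
    """
    if soc_drained:              # SoC hit 0: episode terminates without the terminal bonus
        return 0.0
    reward = w_step              # small per-step reward for progressing towards the episode end
    reward -= w_h2 * cumulative_h2_kg   # penalize hydrogen consumption
    if soc < 0.10:               # penalty for letting the SoC drop below 10%
        reward -= w_low_soc
    if reached_episode_end:      # terminal bonus for finishing the trip without SoC drain
        reward += 100.0
    return reward
```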

4 Experiment

4.1 Setup

Building on the methodology and tool chain created by Bukic et al. [1] around Ray [5], an open-source distributed computing framework, we trained the network with 16 CPUs and 1 GPU. A vehicle model with a weight of 40 t, an initial battery SoC of 0.5 and a driving style of 0.5 was used for training. The driving style affects the acceleration, deceleration and velocity calculation and subsequently the required power at the wheels. A total of 8 routes were used for training. The policy and Q-value networks each used two fully connected hidden layers of 256 units, with a learning rate of 0.003 for both networks. A prioritized replay buffer with a capacity of 1,000,000 and a training batch size of 512 were used. Training was run for 50,000 iterations.
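For orientation, a comparable setup could be expressed with Ray RLlib's SAC implementation roughly as follows. This is a hedged sketch rather than the tool chain of [1]: the environment name PowerSplitEnv is hypothetical, the learning rates are omitted, and the exact keyword names for the model, replay buffer and worker settings differ between Ray versions.

```python
from ray.rllib.algorithms.sac import SACConfig

config = (
    SACConfig()
    .environment("PowerSplitEnv")  # hypothetical registered environment name
    .training(
        train_batch_size=512,
        # Two fully connected hidden layers of 256 units for the policy and Q networks.
        policy_model_config={"fcnet_hiddens": [256, 256]},
        q_model_config={"fcnet_hiddens": [256, 256]},
        # Prioritized replay buffer holding one million transitions.
        replay_buffer_config={
            "type": "MultiAgentPrioritizedReplayBuffer",
            "capacity": 1_000_000,
        },
    )
    .rollouts(num_rollout_workers=16)  # 16 CPU workers; newer Ray versions use .env_runners()
    .resources(num_gpus=1)
)

algo = config.build()
for _ in range(50_000):
    algo.train()
```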

4.2 Results

Figure 2 shows the power split between the FCS and the electric motor using the SAC method on the Brenner route, a route with pronounced uphill and downhill sections. The approach splits the total power almost equally, and the battery charges when the total power goes negative.

Fig. 2. Power split by the Soft Actor Critic (SAC) method on the Brenner route. P_total is the total required power, P_FCS is the power from the FCS and P_Bat is the power from the battery (electric motor).

Figure 3 shows the battery SoC comparison between the SAC and QO approaches on the Brenner route. The red line shows the altitude of the route. Table 2 compares the H2 consumption of the SAC power split method and the quadratic optimization method on the 8 routes. On all routes, the SAC approach shows lower H2 consumption, with a highest improvement of 6% and a lowest improvement of 1.4% compared to the quadratic optimization approach.

Fig. 3. Battery SoC with the Soft Actor Critic (SAC) method and the Quadratic Optimization (QO) method on the Brenner route, plotted against the altitude of the road.

Table 2. Drive cycles, distance covered and H2 consumption with the SAC power split approach and the quadratic optimization (QO) approach from Ferrara et al. [2]

4.3 Conclusion

The RL-based power split strategy has demonstrated its effectiveness in reducing fuel consumption in hybrid vehicles. The above approach relies on offline, predetermined route information and an ideal velocity profile to calculate the power requirement; enhancing the strategy to predict real-time power requirements from sensor data of the moving vehicle would improve its applicability and accuracy in real-world scenarios. Additionally, incorporating battery health parameters into the model would yield more sustainable battery performance. The current experiment is also limited to vehicles of a specific weight.