Roadmap and challenges for reinforcement learning control in railway virtual coupling

The ever-increasing demand for passenger and freight transportation is leading to the saturation of railway network capacity. Virtual Coupling (VC) has been proposed within the European Horizon 2020 Shift2Rail Joint Undertaking as a potential solution to this problem. It allows two or more trains to be dynamically connected in a single convoy, thus reducing the headway between them. In this work, we investigate the main challenges related to the potential deployment of VC in railways. Its feasibility through Reinforcement Learning techniques is explored, discussing technical implementation and performance issues. A qualitative analysis based on a Deep Deterministic Policy Gradient control algorithm is proposed. The aim is to give a first insight towards the definition of a qualitative and technology roadmap which could lead to the deployment of artificial intelligence applications aimed at enhancing rail safety and automation.


Introduction
The railway network is approaching its capacity limits, especially on highly frequented lines, due to the increasing demand for both passenger and freight transportation. As a consequence, a lack of flexibility within railway operations is emerging, with delays and overcrowding for passenger trains, and inefficient use of transportation capacity for freight trains. Extending railway lines is not always possible due to the lack of space, the high costs, and the long times related to the building of additional infrastructure.
To overcome these limitations, the Shift2Rail MOVINGRAIL [1] and X2RAIL3 [2] programs proposed a novel paradigm in train control systems based on the concept of Virtual Coupling (VC). Its primary objectives are improving infrastructure utilisation and increasing the capacity of existing railway lines. The idea, closely related to the platooning concept in the automotive field, is to virtually couple Autonomous Trains (ATs) via Train-to-Train (T2T) communication, so that they can travel in formation at the same velocity while maintaining a desired inter-train safety distance [3]. In this operative scenario, each train shares information via T2T communication networks with its neighbours and with the external infrastructure, e.g., the Radio Block Center (RBC) or the interlocking signalling, representing the trackside controller responsible for train separation [4]. On the basis of this shared information, each train, equipped with an on-board control system, should be able to automatically adapt its motion, guaranteeing the tracking of a desired reference profile and maintaining a safe distance from the preceding train [5].
Of course, many open challenges must be solved to assess the effectiveness of railway VC (political, economic, safety evaluation, etc.). Among them, the interest of the scientific community is focusing on the potential use of AI techniques for its implementation; see, for instance, the European Shift2Rail RAILS program [6], which focuses on the identification of roadmaps for AI integration in the rail sector, also exploiting transferability from other transportation sectors. In this context, the aim of this work is twofold: i) investigating whether and how the automotive platooning concept can be transferred to the railway domain; ii) exploring the feasibility of VC in railways from a control viewpoint through AI techniques. To this end, we provide a qualitative analysis of the exploitation of AI methods in order to derive a technical roadmap for future VC deployment.

Description and background
The most common national and international railway systems currently applied and in development are fixed blocks (ETCS Level 1 and 2 and most national signalling systems) and moving blocks (ETCS Level 3). As stated in [2], both fixed and moving block approaches limit the potential line capacity, due to the Absolute Braking Distance (ABD) supervision: each train takes into consideration only its own braking characteristics to determine its permitted speed. As a consequence, following trains are kept at a distance that is unnecessarily large compared to what safety actually requires.
VC goes beyond the concept of moving block railway operations introduced by ETCS Level 3, allowing trains to be separated by a relative braking distance (RBD) rather than by an ABD. The concept of Virtually Coupled Train Sets (VCTS) implies the paradigm called braking the walls: from absolute to relative braking distance (see Fig. 1).
It relies on a mutual exchange of relevant information between two or more following trains via T2T communication, in order to allow them to run at a closer distance than the ABD of the rear consist. Trains running as virtually coupled should be capable of sharing with each other the set of data regarding their specific dynamic behaviour. In this way, the VCTS should be able to compute cooperative braking curves, which integrate the parameters related to the braking characteristics and the status of the consists, as well as other parameters such as communication network delays, exogenous factors, and so on. As a result, consists inside the platoon could be allowed to move as a convoy with a safe distance much lower than the braking distance needed for a full stop [8]. Given the relatively short distances between trains, VC also implies that trains are driven automatically by an Automatic Train Operation (ATO) system, so as to substantially reduce the sight and reaction times of human drivers, which would be unsafely long for this kind of operation. The VC paradigm could bring a broad set of advantages over the traditional way of operating a railway network, such as an increase in line capacity by reducing headway, an increase in operational flexibility by ensuring interoperability, cost reduction, and an increase in competitiveness by making goods and passenger transportation more efficient with respect to road transportation [2].
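To make the difference between ABD and RBD concrete, the following minimal sketch compares the two separations for a pair of trains. It assumes constant decelerations and ignores delays, gradients and safety margins; the function names and the numerical values are illustrative assumptions and are not taken from [2] or [8].

```python
def braking_distance(speed_mps: float, decel_mps2: float) -> float:
    """Stopping distance from speed_mps at constant deceleration: v^2 / (2 b)."""
    if decel_mps2 <= 0.0:
        raise ValueError("deceleration must be positive")
    return speed_mps ** 2 / (2.0 * decel_mps2)


def abd_separation(v_rear: float, b_rear: float) -> float:
    """Absolute braking distance: the front train is treated as a standing obstacle."""
    return braking_distance(v_rear, b_rear)


def rbd_separation(v_rear: float, b_rear: float, v_front: float, b_front: float) -> float:
    """Relative braking distance: the stopping distance of the front train is credited."""
    difference = braking_distance(v_rear, b_rear) - braking_distance(v_front, b_front)
    return max(difference, 0.0)


# Hypothetical values: both trains at 40 m/s, rear train braking at 0.8 m/s^2,
# front train braking at 1.0 m/s^2.
abd = abd_separation(40.0, 0.8)
rbd = rbd_separation(40.0, 0.8, 40.0, 1.0)
```

With these assumed values, the ABD is about 1000 m while the RBD is about 200 m; with identical speeds and braking rates the RBD collapses to zero, so that only the additional safety margins would determine the separation.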
Despite its supposed benefits, the deployment of VCTS still requires a clearer understanding of its implications in terms of feasibility, safety and actual capacity benefits. Research carried out so far on railway VC provides preliminary results and points out open challenges and critical issues. For example, [9] proposed a systematic classification of train coupling and of its operational potential; [10] introduced preliminary operational concepts for VC by defining an extended blocking-time model for comparing the capacity occupation of VC with ETCS L3 moving-block operation. In [11], a train-following model describing train operations under VC has been developed, along with an assessment of the capacity performance under different operational settings. Capacity is measured in terms of space separation and time headway between consecutive trains. Simulation experiments show that VC significantly reduces train separation and time headway, for all the appraised scenarios, compared to ETCS Level 2 and Level 3. This is a promising result, since it suggests that VC could actually provide significant capacity improvements with respect to current railway systems.
The implications of VC in ERTMS/ETCS operational scenarios have also been investigated in [8], where the main advantages, current obstacles and future developments for the effective implementation of VC within the ERTMS standard specification have been introduced. Likewise, [3] has proposed a proof of concept by introducing a specific operating mode within the ERTMS/ETCS standard, together with a coupling control algorithm accounting for time-varying delays affecting the communication links. The identification of technical performance requirements for VC communication (direct T2T over short and long distances, low latency, etc.) has been addressed in [1]. Specifically, it concludes that the communication architecture for VC should be based on 5G principles, with a cellular network connection for long-distance communication and a peer-to-peer direct link similar to IEEE 802.11 (Wi-Fi), but fully integrated into 5G, for short-range communication.
Available studies have only partially touched upon the effects of risk factors on safety and capacity. [12] proposed a SWOT analysis for different railway market segments to assess the feasibility of VC and investigate the applicability of such a concept, pointing out the safety, operational, and technological challenges that need to be carefully addressed. In addition, [8] and [3] highlighted the need to investigate the safety-related impact of VC against hazardous scenarios and critical failures. In this direction, [13] introduced the concept of a dynamic safety margin for VCTS, which dynamically adjusts train separations so as to always keep the required safety distances when hazardous operational events occur. The aim is to take into account relevant risk factors in real-life operations, such as T2T communication delays, extended driving reaction times, train positioning errors and emergency braking applications. Indeed, when considering those factors, the train separation under VC needs to be increased by additional safety margins to remove any safety risks arising from their individual or combined presence.
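As a rough illustration of how such risk factors could translate into additional separation, the sketch below adds up the distance travelled during the T2T delay and the reaction time, plus the positioning uncertainty of both trains. This additive form, the function name and the numbers are our own simplifying assumptions; it is not the dynamic safety margin formulation of [13].

```python
def dynamic_safety_margin(speed_mps: float, t2t_delay_s: float,
                          reaction_time_s: float, position_error_m: float) -> float:
    """Extra separation (m) covering delays and positioning errors (assumed additive)."""
    if min(speed_mps, t2t_delay_s, reaction_time_s, position_error_m) < 0.0:
        raise ValueError("all inputs must be non-negative")
    # Distance covered by the rear train before it can start reacting.
    blind_distance = speed_mps * (t2t_delay_s + reaction_time_s)
    # The position of both the front and the rear train is uncertain.
    return blind_distance + 2.0 * position_error_m


# Hypothetical values: 40 m/s, 0.5 s T2T delay, 1.0 s reaction time, 5 m position error.
margin = dynamic_safety_margin(40.0, 0.5, 1.0, 5.0)
```

Under these assumptions the margin grows linearly with speed, which is why a fixed margin would be either too conservative at low speed or insufficient at high speed.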
Some devices for rail-bound vehicles able to regulate the speed of one vehicle in order to couple it with the preceding one have also been proposed; see, for instance, [14].
Although the VC concept has been widely explored from a conceptual point of view or via simulations, few works so far have focused on the development of suitable control systems to be installed on board. Regarding this gap, interesting results are presented, for instance, in [15] (and references therein), where a VC controller has been proposed which accounts for T2T communication topologies and different maneuvers of the trains composing the platoon; in [11], the authors capture multiple VC train operation states and the transitions between them; safety issues are analysed in [16], which also provides several monitoring methods. In addition, building on the developments in platooning of autonomous vehicles, different control strategies have been proposed in the technical literature for realising railway VC. Specifically, an optimal Model Predictive Control approach has been leveraged in [17] to guarantee a safe distance between two consecutive trains under state/input constraints, and in [18] for metro lines. A Distributed Model Predictive Control (DMPC) is suggested in [19] to consider the coupled constraint of safety braking distance and the individual constraints of speed limit variations and restricted traction/braking performance.
In order to assess the potential application of AI techniques in railway VC, it is worth noting that, besides the model-based control techniques mentioned above, model-free and Deep Reinforcement Learning (DRL) controllers are spreading in the ITS field. First attempts in the cooperative driving of autonomous connected vehicles can be found in [20,21], where the platooning problem is solved via a Deep Deterministic Policy Gradient (DDPG) and a Hybrid DRL approach, respectively. With respect to railway systems, Deliverable 2.1 of the Shift2Rail RAILS project [22] proposed a comprehensive analysis of the transferability of AI techniques from other transportation sectors (i.e., automotive). In this direction, DRL strategies have been explored, e.g., for improving train timetable rescheduling/routing in [23] or for controlling a single train in an energy-efficient way in [24]. However, to the best of our knowledge, the possibility of applying the DRL concept to the VCTS paradigm has not been investigated yet.

Objectives and research questions
The VCTS system should be able to provide various functionalities which can be grouped into different classes. In the conceptual architecture proposed in [2], these classes are organised in a vertical four-layer structure, characterised by distinct levels of abstraction: from a macroscopic view of the whole railway network to the microscopic movements of single convoys. The railway VC functionality, as well as the related control strategies, belongs to the tactical layer which is in charge of managing the VCTS by defining the speed and acceleration targets, as well as the headway between trains. As a consequence, line capacity strongly depends on the effectiveness of the tactical layer functionalities.
The main objectives of the VC paradigm are the improvement of infrastructure utilisation, the increase of the capacity of existing railway lines (i.e. reduction of trip times, headway, etc.) with respect to current systems, as well as the increase of flexibility by operating platoons of heterogeneous trains. In an ideal driving scenario with no communication impairments, uncertainties or inaccuracies, VC objectives would be optimised, as the RBD would be zero. Indeed, in this case, trains could travel at the same speed, and thus follow the same braking curve, never changing the distance between them [25]. However, in reality, railway environments are characterised by multiple sources of uncertainty that cause deviations from this ideal behaviour, hence requiring additional safety margins and thus affecting the efficiency of VC. These can be:
• reaction delay, that is, the elapsed time between the brake applications of two trains;
• latencies in T2T communication;
• heterogeneity in train dynamics, since each train can have different operational performances (e.g., different braking capabilities, different speed categories);
• variable train mass (e.g., freight trains);
• track conditions (e.g., adhesion factors, gradients);
• exogenous factors (e.g., weather conditions);
• uncertainties in train location information.
In an attempt to define the magnitudes of the uncertainties of the railway environment, we highlight that some of them can be provided by the manufacturer, such as those related to train mass, braking capabilities, speed categories, and reaction delays, while uncertainties in track conditions, such as slope and curvature radius, depend on the characteristics of the railway line and can be provided by the railway line infrastructure manufacturer. Besides that, weather conditions introduce uncertainties in the adhesion factors [26], and external disturbances such as wind are highly dependent on the geographical location of the railway line under consideration. Finally, since the VC communication infrastructure is still not available, latencies in T2T communications cannot be defined yet.
Therefore, the performances of VC could be strongly degraded due to the effects of all the uncertainties characterising real-life railway scenarios.
In view of all the considerations made above, some research questions arise:
• RQ1 Which aspects of vehicle platooning can be transferred to railway VC, and how?
• RQ2 Can AI methods be exploited to ensure the effectiveness of the tactical layer functionalities both in nominal and uncertain environments?
• RQ3 Can we address possible answers to RQ1 and RQ2 through a proof of concept in order to inspire future developments and a technology roadmap?
In order to answer the above questions, and thus to evaluate the effectiveness of the VCTS paradigm, some criteria should be considered. In particular, some risk assessment indexes could be transferred from vehicle platooning, such as time headway and time to collision, which are two typical rear-end collision indexes [27]. Furthermore, other performance indicators already exploited in the automotive field, such as energy consumption [28] or tracking error [29], could be transferred to the railway domain. Indeed, as already pointed out in the previous section, some of the first studies on VCTS (see for instance [11]) actually exploit some Key Performance Indexes (KPIs) which are well established in automotive. In view of this, in Table 1 we propose some KPIs derived from the automotive field which could be adapted and transferred to the railway sector and, hence, used to evaluate the performance of railway VC.
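The two rear-end collision indexes mentioned above can be computed from quantities that are available on board. The following minimal sketch uses their usual car-following definitions; computing the time headway on the gap between the tail of the front train and the head of the rear train is an assumption made here, and the function names are illustrative.

```python
import math


def time_headway(gap_m: float, rear_speed_mps: float) -> float:
    """Time (s) the rear train needs to cover the current gap at its current speed."""
    if rear_speed_mps <= 0.0:
        return math.inf  # a standing train never reaches the train ahead
    return gap_m / rear_speed_mps


def time_to_collision(gap_m: float, rear_speed_mps: float, front_speed_mps: float) -> float:
    """Time (s) until the gap closes if both trains keep their current speeds."""
    closing_speed = rear_speed_mps - front_speed_mps
    if closing_speed <= 0.0:
        return math.inf  # the gap is constant or opening
    return gap_m / closing_speed


# Hypothetical values: 200 m gap, rear train at 40 m/s, front train at 35 m/s.
thw = time_headway(200.0, 40.0)
ttc = time_to_collision(200.0, 40.0, 35.0)
```

A small time headway with an infinite time to collision corresponds to a tight but stable convoy, whereas a small time to collision signals an imminent hazard regardless of the headway.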

RL for VC: from automotive to railway
In the automotive field, most current autonomous driving decision-making systems focus on model-based approaches, which require prior knowledge of the environment and of the vehicle characteristics to manually design a driving policy. Thus, traditional controllers make use of an a priori model composed of fixed parameters. However, in complex environments, they cannot foresee every possible situation that the system has to cope with.
Recent advances in Machine Learning (ML) enable learning-based approaches in autonomous driving decision making. Learning controllers make use of training information to learn their models over time. With every gathered batch of training data, the approximation of the true system model becomes more accurate, thus enabling model flexibility and uncertainty estimates through Gaussian processes [30].
Among ML methods, supervised and unsupervised learning exploit defined and fixed training datasets [31]; conversely, RL methods work in high-dimensional, continuous spaces [32], and are thus better suited to physical control tasks such as railway VC.
Specifically, RL methods are divided into three macro-classes: i) value-based methods, which are exploited for decision-making problems; ii) policy-based methods, which allow a continuous control action to be obtained, but suffer from several issues in terms of training times and convergence; iii) actor-critic methods, which are spreading in the control field [33].
In this direction, as highlighted in Deliverable D2.1 of RAILS WP2 [34], few contributions regarding AI approaches for vehicle platooning have been proposed so far; however, they show promising results when specific actor-critic RL methods are exploited. Namely, [35] proposed an RL optimal controller for the vehicle platooning problem based on a DDPG algorithm. The results have been compared with a conventional Model Predictive Control (MPC) strategy, and simulations confirm that the RL controller outperforms the MPC in terms of computational time and control effort, especially in more realistic and complex scenarios, while maintaining a similar root mean square error in the inter-vehicle distances. Similarly, [36] compares DRL based on DDPG and traditional MPC for Adaptive Cruise Control (ACC) design in car-following scenarios. Simulations show that, when there are no modelling errors and the testing inputs are within the training data range, the DRL solution is equivalent to MPC with a sufficiently long prediction horizon. The DRL control performance degrades when the testing inputs are outside the training data range; however, when there are modelling errors due to control delay, disturbances, or uncertainties, the DRL-trained policy performs better than MPC when the modelling errors are large, while having similar performance to MPC when the modelling errors are small.
From the results in automotive, it emerges that RL methods based on DDPG can outperform conventional model-based approaches, especially in complex and uncertain scenarios. Since the railway environment also includes many factors and uncertainties which could affect the performance of railway VC, it seems that DDPG-based RL approaches could be preferred to conventional model-based control algorithms for the deployment of the tactical layer functionalities, hence addressing RQ1. Namely, they could be capable of adapting the optimal solution in real time, considering all current uncertainties and contingencies, without any prior knowledge of the railway environment (see RQ2). In view of these considerations, we propose a proof of concept to explore the technical feasibility of railway VC through RL methods based on DDPG. To the best of our knowledge, no AI solutions have been proposed so far to tackle VC control strategies in railways. Hence, we define a possible technology roadmap for VCTS feasibility, therefore addressing RQ3.

Model-free DDPG control for VCTS
On the basis of the considerations made above, we aim at investigating how the tactical layer functionalities could be deployed through an RL control algorithm based on DDPG. The controller should take into account the main objectives of the VCTS paradigm, such as enhancing safety and improving railway capacity and performance at the same time, without any prior knowledge of the surrounding environment.
To this aim, DDPG is a model-free actor-critic method able to learn a competitive policy in a continuous action space using states in the designed observation space [32]. In other words, DDPG allows an agent to be trained by interacting with an unknown environment via a learn-by-doing process: the RL agent learns, through trial and error, the best way to accomplish a task, so that the trained agent becomes able to adapt and react to the surrounding environment. The actor-critic paradigm combines two different RL methods: i) a value-based RL approach, based on the Deep Q-Network (DQN), which approximates the nonlinear state-action value function and allows the quality of the chosen action to be evaluated; ii) a policy method based on the Deterministic Policy Gradient (DPG), which bypasses the evaluation of the quality value via a direct estimation of the relation between the observation state and the action.
As depicted in Fig. 2, the typical architecture of a DDPG-based control algorithm is composed of four different Deep Neural Networks (DNNs), namely: actor, critic, target actor and target critic. The actor (DPG-based) aims at computing the control inputs to be imposed on the trains via the estimation of the competitive policy, while the critic (DQN-based) evaluates the action chosen by the actor. The target networks are exploited to stabilise the training process and have the same structure as the corresponding main networks. Indeed, to guarantee that an RL agent learns a given task, a training process is required. This is based on a reward function, which gives an evaluation of the quality of the chosen action. In particular, the policy embedded in the agent itself is updated by maximising a cumulative reward obtained via the interactions with the environment. Hence, the reward is the feedback that carries the information on the adequacy of the performance achieved with the agent's policy.
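The stabilising role of the target networks can be illustrated with the two update rules used by DDPG [32]: the slow (Polyak) tracking of the main network parameters by the target networks, and the regression target of the critic, which is computed with the target networks only. The sketch below treats network parameters as plain lists of floats for brevity; the function names and the values of tau and gamma are illustrative assumptions, and a real implementation would rely on a deep learning framework.

```python
from typing import List


def soft_update(target_params: List[float], main_params: List[float],
                tau: float = 0.005) -> List[float]:
    """Polyak averaging: target <- tau * main + (1 - tau) * target."""
    return [tau * m + (1.0 - tau) * t for t, m in zip(target_params, main_params)]


def critic_target(reward: float, next_q_from_targets: float, done: bool,
                  gamma: float = 0.99) -> float:
    """Critic regression target y = r + gamma * Q'(s', mu'(s')), with no bootstrap at episode end."""
    return reward if done else reward + gamma * next_q_from_targets


# Hypothetical numbers: the target parameters move only 10% of the way to the main ones.
new_target = soft_update([0.0, 1.0], [1.0, 1.0], tau=0.1)
y = critic_target(reward=1.0, next_q_from_targets=10.0, done=False, gamma=0.9)
```

Because the target parameters change slowly, the value y that the critic is regressed onto does not shift abruptly after every gradient step, which is what keeps training stable.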
In view of this, according to the technical literature [19,37], a decentralized DDPG-based VCTS could be leveraged to solve the tactical layer problem (see Fig. 3 for a conceptual view of a DDPG-based approach for VCTS). To this aim, the following features should be properly defined: the action space, the observation space, and the reward function. Specifically, the continuous action space represents the control input, e.g., the acceleration/deceleration of the consists at each time step. The observation space is composed of all the observable states which allow a perception of the scenario. In designing the observation space, the following information should be considered: the desired reference coming from the RBC through Train-to-Infrastructure (T2I) communication, train positions and velocities coming from T2T communication (according to the communication topology), and some train characteristics, such as train length, energy consumption, and so on.
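A possible encoding of the observation and action spaces for one train of the convoy is sketched below, assuming a predecessor-following communication topology in which each train only receives the state of the train directly ahead. The class name, the choice of observed quantities and the acceleration limits are hypothetical and only serve to make the description concrete.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrainState:
    head_position_m: float  # position of the train head along the track
    speed_mps: float
    length_m: float


def build_observation(ego: TrainState, front: TrainState,
                      reference_speed_mps: float) -> List[float]:
    """Observation vector built from T2T data (front train) and the T2I speed reference."""
    gap = front.head_position_m - front.length_m - ego.head_position_m
    return [
        gap,                                  # tail-to-head distance (m)
        ego.speed_mps - front.speed_mps,      # closing speed (m/s)
        ego.speed_mps,                        # own speed (m/s)
        reference_speed_mps - ego.speed_mps,  # tracking error w.r.t. the RBC reference
    ]


def clip_action(accel_cmd_mps2: float, max_brake_mps2: float = 1.0,
                max_traction_mps2: float = 0.7) -> float:
    """Continuous action: commanded acceleration saturated to assumed traction/braking limits."""
    return max(-max_brake_mps2, min(max_traction_mps2, accel_cmd_mps2))


# Hypothetical scenario: a 200 m long front train whose head is 500 m ahead of the ego head.
ego = TrainState(head_position_m=0.0, speed_mps=40.0, length_m=200.0)
front = TrainState(head_position_m=500.0, speed_mps=38.0, length_m=200.0)
obs = build_observation(ego, front, reference_speed_mps=42.0)
```

Using relative quantities (gap, closing speed, tracking error) rather than absolute positions keeps the observation independent of where the convoy is on the line, which is a common design choice in platooning controllers.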
An accurate shaping of the reward function is necessary to take into account the main objectives of the VCTS paradigm. In particular, in order to enhance safety and improve railway capacity and performance at the same time, the reward function could take into account some of the risk assessment KPIs defined in Table 1, such as time headway or time to collision, as well as other objectives, such as energy consumption, or even comfort (for an RL-based ACC controller, a suitable reward function has been proposed in [38]). In other words, if the reward function were built with a proper combination of the aforementioned KPIs, maximising the reward would accordingly mean trying to maximise VCTS safety and railway capacity at the same time. On the other hand, the convergence of the DDPG algorithm to the optimal solution is strictly related to the accuracy of the reward function itself: the more objectives are considered in the reward function, the more sub-optimal the achievable solutions could be. An accurate learning process should allow the RL agent to learn how to react to the unknown surrounding environment, which embeds the unpredictable and uncertain factors characterising dynamic and complex railway scenarios, such as track conditions, exogenous factors, and so on. Note that the consists themselves and their operational performances are embedded in the environment. This means that the proposed approach could allow the coupling of trains with different operational capabilities, ensuring platooning among heterogeneous trains while absorbing uncertainties arising in real driving conditions.
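One way such a combination of KPIs could look is sketched below as a weighted sum of penalties on time headway deviation, low time to collision, traction effort and acceleration changes. The structure, the weights, the thresholds and the collision penalty are purely illustrative assumptions; this is neither the reward of [38] nor a tuned design.

```python
import math


def vc_reward(gap_m: float, ego_speed_mps: float, front_speed_mps: float,
              accel_mps2: float, prev_accel_mps2: float,
              target_headway_s: float = 3.0, ttc_threshold_s: float = 10.0,
              w_headway: float = 1.0, w_ttc: float = 5.0,
              w_energy: float = 0.1, w_comfort: float = 0.1,
              collision_penalty: float = 100.0) -> float:
    """Reward = minus a weighted sum of KPI penalties (all weights are assumptions)."""
    if gap_m <= 0.0:
        return -collision_penalty  # trains have touched: large terminal penalty

    # Capacity: stay close to the desired time headway.
    headway = gap_m / ego_speed_mps if ego_speed_mps > 0.0 else math.inf
    headway_cost = 0.0 if math.isinf(headway) else abs(headway - target_headway_s)

    # Safety: penalise a time to collision below the threshold.
    closing_speed = ego_speed_mps - front_speed_mps
    ttc = gap_m / closing_speed if closing_speed > 0.0 else math.inf
    ttc_cost = max(0.0, ttc_threshold_s - ttc) / ttc_threshold_s

    # Energy proxy: traction power per unit mass (braking is not charged).
    energy_cost = max(accel_mps2, 0.0) * ego_speed_mps

    # Comfort: penalise changes of acceleration between consecutive steps.
    comfort_cost = abs(accel_mps2 - prev_accel_mps2)

    return -(w_headway * headway_cost + w_ttc * ttc_cost
             + w_energy * energy_cost + w_comfort * comfort_cost)


# Hypothetical situations: nominal following versus a closing, too-short gap.
nominal = vc_reward(120.0, 40.0, 40.0, 0.0, 0.0)
risky = vc_reward(50.0, 40.0, 30.0, 0.0, 0.0)
```

With these assumed weights the nominal situation yields a zero penalty, while the closing, short-gap situation is penalised through both the headway and the time-to-collision terms; changing the weights shifts the trade-off between capacity, safety, energy and comfort, which is precisely why reward shaping affects the solution the agent converges to.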
It is worth highlighting that, in order to ensure the effectiveness of VCTS, the technical performance requirements for T2T communications (direct T2T over short and long distances, low latency, etc.) should be satisfied. As emerged from [1], the low latencies ensured by 5G technology could meet the T2T communication requirements for VCTS.

Training and validation
As already pointed out, unlike other ML approaches such as supervised and unsupervised learning, which exploit defined and fixed training datasets [31], DRL methods work in high-dimensional, continuous action spaces in order to deal with physical control tasks [32]. Such large action spaces are difficult to explore efficiently, and any dataset would be inadequate for this purpose. That is why, as also emerges from the automotive field (see Deliverable D2.1 of RAILS WP2 [34]), simulators for virtual validation and training are required. There are several advantages in using simulators as a training tool for RL. The first is that they can provide many more samples, since simulations can be significantly faster and cheaper than real experiments. The second is safety, which cannot be guaranteed in real scenarios for the trial-and-error learning of RL, especially in worst-case situations [39]. Thus, a preliminary training phase of the RL agent is required in order to teach it how to react, especially in dangerous scenarios, such as collisions. This phase can be conducted through simulators, which allow worst-case scenarios to be simulated safely. Then, by leveraging the transfer learning approach [40], the agent's performance can be further improved with more advanced training in a real environment, also leveraging human experience in a completely safe manner.
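As an example of a worst-case scenario that can only be explored safely in simulation, the sketch below lets the front train apply an emergency brake while the rear train reacts after a delay, and returns the minimum gap reached. It is a toy point-mass model with constant decelerations and explicit Euler integration; the function name and all numerical values are assumptions, and it is not a substitute for a railway simulator.

```python
def min_gap_after_emergency_brake(initial_gap_m: float, speed_mps: float,
                                  b_front_mps2: float, b_rear_mps2: float,
                                  reaction_delay_s: float, dt: float = 0.01) -> float:
    """Minimum gap (m) when the front train brakes at t = 0 and the rear one after a delay.

    A negative return value means that the two trains would have collided.
    """
    if b_front_mps2 <= 0.0 or b_rear_mps2 <= 0.0 or dt <= 0.0:
        raise ValueError("decelerations and time step must be positive")
    x_front, v_front = initial_gap_m, speed_mps
    x_rear, v_rear = 0.0, speed_mps
    t, min_gap = 0.0, initial_gap_m
    while v_front > 0.0 or v_rear > 0.0:
        v_front = max(0.0, v_front - b_front_mps2 * dt)
        if t >= reaction_delay_s:  # the rear train starts braking only after the delay
            v_rear = max(0.0, v_rear - b_rear_mps2 * dt)
        x_front += v_front * dt
        x_rear += v_rear * dt
        t += dt
        min_gap = min(min_gap, x_front - x_rear)
    return min_gap


# Hypothetical case: both trains at 40 m/s with equal braking and a 1 s reaction delay.
safe_case = min_gap_after_emergency_brake(100.0, 40.0, 1.0, 1.0, 1.0)
crash_case = min_gap_after_emergency_brake(30.0, 40.0, 1.0, 1.0, 1.0)
```

With equal braking rates the rear train loses roughly speed times delay (about 40 m here) of separation, so a 100 m initial gap is sufficient while a 30 m gap leads to a collision; an RL agent can be exposed to many such cases in simulation without any physical risk.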
To the best of our knowledge, even though some simulators are currently available for railway virtual testing and RL training (see for instance SUMO and AnyLogic), they do not allow the VCTS paradigm to be considered, since it also requires T2T communications to be taken into account. On the other hand, a simulator such as that proposed in [11] is an ad-hoc solution designed to validate the control protocol suggested by the authors in specific scenarios.
This represents a critical aspect for the potential exploitation of RL methods and the deployment of railway VC via DRL methods. In view of this, an advanced ad-hoc simulator should be provided for the validation and the training of the VCTS paradigm. This simulator should take into account different operative scenarios, train characteristics, track conditions and exogenous factors, to be exploited during the training phase. Note that there is still a gap to be filled between the accuracy of a simulated world and real-world scenarios [41], and this could affect the performance of learning systems.
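One way in which such a simulator could expose the agent to varied conditions is to randomise the scenario parameters at the start of every training episode. The sketch below samples adhesion, gradient, train mass, braking capability and T2T latency from uniform ranges; the class, the function and, above all, the numerical ranges are hypothetical placeholders, since realistic values would have to come from manufacturers, infrastructure data and the future communication infrastructure.

```python
import random
from dataclasses import dataclass

G = 9.81  # gravitational acceleration (m/s^2)


@dataclass(frozen=True)
class Scenario:
    adhesion: float           # wheel-rail adhesion coefficient (weather dependent)
    gradient_permille: float  # track gradient, positive uphill
    mass_t: float             # train mass in tonnes (e.g., loaded vs empty freight)
    max_brake_mps2: float     # deceleration actually achievable
    t2t_latency_s: float      # T2T communication latency


def sample_scenario(rng: random.Random) -> Scenario:
    """Draw one training scenario; all ranges are illustrative assumptions."""
    adhesion = rng.uniform(0.05, 0.30)
    nominal_brake = rng.uniform(0.6, 1.2)
    return Scenario(
        adhesion=adhesion,
        gradient_permille=rng.uniform(-20.0, 20.0),
        mass_t=rng.uniform(200.0, 2000.0),
        # Braking cannot exceed what the available adhesion can transmit.
        max_brake_mps2=min(nominal_brake, adhesion * G),
        t2t_latency_s=rng.uniform(0.01, 0.5),
    )


# A seeded generator makes the sequence of training scenarios reproducible.
scenarios = [sample_scenario(random.Random(seed)) for seed in range(3)]
```

Training over a distribution of scenarios rather than a single nominal one is a common way to narrow the gap between simulated and real-world behaviour, although it does not remove it.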

Expected results and concluding remarks
To assess the feasibility of RL-based VCTS in different operational scenarios, also considering uncertain environments, heterogeneous train sets and exogenous factors, it is possible to leverage the KPIs proposed in Table 1. Furthermore, they can be exploited to evaluate the performance of the proposed RL approach, comparing it with that achievable via traditional model-based control strategies, or with that ensured by current fixed and moving block railway systems. As follows from the automotive field, it is expected that a DDPG-based RL algorithm could overcome the limitations of VC model-based approaches, especially when considering uncertain environments.
Possible criticalities in using DDPG-based RL methods for the deployment of the VCTS paradigm could be represented by:
• the shaping of the reward function: the design of suitable reward functions can improve the performance of the proposed control strategy. This could be achieved, e.g., after testing a variety of reward combinations [42];
• the training process: a suitable railway simulator should be provided for the validation and the training phase, since the robustness/resilience of the proposed RL method to the unmodelled and uncertain factors affecting the railway environment can be improved via a correct and deep training process.
In conclusion, it is worth highlighting that the proposed methodology is very challenging, since the usage of AI models in safety-critical systems is still an open debate and some efforts are being carried out to bring AI and functional safety closer to each other [41]. In this direction, this proof of concept is intended to give a first insight towards the definition of a qualitative and technology roadmap which could lead to the deployment of AI applications for the enhancement of railway systems safety and their grade of automation.