POMDP inference and robust solution via deep reinforcement learning: An application to railway optimal maintenance

Partially Observable Markov Decision Processes (POMDPs) can model complex sequential decision-making problems under stochastic and uncertain environments. A main reason hindering their broad adoption in real-world applications is the lack of availability of a suitable POMDP model or a simulator thereof. Available solution algorithms, such as Reinforcement Learning (RL), require knowledge of the transition dynamics and the observation generating process, which are often unknown and non-trivial to infer. In this work, we propose a combined framework for inference and robust solution of POMDPs via deep RL. First, all transition and observation model parameters are jointly inferred via Markov Chain Monte Carlo sampling of a hidden Markov model, which is conditioned on actions, in order to recover full posterior distributions from the available data. The POMDP with uncertain parameters is then solved via deep RL techniques, with the parameter distributions incorporated into the solution via domain randomization, in order to develop solutions that are robust to model uncertainty. As a further contribution, we compare the use of transformers and long short-term memory networks, which constitute model-free RL solutions, with a model-based/model-free hybrid approach. We apply these methods to the real-world problem of optimal maintenance planning for railway assets.


Introduction
Partially Observable Markov Decision Processes (POMDPs) offer a mathematically sound framework to model and solve complex sequential decision-making problems (Cassandra, 1998). POMDPs account for the uncertainty associated with observations in order to derive optimal policies, namely a sequence of optimal decisions that minimize/maximize the total costs/rewards over a prescribed time horizon, under stochastic and uncertain environments.
Stochasticity can indeed be incorporated both in the evolution of the hidden states over time, i.e., the transition dynamics, and in the process that generates the observations, which reflect only partial and/or noisy information on the actual states.
POMDPs form a potent mathematical framework to model optimal maintenance planning for deteriorating engineered systems (Papakonstantinou & Shinozuka, 2014a). In such problems, perfect information on the system's condition (state) is generally not available or feasible to acquire, due to the problem's scale, the inherent noise of sensing instruments, and associated cost limitations. By using sensors and inferred associated condition indicators, Structural Health Monitoring (SHM) tools, as described by Andriotis, Papakonstantinou, and Chatzi (2021); Farrar and Worden (2012); Straub et al. (2017), can provide estimates of the structural state. However, the provided observations are often incomplete and susceptible to noise, which limits their ability to accurately determine the true state of the system. Consequently, decision-making must occur in the face of irreducible uncertainty. Within a POMDP scheme, the decision maker (or agent) receives an observation from an SHM system, using it to form a belief about the current state of the system. Based on this belief, the agent takes an action, which will impact the future condition of the system. The POMDP objective is to find the optimal sequence of maintenance actions that minimizes the expected total costs over the operating life-cycle. A list of applications of POMDP modeling to optimal maintenance can be found in (Durango & Madanat, 2002; Ellis, Jiang, & Corotis, 1995; Kıvanç, Özgür-Ünlüakın, & Bilgiç, 2022; Madanat & Ben-Akiva, 1994; Memarzadeh, Pozzi, & Zico Kolter, 2015; Papakonstantinou, Andriotis, & Shinozuka, 2018; Schöbi & Chatzi, 2016).
POMDP solutions assume knowledge of the transition dynamics and the observation generating process. This implies strict prior assumptions on the POMDP model parameters that govern the deterioration, the effects of maintenance actions, and the relation of observations to latent states and variables.
When a POMDP model is available, the solution can be computed via Dynamic Programming (DP) (Bertsekas, 2012) and approximate methods (Papakonstantinou & Shinozuka, 2014b) with optimality convergence guarantees, when the complexity of the problem is not prohibitive, or via Reinforcement Learning (RL) schemes (Sutton & Barto, 2018) through samples and trial-and-error learning. While RL methods can relax some assumptions on the POMDP knowledge, a simulator that can reliably describe the POMDP model is still necessary for inference and testing purposes, particularly for engineering problems and in infrastructure asset management applications.
However, a full POMDP model of the problem is rarely available in real-world applications, and its inference can be quite challenging. The availability of such a model is a key issue that prevents wide adoption of the POMDP framework and its solution methods (e.g., reinforcement learning) for real-world applications. Available literature on the theme of maintenance planning is focused on developing RL methods to solve complex POMDP problems, as pioneered by the work of Andriotis and Papakonstantinou (2019, 2021), while assuming knowledge of the POMDP transition and observation models, i.e., by assuming, for example, that the POMDP inference has already been carried out. Only a few papers deal with the POMDP inference, which poses a challenge in itself, while best practices are not generally available. Papakonstantinou and Shinozuka (2014a); Song, Zhang, Shafieezadeh, and Xiao (2022); Wari, Zhu, and Lim (2023) propose methods to estimate the state transition probability matrix for deterioration processes, but without demonstrating inference on the transition matrices associated with maintenance actions. Guo and Liang (2022) propose methods to estimate both the transition and the observation models, but do not consider model uncertainty, and their implementation examples do not involve real-world data but only simulated ones.
In Arcieri et al. (2023), we tackle this key inference issue by proposing a framework to jointly infer all transition and observation model parameters entirely from available real-world data, via Markov Chain Monte Carlo (MCMC) sampling of a Hidden Markov Model (HMM), which is conditioned on actions. The framework, which is relatively easy to implement and can be tailored to the problem at hand, estimates full posterior distributions of the POMDP model parameters. By considering these distributions in the POMDP evaluation, optimal policies that are robust with respect to POMDP model uncertainties are obtained.
In this work, we combine the POMDP inference with a deep RL solution.
Most previous works on deep RL methods focus on fully observable problems, with RL solutions for POMDPs having received notably lower attention. Partial observability is usually overcome with deep learning architectures that are able to infer hidden states through memory and a history of past observations. Schmidhuber (1990) is one of the first works that applied Recurrent Neural Networks (RNNs) to RL problems. Subsequently, Long Short-Term Memory (LSTM) networks have become the standard to handle partial observability (Dung, Komeda, & Takagi, 2008; Meng, Gorbet, & Kulić, 2021; Zhu, Li, Poupart, & Miao, 2017). Recent works propose to replace LSTM architectures with Transformers (GTrXL) (Parisotto et al., 2020). A third modeling option, which constitutes a hybrid approach between a DP and an RL solution, exploits the POMDP model to compute beliefs via Bayes' theorem, which are then fed to the deep RL algorithm as inputs to classical feed-forward Neural Networks (NNs) (Andriotis & Papakonstantinou, 2019, 2021; Morato, Andriotis, Papakonstantinou, & Rigo, 2023). Namely, the POMDP problem is converted into the belief-MDP (Andriotis et al., 2021; Papakonstantinou & Shinozuka, 2014b) and then solved with deep RL techniques. We compare these three available solution methods and propose a joint framework of inference and robust solution of POMDPs based on deep RL techniques, by combining MCMC inference with domain randomization of the RL environment in order to incorporate model uncertainty into the policy learning.
We showcase the applicability of these methods and of the proposed framework on a real-world problem of optimal maintenance planning for railway infrastructure.The problem, modelled as a POMDP, is based on on-board railway monitoring data, namely the so-called "fractal values" condition indicator, computed from field measurements and provided by our SBB (the Swiss Federal Railways) partners.
The remainder of this paper is organized as follows. Section 2 provides the necessary background on POMDPs. Section 3 describes the considered maintenance planning problem of railway assets and the monitoring data. Section 4 describes the POMDP inference and its implementation for the problem here considered. Section 5 evaluates the three available modeling options of deep RL solutions for POMDPs, namely LSTM, GTrXL, and the belief-input case.
Section 6 proposes our joint framework of POMDP inference and robust solution via deep RL and domain randomization. Finally, Section 7 concludes with a summary and discussion of the contributions, and outlines possible future work.

Partially Observable Markov Decision Processes
A POMDP can be considered as a generalization of a Markov Decision Process (MDP) for modelling sequential decision-making problems within a stochastic control setting, with uncertainty incorporated into the observations. A POMDP is defined by the tuple ⟨S, A, Z, R, T, O, b_0, H, γ⟩, where:
• S is the finite set of hidden states that the environment can assume.
• A is the finite set of available actions.
• Z is the set of possible observations, generated by the hidden states and executed actions, which provide partial and/or noisy information about the actual state of the system.
• R : S × A → ℝ is the reward function that assigns the reward r_t = R(s_t, a_t) for taking action a_t at state s_t.
• T : S × S × A → [0, 1] is the transition dynamics model that describes the probability p(s_{t+1} | s_t, a_t) of transitioning to state s_{t+1} if action a_t is taken at state s_t.
• O : S × A × Z → ℝ is the observation generating process that defines the emission probability p(z_t | s_t, a_{t-1}, z_{t-1}), namely the likelihood of observing z_t given that the system is at state s_t, action a_{t-1} was taken, and z_{t-1} was previously observed.
• b_0 is the initial belief on the system's state s_0.
• H is the considered horizon of the problem, which can be finite or infinite.
• γ is the discount factor that discounts future rewards to obtain the present value.
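For concreteness, the elements of this tuple can be collected in a small container object. The following is a minimal sketch (in Python, with hypothetical field names, not the actual implementation accompanying this paper) for the discrete-state, continuous-observation setting considered here:

```python
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class POMDP:
    """Minimal container for a discrete-state POMDP (hypothetical field names)."""
    n_states: int    # |S|, number of hidden condition states
    n_actions: int   # |A|, number of available actions
    T: np.ndarray    # transition model, shape (A, S, S): T[a, s, s'] = p(s' | s, a)
    R: np.ndarray    # rewards (negative costs), shape (S, A)
    b0: np.ndarray   # initial belief over S, shape (S,)
    H: int           # decision horizon (number of time-steps)
    gamma: float     # discount factor
    # With continuous observations, the emission model is a density rather than
    # a matrix; it is left here as a log-likelihood callable:
    # obs_loglik(z, s, a_prev, z_prev) -> log p(z | s, a_prev, z_prev)
    obs_loglik: Callable[[float, int, int, float], float]
```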
In the POMDP setting, the agent takes a decision based on a formulated belief over the system's state. Such a belief is defined as a probability distribution over S, which maps the discrete finite set of states into a continuous space (Papakonstantinou & Shinozuka, 2014b). It is a sufficient statistic over the complete history of actions and observations. Solving a POMDP is thus equivalent to solving a continuous-state MDP defined over the belief space, termed the belief-MDP (Andriotis et al., 2021; Papakonstantinou & Shinozuka, 2014b). The belief over the system's state is updated according to Bayes' rule every time the agent receives a new observation:

$$b_{t+1}(s_{t+1}) = \frac{p(z_{t+1} \mid s_{t+1}, a_t, z_t) \sum_{s_t \in S} p(s_{t+1} \mid s_t, a_t)\, b_t(s_t)}{p(z_{t+1} \mid b_t, a_t, z_t)}, \qquad (1)$$

where the denominator is the normalizing factor:

$$p(z_{t+1} \mid b_t, a_t, z_t) = \sum_{s_{t+1} \in S} p(z_{t+1} \mid s_{t+1}, a_t, z_t) \sum_{s_t \in S} p(s_{t+1} \mid s_t, a_t)\, b_t(s_t). \qquad (2)$$

The objective of the POMDP is to determine the optimal policy π*, which maps beliefs to actions, that maximizes the expected sum of rewards:

$$\pi^* = \operatorname*{arg\,max}_{\pi} \; \mathbb{E}\left[\sum_{t=0}^{H} \gamma^t r_t\right], \qquad (3)$$

where r_t = R(s_t, π(b_t)). Algorithms based on DP (Bertsekas, 2012) can be used to compute the optimal policy. These algorithms rely on two key functions: the value function V^π, which calculates the expected sum of rewards for a policy π starting from a given state until the end of the prescribed horizon, and the action-value function Q^π (Sutton & Barto, 2018), which estimates the expected value of taking action a_t in state s_t and then following policy π.
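As an illustration of Equations 1 and 2, the belief update can be implemented in a few lines. The sketch below assumes a transition array T[a, s, s'] and an observation log-likelihood callable as in the container sketched above:

```python
import numpy as np

def update_belief(b, a, z, z_prev, T, obs_loglik):
    """Bayesian belief update (Equations 1-2) for a discrete-state POMDP.

    b          : current belief over states, shape (S,)
    a          : action just taken
    z, z_prev  : new and previous observations
    T          : transition model, shape (A, S, S)
    obs_loglik : callable (z, s, a, z_prev) -> log p(z | s, a, z_prev)
    """
    # Prediction step: propagate the belief through the transition model.
    b_pred = b @ T[a]  # shape (S,): sum_s p(s' | s, a) b(s)
    # Correction step: weight each predicted state by the observation likelihood.
    lik = np.exp([obs_loglik(z, s, a, z_prev) for s in range(len(b))])
    b_new = lik * b_pred
    return b_new / b_new.sum()  # normalization, Equation 2
```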
Finally, a POMDP can be represented as a special case of influence diagrams (Luque & Straub, 2019; Morato, Papakonstantinou, Andriotis, Nielsen, & Rigo, 2022), which form a class of probabilistic graphical models. Figure 1 illustrates the influence diagram for the POMDP here considered. Circles and rectangles correspond to random and decision variables, respectively, while diamonds correspond to utility functions (Koller & Friedman, 2009). Shaded shapes denote observed variables, while edges encode the dependence structure among variables.

The railway maintenance problem
We apply and test the proposed methodology on the problem of optimal maintenance planning for railway infrastructure assets, on the basis of regularly acquired monitoring data. The railway track comprises various components, such as rails, sleepers, and ballast, which are exposed to harsh environments and high operating loads, leading to accelerated degradation.
Among these infrastructure components, the substructure plays a particularly important role in this degradation process. The substructure undergoes repeated loading from the superstructure (tracks, sleepers, and ballast), prevents soil particles from rising into the ballast, and facilitates water drainage.
A weakened substructure typically results in distortions of the track geometry. Tamping (Audley & Andrews, 2013), a maintenance procedure that uses machines to compact the ballast underneath the railway track, restoring its shape, stability, and drainage system, is often applied when the substructure condition is considered moderately deteriorated. However, in case of poor substructure condition, such as intrusion of clay or mud or water clogging, tamping provides only a short-term remedy, and replacing the superstructure and substructure is the most appropriate long-term solution. The optimization of maintenance decisions for these critical infrastructure components benefits from information that is additional to the practice of scheduled visual inspections, which are typically conducted on-site by experts.
Such additional information can be delivered from monitoring data derived by diagnostic vehicles. In this work, we specifically exploit the fractal values, a substructure condition indicator extracted from the longitudinal level, which is measured by a laser-based system mounted on a diagnostic vehicle, to guide decisions for substructure renewal. The longitudinal level represents the deviations of the rail from a smoothed vertical position (Wang, Berkers, van den Hurk, & Layegh, 2021). On the basis of this measurement, the fractal values can be computed via appropriate filtering and processing steps. The fractal value indicator describes the degree of "roughness" of the track at varying wavelength scales. For the interested reader, the detailed steps of the fractal value computation are reported in Arcieri et al. (2023); Landgraf and Hansmann (2019). In particular, long-wave (25-70 m) fractal values, which are employed in this work, have shown a significant correlation to substructure damage (Hoelzl et al., 2021), and are used by railway authorities as an indicator which can instigate repair/maintenance actions, such as tamping. In this work, we use actual track geometry measurements, carried out via a diagnostic vehicle of the SBB between 2008 and 2018 across Switzerland's railway network. The track geometry measurements were collected twice a year for the investigated portion of track. The fractal values are computed every 2.5 m from the measured longitudinal level. The performed maintenance actions have been logged for the analysed tracks over the same considered period. These logs contain information on the maintenance, repair, or renewal actions taken on a section of the network at a specific date.
We model the railway track maintenance optimization with a POMDP scheme, relying on diagnostic vehicle measurements of long-wave fractal values.
The true but unobserved railway condition is discretized into 4 hidden states, s_0, s_1, s_2, and s_3, reflecting various grades, from perfect to highly deteriorated state. This is chosen to coincide with the number of grade levels assumed by the Swiss Federal Railways for classifying substructure condition. It should be noted that, in the POMDP inference setting, the number of hidden states is not fixed. To this end, we evaluated further possible dimensions of the hidden states vector, as part of the POMDP inference presented in the next section; a dimension of four yielded improved convergence and better-defined distributions. The fractal values are assumed as the (uncertain) POMDP observations, which correlate with the actual state of the substructure, but offer only partial and noisy information thereof. Unlike classical POMDP modeling of optimal maintenance planning problems, where observations are usually discrete, fractal values comprise (negative) continuous values, rendering the considered POMDP inference and solution quite complex. The problem definition is supplemented with information on the available maintenance actions. Three possible actions are considered, corresponding to the real-world setting, namely action a_0 do-nothing, and the aforementioned tamping and replacement actions, denoted as a_1 and a_2, which can be interpreted as a minor and a major repair, respectively. The fractal value indicators are derived via measurements of the diagnostic vehicle every 6 months, which thus represents the time-step of the decision-making problem. Considering the almost 10 years of collected measurements, our real-world dataset is ultimately composed of time-series of 20 fractal values, per considered railway section, complete with information on respective maintenance actions (with "action" do-nothing included), i.e., (z_0, a_0, ..., a_19, z_20). Finally, the (negative) rewards representing costs associated with actions and states have been elicited from SBB and are reported in Table 1 in general cost units.

POMDP inference
To formulate the POMDP problem, the transition dynamics and the observation generating process must be inferred. In the RL context, the POMDP inference is necessary to generate samples for the policy learning, for inference of a belief over the hidden states, and/or for testing purposes. To tackle this key issue, we propose an MCMC inference of an HMM conditioned on actions, which jointly estimates parameter distributions of both the POMDP transition and observation models based on available data. While we implement the proposed scheme on the problem of railway maintenance planning based on fractal value observations, its applicability is general. Therefore, we further suggest possible extensions to help researchers and practitioners tailor the POMDP model inference to the problem at hand. In addition, we provide a complementary tutorial 1 illustrating the code implementation on various simulated case-studies, in order to support exploitation for real-world applications.
1 Code available on GitHub.
In the context of discrete hidden states and actions, the transition dynamics are modelled via Dirichlet distributions:

$$T_0 \sim \mathrm{Dirichlet}(\alpha_0), \qquad T \sim \mathrm{Dirichlet}(\alpha_T), \qquad (4)$$

where T_0 are the parameters of the probability distribution of the initial state s_0, and α_0 and α_T are the prior concentration parameters. T_0 can be assigned a uniform flat prior α_0, unless some prior knowledge on the initial state distribution is available. By contrast, it is beneficial to regularize T with informative priors α_T, which encode the expected structure of the deterioration or repairing process. For example, the transition matrix related to the action do-nothing, which describes the deterioration process of the system, can be regularized with higher prior probabilities on the diagonal and on the upper-right triangle, and near-zero probabilities on the lower-left triangle. Likewise, the transition matrices associated with maintenance actions would present higher prior probabilities on the lower-left triangle and near-zero probabilities on the upper-right triangle, in order to inform the model that a repair action is expected to be followed by improvements of the system; a sketch of such a prior specification is given after the next paragraph.
The dimensionality of the Dirichlet distribution that models the transition dynamics T is S × S × A, namely one transition matrix per action. The extension to time-dependent transition dynamics is straightforward, by enlarging the distribution by a further dimension representing time, i.e., S × S × A × H.
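As an illustration of such a prior specification, the concentration parameters can be constructed as follows. This is a hedged sketch with hypothetical concentration values for a 4-state, 3-action problem, not the values used in our inference:

```python
import numpy as np

S, A = 4, 3  # 4 condition states; actions: a0 do-nothing, a1 tamping, a2 renewal

alpha_T = np.ones((A, S, S))
# a0 (do-nothing): deterioration -- concentration mass on the diagonal and on
# the upper-right triangle, near-zero mass on the lower-left triangle.
alpha_T[0] = np.triu(np.full((S, S), 2.0)) + np.eye(S) * 4.0 + 0.1
# a1, a2 (repairs): improvement -- mass on the diagonal and lower-left triangle,
# near-zero mass on the upper-right triangle.
for a in (1, 2):
    alpha_T[a] = np.tril(np.full((S, S), 2.0)) + np.eye(S) * 2.0 + 0.1

alpha_0 = np.ones(S)  # flat prior on the initial state distribution T_0
```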
In the context of continuous observations, the observation generating process can differ on the basis of whether the observation follows a deterioration or a repairing process. In addition, similarly to the inference of the first hidden state according to T_0, an initial observation process can be necessary to model the first observation. Tailored to the nature of the fractal value monitoring data, the initial, deterioration, and repairing processes are modelled via Truncated Student's t processes, as follows:

$$\begin{aligned} z_0 &\sim \mathrm{TruncStudentT}\left(\nu_{0|s_0},\ \mu_{0|s_0},\ \sigma_{0|s_0},\ ub\right), \\ z_t &\sim \mathrm{TruncStudentT}\left(\nu_{d|s_t},\ \mu_{d|s_t},\ \sigma_{d|s_t},\ ub\right) \quad \text{if } a_{t-1} = a_0, \\ z_t &\sim \mathrm{TruncStudentT}\left(\nu_{r|s_t},\ k_{r|a_{t-1}}\, z_{t-1} + \mu_{r|s_t},\ \sigma_{r|s_t},\ ub\right) \quad \text{if } a_{t-1} \in \{a_1, a_2\}, \end{aligned} \qquad (5)$$

where ub stands for "upper bound", and all parameters governing the processes are assigned priors described in Arcieri et al. (2023).
The graphical model of the entire HMM is reported in Figure 2. The MCMC inference is run on a final dataset of 62 time-series with the No-U-Turn Sampler (NUTS) (Hoffman, Gelman, et al., 2014). Four chains are run, with 3,000 samples collected per chain. The inference results, which present good post-inference diagnostic statistics, with no divergences and high homogeneity between and within chains, are reported in Figures A1-A6 in Appendix A.
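One practical detail is that NUTS, being a gradient-based sampler, operates on continuous parameters only, so the discrete hidden states must be marginalized out of the likelihood. A minimal sketch of the resulting action-conditioned forward algorithm, written in log space and in plain NumPy for illustration (the actual implementation relies on a probabilistic programming framework), is:

```python
import numpy as np
from scipy.special import logsumexp

def hmm_loglik(z, a, log_T0, log_T, obs_loglik):
    """Marginal log-likelihood of one observation/action sequence, with the
    hidden states summed out via the forward algorithm (action-conditioned).

    z          : observations, length n
    a          : actions; a[t] is taken between observations z[t] and z[t+1]
    log_T0     : log initial state probabilities, shape (S,)
    log_T      : log transition model, shape (A, S, S)
    obs_loglik : callable (t, s) -> log p(z[t] | s, a[t-1], z[t-1])
    """
    S = log_T0.shape[0]
    # Initialization with the initial state and observation processes.
    log_alpha = log_T0 + np.array([obs_loglik(0, s) for s in range(S)])
    for t in range(1, len(z)):
        # Propagate through the transition matrix of the action taken at t-1,
        # then weight by the observation likelihood at time t.
        log_alpha = logsumexp(log_alpha[:, None] + log_T[a[t - 1]], axis=0) \
                    + np.array([obs_loglik(t, s) for s in range(S)])
    return logsumexp(log_alpha)
```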

RL for POMDP solution
POMDP problems have been tackled via deep RL with common methods augmented with LSTM architectures and a history of past observations (and possibly actions) as inputs (Meng et al., 2021; Zhu et al., 2017). More recently, motivated by the breakthrough success of Transformers over LSTMs in natural language processing, Parisotto et al. (2020) designed a new transformer architecture, namely GTrXL, which yielded significant improvements in terms of performance and robustness over LSTMs on a set of partially observable benchmarking tasks. A main advantage of GTrXL is the capability to vary the dimensionality of the input over time. While LSTMs generally require a fixed window of h past observations, requiring the use of dummy observations in the first h − 1 decision time-steps, the GTrXL can at every time-step base its decisions on the entire history of past observations (and actions). Both LSTM and GTrXL architectures compose fully model-free deep RL solutions to POMDPs. A third modeling option, which comprises a model-based/model-free hybrid solution, pertains to the transformation of the POMDP problem into the belief-MDP by computing beliefs via Bayes' theorem (Equation 1). The belief-MDP is then solved via classical deep model-free RL methods with feed-forward NNs (Andriotis & Papakonstantinou, 2019; Morato et al., 2023). We here compare the performance of the two model-free solutions and the hybrid solution, referred to as the "belief-input" case, on the real-world POMDP problem of railway maintenance planning presented in Section 3, with parameter inference described in Section 4. While Parisotto et al. (2020) demonstrate the superiority of Transformers over LSTMs on simulated tasks, our work offers a further comparison of the two methods, and confirms the superiority of the former, on a real-world problem that is stochastic (both in the transition dynamics and in the observation generating process) and partially observable.
For this comparison, we set the POMDP parameters to the mean values of the distributions reported in Appendix A, in order to evaluate the methods without model uncertainty, with the latter case tackled in the next section. For all modeling options, the policy is learned via the Proximal Policy Optimization (PPO) algorithm with clipped surrogate objective (Schulman, Wolski, Dhariwal, Radford, & Klimov, 2017). The overall evaluation algorithm is reported in pseudocode format in Algorithm 1. In addition, the code of the experiment is made available online 2. We consider 50 time-steps, i.e., 25 years (1 time-step equals 6 months), as the decision horizon H of the problem, as discussed with our SBB partners.
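For reference, the clipped surrogate objective at the core of PPO (Schulman et al., 2017) takes the following form. The sketch below (PyTorch, hypothetical tensor names) shows the policy loss term only, omitting the value and entropy terms:

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate policy loss (Schulman et al., 2017).

    log_probs_new : log pi_phi(a|s) under the current policy
    log_probs_old : log probabilities stored at collection time (no gradient)
    advantages    : advantage estimates, e.g., computed via GAE
    """
    ratio = torch.exp(log_probs_new - log_probs_old)  # importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # The objective is maximized, so we minimize its negation.
    return -torch.min(unclipped, clipped).mean()
```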
For all methods, the policy networks are updated every 4,000 training time-steps. Every 5 updates, 500 evaluation episodes are run with different random seeds, in order to average the results over the stochasticity of the environment.
In addition, the entire analysis is repeated a second time (with a different random seed) to further average the results over the stochasticity of the NN training. Grid-searches are performed over the hyperparameters for all methods, and the selected values are reported in Table B1 in Appendix B.

2 Code available on GitHub.

Algorithm 1 (excerpt): update the policy π_φ with PPO and the replay buffer D; every 5 updates, run 500 policy evaluation episodes without exploration.

The average performance over 250 evaluation iterations (5 million training time-steps) is plotted in Figure 3. Along with the three evaluated methods, two additional benchmarking solutions are reported. The first option refers to the Q_MDP method (Littman, Cassandra, & Kaelbling, 1995), which constitutes a POMDP solution based on DP, and which turns out to be an effective solution for the characteristics of this problem (Arcieri et al., 2023). The second option is the optimal MDP solution, namely the optimal policy computed and evaluated on the underlying MDP, i.e., when the hidden states are fully observable.
The latter constitutes an upper bound to any POMDP solution, which cannot be exceeded, given the irreducible inherent uncertainty of the observations, and serves as a benchmarking reference.
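The Q_MDP heuristic solves the underlying MDP exactly and then acts greedily on belief-weighted action values. A minimal sketch (value iteration on the mean transition model; not the exact implementation used in our experiments) is:

```python
import numpy as np

def qmdp_values(T, R, gamma, n_iter=1000):
    """Compute Q_MDP action values (Littman et al., 1995) on the underlying MDP.

    T : transition model, shape (A, S, S); R : rewards, shape (S, A).
    Returns Q of shape (S, A).
    """
    A, S, _ = T.shape
    Q = np.zeros((S, A))
    for _ in range(n_iter):
        V = Q.max(axis=1)  # state values under the greedy policy
        # Bellman backup: Q(s, a) = R(s, a) + gamma * sum_s' T[a, s, s'] V(s')
        Q = R + gamma * np.einsum("ast,t->sa", T, V)
    return Q

def qmdp_action(b, Q):
    # Act greedily on belief-weighted action values: argmax_a sum_s b(s) Q(s, a).
    return int(np.argmax(b @ Q))
```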
The belief-input method outperforms the other two model-free RL solutions and already shows strong performance in the first evaluation iterations.
The method converges to the best policy within a few iterations, as shown in the zoomed-in view of the first 70 evaluation iterations in the lower-left figure inset, matching the Q_MDP method within few policy updates. Because the number of training time-steps evaluated may not be sufficient for convergence of the other two model-free RL methods, we continue training up to 2,000 evaluation iterations (40 million training time-steps). This could, however, negatively impact the performance of the belief-input method, which has already converged and may begin to suffer from overfitting. The extended training is reported in Figure 4, where a rolling average window of 5 steps is further applied for illustration purposes.

Fig. 3: Comparison of the performance of LSTM (green), GTrXL (orange), and the belief-input case (blue) over 250 evaluation iterations. At every iteration, 500 trial episodes are evaluated with different random seeds and the average results are returned. The entire analysis is repeated for a second random seed and the average performance is plotted. An evaluation iteration is run after 5 policy updates and a policy update is performed every 4,000 training time-steps, for a total of 5 million time-steps. The performance is further benchmarked against the Q_MDP method (dashed red) and the optimal MDP policy (dashed yellow). In the lower-left corner, a zoomed-in plot of the belief-input performance over the first 70 evaluation iterations is shown.
As expected, the performance of the belief-input method slightly decreases over time. The GTrXL proves to deliver a better architecture than the LSTM for POMDP applications, also in this particular application to a real-world problem. The GTrXL is indeed less affected by variance and eventually converges to a better policy, albeit still far from the Q_MDP benchmark and the best policy obtained with the belief-input method.

Fig. 4: Comparison of the performance of LSTM (green), GTrXL (orange), and the belief-input case (blue) over 2,000 evaluation iterations, for a total of 40 million training time-steps. The performance is further plotted with a rolling average window of 5 steps for displaying purposes.
Finally, for all three methods, we saved the best models encountered in the evaluations during training, and assessed the learned policies over 100,000 trials.
The results are reported in Table 2 in terms of average performance, Standard Error (SE), best (Max) and worst (Min) trial. In the table, the belief-input case average performance is close to, but slightly worse than, the Q_MDP method. This is likely due to the fact that the best model was picked based on an average over 500 trials, which is still subject to a significant standard error.

Domain randomization for robust solution

Further to the challenge of POMDP inference, another key issue is the robustness of the deep RL solutions. RL methods generally learn an optimal policy by interacting with a simulator. When the trained RL agent is deployed to the real world, the performance can deteriorate, or the policy can altogether fail, due to the "simulation-to-reality" gap (Salvato, Fenu, Medvet, & Pellegrino, 2021; Zhao, Queralta, & Westerlund, 2020), if the solution is not robust to model uncertainty.
In Arcieri et al. (2023), we propose a framework in combination with the POMDP inference to enhance the robustness of DP solutions to model uncertainty. Namely, the POMDP parameter distributions inferred via MCMC sampling are incorporated into the solution by merging DP algorithms with Bayesian decision making. In Bayesian decision theory (Berger, 2013), given a utility function U(θ, a) that maps possible outcomes to their utility, the parameters θ of the problem, and some decision a, the Bayesian optimal action is the one which maximizes the expected utility with respect to parameter uncertainty:

$$a^* = \operatorname*{arg\,max}_{a} \; \mathbb{E}_{p(\theta \mid D)}\left[U(\theta, a)\right]. \qquad (6)$$

In Arcieri et al. (2023), we incorporate DP methods into Equation 6 to derive solutions that maximize the expected value with respect to the entire model parameter distributions, hence rendering the solution robust to model uncertainty.
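A Monte Carlo approximation of Equation 6 over the MCMC posterior samples makes this operational. A minimal sketch (with a hypothetical utility function) is:

```python
import numpy as np

def bayes_optimal_action(theta_samples, actions, utility):
    """Monte Carlo version of Equation 6: pick the action that maximizes the
    expected utility over posterior samples theta_i ~ p(theta | D).

    theta_samples : list of parameter sets drawn via MCMC
    actions       : candidate decisions
    utility       : callable (theta, a) -> scalar utility
    """
    expected_U = [np.mean([utility(th, a) for th in theta_samples])
                  for a in actions]
    return actions[int(np.argmax(expected_U))]
```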
In this work, we bring this framework into the RL training scheme. The overall framework is illustrated in Figure 5. We showcase the implementation of this framework with the belief-input method, but it is also applicable with the other methods reported in Table 1, with the only difference that the POMDP parameters θ are sampled at every episode from the inferred posterior distributions p(θ | D). The policy updates are again performed every 4,000 training time-steps and an evaluation iteration is run every 5 policy updates. Similarly to Figure 4, the performance during training is averaged at each evaluation iteration over 500 episodes with different random seeds. The analysis is then repeated for a second random seed to also average over the stochasticity of the NN training. The resulting average performance is plotted in Figure 6. Given the more challenging learning task, owing to model uncertainty, the average training performance decreases and demonstrates a higher variance than the belief-input performance without domain randomization, shown in Figure 4. For this case, the hyperparameter tuning was also restricted to a minimal grid-search. While the results are already satisfactory, the RL agent performance can likely be further increased via a more thorough hyperparameter optimization.
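In practice, domain randomization amounts to re-sampling the POMDP parameters from the posterior at every episode reset. A minimal sketch (hypothetical environment wrapper with a Gym-style interface, assuming the base environment exposes a set_params method) is:

```python
import numpy as np

class RandomizedPOMDPEnv:
    """Gym-style wrapper that samples a new POMDP model theta ~ p(theta | D)
    at every reset, with the posterior represented by its MCMC samples.
    Hypothetical class, for illustration only.
    """
    def __init__(self, base_env, posterior_samples, rng=None):
        self.env = base_env
        self.samples = posterior_samples  # list of parameter dicts from MCMC
        self.rng = rng or np.random.default_rng()

    def reset(self):
        # Draw one posterior sample and configure the environment with it.
        theta = self.samples[self.rng.integers(len(self.samples))]
        self.env.set_params(theta)  # e.g., transition/observation parameters
        return self.env.reset()

    def step(self, action):
        return self.env.step(action)
```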
Again, the best-performing models observed in the evaluations during training are saved, and the learned policy is evaluated over 100,000 simulations. The results are shown in Table 3 and compared against the robust Q_MDP policy described in Arcieri et al. (2023) and the upper-bound optimal MDP policy evaluated with full observability, both assessed under model uncertainty. In addition, we report the result of the best model of the RL agent from the previous analysis, namely with the policy optimized without model uncertainty incorporated into the training (i.e., no domain randomization), evaluated now in the context of model uncertainty. This further analysis resembles a real-world deployment, where the environment parameters can differ from those inferred, inducing the aforementioned simulation-to-reality gap. The performance of the agent trained with no domain randomization deteriorates, while the agent trained with domain randomization is able to learn and deliver a more robust policy in the context of model uncertainty.

Fig. 6: Performance of the belief-input case (blue) over 250 evaluation iterations with domain randomization, i.e., a different POMDP model is sampled at every episode, both for training and evaluation. At every iteration, 500 trial episodes are evaluated with different random seeds and the average results are returned. The entire analysis is repeated for a second random seed and the average performance is plotted. An evaluation iteration is run after 5 policy updates and a policy update is performed every 4,000 training time-steps, for a total of 5 million time-steps. The performance is further benchmarked against the robust Q_MDP method (dashed red) and the robust optimal MDP policy (dashed yellow), evaluated under model uncertainty as in Arcieri et al. (2023).

Conclusion
This work tackles two key issues relating to the adoption of RL applications in real-world partially observable planning problems. First, a POMDP model, which enables the RL training via simulations, is often unknown and generally non-trivial to infer, with unified best practices not available in the literature. This constitutes a main obstacle against broad adoption of the POMDP scheme and its solution methods for real-world applications. Second, RL solutions often lack robustness to model uncertainty and suffer from the simulation-to-reality gap.
In this work, we tackle both issues via a combined framework for inference and robust solution of POMDPs based on deep RL algorithms. The POMDP inference is carried out via MCMC sampling of an HMM conditioned on actions, which jointly estimates the full distributions of plausible values of the transition and observation model parameters. Then, the parameter distributions are incorporated into the solution via domain randomization of the environment, enabling the RL agent to learn a policy that is optimized over the space of plausible problem parameters and is, thus, robust to model uncertainty.
We compare three common RL modeling options, namely a Transformer-based and an LSTM-based approach, which constitute model-free RL solutions, and a hybrid belief-input case. We implement our methods for the optimal maintenance planning of railway tracks based on real-world monitoring data. While the Transformer delivers generally better performance than the LSTM, both methods are significantly outperformed by the hybrid belief-input case. In addition, we demonstrate on the latter method that an RL agent trained with domain randomization learns a policy that is more robust to model uncertainty than that of an RL agent trained without domain randomization.
A possible limitation of this work is that, while our methods allow for the incorporation of rather complex extensions, e.g., time-dependent dynamics and hierarchical components, and are here demonstrated on the quite difficult case of continuous observations, the POMDP inference under continuous multidimensional states and actions is still to be investigated. Future work will focus on the development of methods that can scale to these cases, e.g., via coupling with deep model-based RL methods (Arcieri, Wölfle, & Chatzi, 2021).
Fig. A2: Transition matrix related to action a_1 (tamping). The distribution at row i and column j is associated with the probability to transition from state i to j when action a_1 is taken. Deterioration of the system (upper-right triangle) reflects an almost zero probability, while it appears most probable to remain in the same condition or improve by a maximum of one state, which reflects the reduced influence of this action.
A.2 Observation model parameters

Fig. 2: A graphical model of the HMM inference. Arrows indicate dependencies, while shaded nodes indicate observed variables.

Fig. 5: The POMDP inference and robust solution framework via domain randomization and deep reinforcement learning.

Finally, Figure 7 illustrates two trials of the maintenance actions planned by the belief-input model trained with domain randomization.

Fig. 7: Two trials of the maintenance actions planned by the belief-input model trained with domain randomization. From bottom to top: the observations (fractal values); the beliefs, namely a probability distribution over hidden states, computed via Bayes' formula and fed to the policy networks; the true hidden states, which are not accessed by the agent and/or the model; the actions planned by the RL agent.

Fig. A3: Transition matrix related to action a_2 (renewal plus tamping). The distribution at row i and column j is associated with the probability to transition from state i to j when action a_2 is taken. Transition to the best possible state s_0 is consistently assigned the highest probability, regardless of the starting state, reflecting the higher repairing effect of this maintenance action.
(a) Posterior distributions of state-dependent parameters µ_{d|s_t}. (b) Posterior distributions of state-dependent parameters σ_{d|s_t}. (c) Posterior distributions of state-dependent parameters ν_{d|s_t}.
(a) Posterior distributions of state-dependent parameters µ_{r|s_t}. (b) Posterior distributions of state-dependent parameters σ_{r|s_t}. (c) Posterior distributions of state-dependent parameters ν_{r|s_t}. (d) Posterior distributions of the autoregressive parameters k_{r|a_t} for a_1 (left) and a_2 (right).

Table 1: Costs of the POMDP model.

Table 2: Performance of the best models inferred during the training process, evaluated over 100,000 simulations.

Table 3: Performance of the best models during training, evaluated over 100,000 simulations in the context of model uncertainty with domain randomization. In particular, we report on the evaluation of the belief-input agent trained with (DR) and without Domain Randomization (no DR). The former achieves a significantly improved and more robust policy.