1 Introduction

With the digitalisation of rail infrastructure, an increasing amount of data is becoming available, and automated tools based on artificial intelligence (AI) techniques are under development to extract information from these data [1,2,3,4,5].

Different definitions of AI exist [6], since they evolve with the goals pursued by an AI system, e.g. imitating human behaviour or using human reasoning as a model [7]. AI can be defined as “the ability of a digital computer or computer-controlled robot to perform tasks commonly associated with intelligent beings”. AI techniques are the methods, algorithms and approaches enabling systems to perform tasks commonly associated with intelligent behaviour, e.g. machine learning and evolutionary computing [8, 9]. In the literature (Bésinovic et al. [10], Ghofrani et al. [11], Yin et al. [12]), the aspects considered crucial for AI applications in the railway domain are: 1) the capability to accomplish tasks that would require human intelligence (e.g. decision-making); 2) the capability to take into account uncertainties and/or unexpected scenarios (e.g. machine learning models for data-driven predictive maintenance); 3) the ability to learn from experience and take autonomous decisions in uncertain scenarios.

Machine learning (ML) is at the core of many approaches to artificial intelligence; according to Yong et al. [13], a large part of AI for railways today is based on ML. In particular, ML is a branch of the broader field of AI that uses statistical models to identify anomalies and develop predictions [14, 15]. However, AI tools also include other techniques, such as search algorithms, mathematical optimisation, evolutionary computation, logic programming, automated reasoning, and probabilistic methods such as Bayesian networks and Markov models [4].

In this context, there is a high demand for a step change in asset management (AM) [16] to be delivered through innovative data-driven technologies and AI techniques [17,18,19,20,21,22].

This innovation of asset management strategy and techniques is particularly challenging in the railway sector where complex systems are integrated to achieve high safety and reliability standards. Kumari et al. [19, 20] develop and propose a concept for augmented asset management for railway assets, which involves the augmentation of AM with advanced analytics, based on digitalization and AI techniques, to provide augmented decision support for fleet management. McMahon et al. [21] analyse requirements and challenges for big data analytics applications to railway asset management and recommend that the research efforts should be directed to define potential data-driven analytics frameworks, and to integrate different data-driven approaches for condition and failure monitoring and decision support.

As stated by [22], detecting defects of the railway infrastructure at an early stage of defect development can reduce the risks of railway operation, lower the cost of maintenance, and make asset management more efficient.

To this aim, condition-based [23,24,25] and predictive maintenance approaches [26,27,28,29,30] were studied to evaluate rail asset current and future status, exploiting data collected in real-time by new monitoring technologies and sensors, installed on board trains and wayside.

Many initiatives for the innovation of AM approaches, through digitalisation and AI, are ongoing in the rail sector within the Shift2Rail and Europe’s Rail research framework with projects such as IN2SMART [31], IN2SMART2 [32], IAM4RAIL [33], IN2DREAMS [34], DAYDREAMS [35], and RAILS [36].

Moreover, a railway system is usually composed of a number of interdependent subsystems, and a failed component or subsystem may affect the system performance differently, according to its function. Therefore, the evaluation of the status and criticality of subsystems and components is crucial for maintenance managers and operators [37]. Maintaining the most degraded asset first is not always the best choice if, for example, the asset is redundant or its failure does not affect service availability. For this reason, AM decision support systems should include criticality considerations, besides the asset status evaluation, to identify the best AM strategy.

The goal is to achieve a prescriptive approach which is not only able to answer questions like “What is happening?” (the condition-based approach), or “What will happen?” (the predictive approach), but it can also provide answers to questions like “What could be done?” and “What are the best options?”, optimising, under context-specific asset management constraints, preferences and targets of railway stakeholders [11, 38,39,40].

Nevertheless, the decision-making in rail asset management is still too often based on the classical asset-oriented approach, which concentrates on the function of the asset itself as a main key performance indicator (KPI), whereas a user-oriented approach could lead to improved performance in terms of level of service.

Service-based asset management concentrates on provided services and serviceability of assets by applying service-level KPIs in order to reflect users’ perspectives in decision-making [41].

The availability of data on passengers’ transport demand collected by different systems, such as automated passenger counting (APC) and automated fare collection (AFC) systems, makes it possible to extract important information to be used in many service-based decision-making processes.

Recent studies [42, 43] focus on the impact on passengers of service interruptions and delays caused by rail asset failures and maintenance, investigating and quantifying how disruptions to rail services are perceived by passengers [42], and proposing mitigation measures [43]. However, the passengers’ perspective is usually not integrated in the decision-making framework for asset management. This leads to strategies that may keep assets in a high-quality condition while significantly affecting the passengers and the service, for example by planning maintenance on days characterised by a peak of transport demand. Including passenger flow prediction as an input of the decision-making process would help avoid these situations, since maintenance tasks that require temporary service disruptions or track closures can be scheduled during off-peak time intervals. In addition, it would be possible to reduce the number of critical failures during a peak demand period by planning preventive maintenance before the peak, ensuring a good condition of the assets when they are needed the most. Finally, maintenance strategies could be updated based on changing passenger behaviour, guaranteeing that maintenance plans remain aligned with the most up-to-date rail line usage patterns.

The purpose of this paper is to fill this gap and take some steps towards service-based asset management in the rail sector. In addition, the present study aims to answer the research needs related to the definition of data-driven analytics frameworks and the integration of different data-driven approaches for condition and failure detection and maintenance decision support.

The addressed problem is the definition of a prescriptive maintenance approach able to suggest the needed predictive maintenance activities and the best order to perform those interventions according to selected KPIs and targets for the infrastructure manager.

The objective of the study is to develop a data-driven framework to prioritise predictive maintenance interventions according to asset status and criticality.

The prediction of rail assets’ status is performed based on an anomaly detection technique, and the capability of the proposed approach to predict failure and avoid corrective interventions, suggesting predictive interventions, is evaluated.

The prioritisation approach is able to represent different decision-making criteria, including service-level targets, and to assign different weights to the criteria according to their importance for the infrastructure manager.

The assumption is that the asset criticality can be decomposed into two terms:

  • a static term related to the type of asset and its specific function and types of failure;

  • a dynamic term related to the service condition in the considered time period.

While the static term is usually provided by the maintenance expert or the rail system designer, the dynamic term is evaluated considering the impact of failure on the service and the involved passengers according to the asset position and the utilisation rate of the different line sections over time. To achieve this objective, a model for the prediction of passenger flow at stations is proposed. This information is exploited in combination with the prediction of asset status to prescribe the needed predictive maintenance activities and to suggest the optimal sequence of interventions to the railway infrastructure manager. The ranking of the possible maintenance options is performed according to defined KPIs. Different options of maintenance ordering can be proposed to the infrastructure manager and the related KPIs can support the infrastructure manager in making the final decision.

The approach deals with a tactical level of decision making, providing a prioritisation of predictive maintenance interventions. The detailed scheduling of the maintenance activities with their allocation to maintenance time windows and work teams is out of the scope of this paper and represents a consecutive decisional phase that will use the prioritisation as an input.

The considered maintenance activities are predictive maintenance work orders generated by the anomaly detection model. The anomaly can lead to a predictive maintenance work order if it is detected sufficiently in advance, leaving enough time to organise the intervention.

Corrective maintenance interventions are neglected, but they can be integrated in the prioritisation list as activities with the highest priority to be executed as soon as possible.

Moreover, it is worth saying that the scope of the work is to show how different data-driven models and AI tools could be integrated in a data-driven framework for decision-making, including the passengers’ perspective in the maintenance planning. The identification of the best data-driven models for the considered data sets and the comparison with other existing methods is out of the scope of the paper.

The approach is suitable to estimate the predictive maintenance interventions needed in the upcoming week, considering a weekly time horizon.

2 Literature review

The proposed approach exploits different sources and types of data, linking three different models within a unique data-driven framework, to extract important knowledge to support maintenance decision-making.

In this section, the analysis of existing studies is presented considering the aims of the three models developed in this work: passenger flow prediction, rail assets status evaluation, and maintenance planning and decision support.

2.1 Passenger flow prediction

Considering the prediction of passenger flows and vehicle occupancy, several studies have addressed the problem of predicting the number of passengers on public transport vehicles, including trains and metro lines; they mainly differ in the type of data used and the prediction model adopted.

Regarding the prediction framework, the cyclical component of transport demand concerning the time of the day and the day of the week favored classical statistical prediction techniques based on time-series forecasting methods, such as autoregressive integrated moving average (ARIMA) and Kalman filters [44,45,46].

However, since the changes in traffic data are nonlinear in nature, all the above-mentioned models are limited in their performance and application due to the assumption of linearity [47].

Therefore, research has shifted more towards machine learning and deep-learning techniques that can model the non-linearity and the feature interactions. Liu et al. [47], Liu and Chen [48], Wang et al. [49], and Baek and Sohn [50] have successfully applied methods based on neural networks and deep learning, whereas Samaras et al. [51], Ding et al. [52], Vandewiele et al. [53], and Gallo et al. [54] have found that neural networks need large quantities of data to perform well, and are outperformed in their cases by tree-based algorithms like random forests, gradient boosted decision trees, and Bayesian based models.

Jenelius [55] applied lasso, stepwise regression, and boosted tree ensembles to predict passenger numbers on metro lines both at the stations and on the trains, testing the effectiveness of different typologies of data.

Few papers have tried to solve the passenger prediction problem using Markov chain-based methods, even though they have been proven to provide promising results, since they can capture the dependencies between the crowding of temporally close public transport services [56]. In addition, this prediction method has the advantage of requiring only recent data, without the need for large historical datasets.

In this paper, a Markov chain Monte Carlo (MCMC) technique is applied to estimate the passenger flow exploiting ticketing data. The technique is exploited within a wider framework, whose final aim is the evaluation of asset criticality.

The choice of the MCMC technique is motivated by its adequate performance with data sets of limited size, compared with approaches such as Bayesian network methods and neural networks, which need a higher volume of data to provide good results.

Table 1 compares existing studies on passenger flow prediction.

Table 1 Comparison of the selected passenger flow prediction studies

2.2 Rail assets status evaluation

Regarding the second main goal of the present work, the asset status evaluation, different studies exist in the literature. In particular, several machine learning methods, including artificial neural networks, support vector machines and random forests have been used to evaluate the asset degradation status by analysing data coming from the equipment [35]. These data can be used both to identify equipment faults and to predict potential asset failures.

Thanks to the digitalisation of the sector, analytics can be used to manage large quantities of data and derive a prediction of the asset status [11, 57]. Pipe and Culkin [58] developed a data-driven model able to forecast the status of rail assets in order to achieve predictive maintenance strategies.

Machine learning can be divided into supervised, semi-supervised and unsupervised learning approaches [59]. Supervised models, such as artificial neural networks, support vector machines (SVM) and Bayesian networks, applied to fault detection, diagnosis and prediction, stand as an interesting option in operations with high maturity, whereby most possible faults are already mapped, measured and available for use to train models.

For example, Li and He [60] used a random forest based supervised methodology to predict the status and the remaining useful life of railcars combining multiple data sources. Niu et al. [61] used an adaptive pyramid graph method to detect anomalies in rail surface using images. Shim et al. [62] used deep learning to detect anomalies of wheel flats exploiting processed flat wheel signals. Li et al. [63] applied a supervised SVM technique that effectively uses large-scale data and provides valuable tools for operational sustainability and alarm prediction in railways. Considering signaling assets and, in particular, track circuits, a solution based on SVM is proposed by Sun et al. [64].

However, in the real world, new faults can occur in unmapped forms, and even a known fault might manifest in different ways. Therefore, although supervised learning might provide reliable results for known cases, these models are limited in the case of unexpected events.

Semi-supervised models for fault detection can be seen as a variation of supervised models, since it is assumed that the training data have labeled instances for only the normal class. Therefore, any observation that deviates from the training data might be classified as a fault.

Conversely, unsupervised learning models do not consider any class variable. Unsupervised models focus on identifying similarities and discriminating clusters of observations with common characteristics [65].

In this paper, a one-class support vector machine (OCSVM) is applied. The choice of the OCSVM technique is motivated by the fact that it is one of the most well-established algorithms for outlier detection and a popular semi-supervised model already applied in the rail sector [66]. The OCSVM model is here integrated within a framework for maintenance prioritisation and planning.

Table 2 summarises existing studies on rail assets status evaluation.

Table 2 Comparison of the selected rail assets status evaluation studies

2.3 Maintenance planning and decision support

In the literature, several methods and works have been proposed to schedule railway maintenance interventions, focusing on predictive maintenance.

Scheduling methods for predictive maintenance are expected to reduce both the maintenance cost and the risk of service disruptions; thus they should consider both the asset status and the asset criticality to identify the intervention priority [67]. In doing so, the most widely used minimisation targets are intervention costs and time duration, so as to maximise the system’s reliability and availability [68,69,70]. Solution approaches are based on heuristics such as tabu search, simpler greedy heuristics, and genetic algorithms [71, 72].

Lopes Gerum et al. [73], Hamshari et al. [74], and Chang et al. [75] include in the scheduling method data related to the asset condition, collected by diagnostic trains and sensors, as well as degradation models to predict the probability of asset failure. Mira et al. [76] integrate the maintenance scheduling into a fleet assignment model to schedule train services. Baglietto et al. [37] and Carretero et al. [77] include risk-based methodologies and take into account the asset criticalities, but they do not incorporate asset condition prediction methods.

However, scheduling optimisation models are NP-hard problems, which can be optimally solved within an acceptable computational time only for small instances, while heuristic algorithms, such as genetic algorithms, can provide sub-optimal solutions also for bigger instances, but require a long time to converge [78].

The ad-hoc prioritisation algorithm proposed in this paper makes it possible to model the specific optimisation criteria of the infrastructure manager and to compare different solutions in a short computational time.

In addition, the ad-hoc optimisation algorithm has been formulated in order to include the outputs from the data-driven models for the evaluation of asset status and asset criticality.

Table 3 compares the existing studies on rail maintenance planning and scheduling.

Table 3 Comparison of the selected maintenance planning studies

Therefore, in this paper, a model for prioritising predictive maintenance interventions and mitigating the impact on service is developed. The goal is two-fold: on one hand, this study aims at predicting future failures based on the detected anomalies and, on the other hand, it focuses on the identification of asset criticality based on the impact that its failure would have on the service. A detected anomaly is considered a true anomaly if it is followed by a failure. Therefore, if a true anomaly is detected sufficiently in advance, a predictive maintenance intervention can be performed, leaving sufficient time to organise the intervention. The objective is to avoid, as much as possible, critical failures and corrective interventions that would imply higher costs and service disruptions.

In this regard, the passenger flow along the line is estimated to identify the line sections with the highest utilisation rate and its variation over time, taking into account periodic fluctuations.

The asset status (i.e., functioning, low degraded, medium degraded, high degraded) is evaluated through a machine learning technique, the one-class support vector machine and a threshold-based approach.

Finally, a prioritisation algorithm is proposed to find the best compromise between different infrastructure manager’s targets and KPIs.

In summary, the innovative aspects of this paper are:

  • addressing a service-based asset management in the rail sector, with the introduction of the impact on final users in the evaluation of maintenance priority;

  • answering the research needs related to the definition of a data-driven analytics framework for maintenance decision-making and the integration of different data-driven approaches;

  • linking different sources of data and different models for data analysis in the decision-making process, moving towards prescriptive maintenance strategies;

  • the evaluation of assets’ status by machine learning algorithms and their clustering according to degradation thresholds, defining predictive maintenance work orders;

  • the representation, in a prioritisation approach, of different decision-making criteria, including service-level targets;

  • the consideration of criticality concept in maintenance prioritisation, in addition to the asset status, to reduce the risk of critical failures and service disruptions;

  • the prediction of passenger flow at stations in order to estimate the involved passengers in case of failure.

3 Methodology

The proposed approach consists of three steps:

  • a Markov chain Monte Carlo technique is applied to evaluate the criticality of the assets according to the impact of their failure on the passengers, taking into account their position along the line, and exploiting the passengers data from the ticketing system of each station;

  • a machine learning algorithm, the one-class support vector machine, is applied to cluster the assets according to their status, based on data collected from the field on events and alarms logs, asset parameters, and maintenance data;

  • an ad-hoc ordering algorithm is developed to prioritise the interventions considering as inputs the results of the previous steps.

The proposed framework is described in Fig. 1. The “asset criticality evaluation” module computes the average passenger flow trend in the different line sections. By considering the asset position and the transport demand, a dynamic criticality value \({\pi p}_{i}^{\tau }\) is computed for each asset \(i\) in the time horizon \(\tau\).

Fig. 1
figure 1

The proposed AI-based three-step approach

In addition, given the type of failure and the impact caused by the failure in terms of duration of service interruption, a static criticality value \({\pi }_{i}\) is defined for each asset \(i\).

The “asset status evaluation” module exploits the data about alarms, parameters, and the maintenance data to assess the asset condition. The output is the list of the assets with an anomalous status, the related maintenance interventions to be done, and their due dates \({DD}_{i}^{\tau }\), computed in the considered time horizon \(\tau\).

Finally, the “asset maintenance prioritisation” module evaluates the best \(m\) sequences with the related KPIs values to be shown to the operator.

3.1 Asset criticality in terms of impact on passengers

As mentioned, a static term of criticality \({\pi }_{i}\) is considered, related to the type of asset and its specific function. It is usually provided by the maintenance expert or the rail system designer, given the data about the types of failures for each asset and their time to repair.

The dynamic term of asset criticality \({\pi p}_{i}^{\tau }\) is, instead, evaluated in terms of impact that the asset failure may cause to the service, taking into account the asset position and the transport demand over time.

In order to evaluate the impact on passengers, a Markov chain Monte Carlo (MCMC) approach has been applied to analyse the data of the passengers entering and exiting each station, which are available from the ticketing systems. The model estimates the passengers in the different sections of the line and their evolution over time.

The Monte Carlo method is a family of computational techniques based on random sampling to obtain numerical results. The goal is to solve problems that might be deterministic by using randomness. These techniques are mainly used for optimisation, numerical integration, and generating draws from probability distributions.

A Markov chain or process is a stochastic model describing a sequence of possible events where the probability of each event depends only on the state attained in the previous event. In such a process, predictions of future outcomes can be made by considering the present state, and such predictions are just as good as the ones that could be made knowing the process's full history.

MCMC methods [79, 80] comprise a class of algorithms for sampling from a probability distribution. By constructing a Markov chain that has the desired distribution as its equilibrium distribution, one can obtain a sample of the desired distribution by recording states from the chain. The more steps are included, the more closely the distribution of the sample matches the actual desired distribution.
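As a minimal illustration of the sampling idea described above, a random-walk Metropolis sampler (the classic MCMC algorithm) can be sketched as follows; the target density, proposal width, and chain length are arbitrary choices for the example, not those used in this study:

```python
import math
import random

def metropolis_hastings(log_target, x0, n_steps, step=0.5, seed=42):
    """Sample from a 1-D distribution given its unnormalised log-density."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step)           # symmetric random-walk proposal
        log_alpha = log_target(proposal) - log_target(x)
        if math.log(rng.random()) < log_alpha:        # accept with prob min(1, ratio)
            x = proposal
        samples.append(x)                             # record chain state after this step
    return samples

# Target: standard normal distribution (log-density up to an additive constant)
chain = metropolis_hastings(lambda x: -0.5 * x * x, x0=0.0, n_steps=20000)
burned = chain[5000:]                                 # discard burn-in
mean = sum(burned) / len(burned)                      # chain average approximates E[x] = 0
```

The longer the chain runs past burn-in, the closer the empirical distribution of `burned` matches the target, which is exactly the property exploited in the text above.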

In this study, the MCMC methodology was implemented to create the model that fits the data. A decomposable time series model is used. The three main model components are trend, seasonality, and holidays, combined as it is shown in Eq. (1):

$$y\left(t\right)=g\left(t\right)+s\left(t\right)+h\left(t\right)+{\epsilon }_{t}$$
(1)

where:

  • \(g\left(t\right)\) is the trend function which models non-periodic changes in the value of the time series;

  • \(s\left(t\right)\) represents periodic changes (e.g., weekly and yearly seasonality);

  • \(h\left(t\right)\) represents the effects of holidays which occur on potentially irregular schedules over one or more days;

  • \({\epsilon }_{t}\) is the error term which represents any idiosyncratic changes that are not accommodated by the model.
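The additive structure of Eq. (1) can be sketched as follows; the component functions here are placeholders chosen for illustration, not the fitted ones:

```python
import math
import random

def trend(t):
    """Placeholder g(t): slow linear growth in daily passengers."""
    return 100.0 + 0.5 * t

def seasonality(t, period=7):
    """Placeholder s(t): weekly cycle (period of 7 days)."""
    return 10.0 * math.sin(2 * math.pi * t / period)

def holiday_effect(t, holidays=frozenset({20, 21})):
    """Placeholder h(t): fixed uplift on hypothetical holiday days."""
    return 25.0 if t in holidays else 0.0

def y(t, rng):
    """Eq. (1): y(t) = g(t) + s(t) + h(t) + eps_t."""
    eps = rng.gauss(0.0, 1.0)          # idiosyncratic error term
    return trend(t) + seasonality(t) + holiday_effect(t) + eps

rng = random.Random(0)
series = [y(t, rng) for t in range(28)]   # four weeks of daily values
```

In the actual model, each component is fitted from the ticketing data rather than fixed in advance; the sketch only shows how the three components and the error term combine additively.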

Two different models for the trend are considered: nonlinear saturation growth in Eq. (2), and linear trend with changepoints in Eq. (3).

$$g\left(t\right)=\frac{C\left(t\right)}{1+\mathit{exp}\left(-\left(k+a{\left(t\right)}^{T}\delta \right)\left(t-\left(m+a{\left(t\right)}^{T}\gamma \right)\right)\right)}$$
(2)
$$g\left(t\right)=\left(k+a{\left(t\right)}^{T}\delta \right)t+\left(m+a{\left(t\right)}^{T}\gamma \right)$$
(3)

where:

  • \(C\) is the carrying capacity;

  • \(k\) the growth rate;

  • \(m\) an offset parameter;

  • \(a(t)\) defined in Eq. (4) and \(\gamma\) in Eq. (5) model the trend changes.

Trend changes are incorporated in the growth model by explicitly defining changepoints where the growth rate is allowed to change. Suppose there are \(S\) changepoints at times \({s}_{j}\), \(j = 1,\dots , S\). The vector of rate adjustments \(\delta\) is defined, where \({\delta }_{j}\) is the change in rate that occurs at time \({s}_{j}\).

$${a}_{j}\left(t\right)=\left\{\begin{array}{cc}1,& \text{if } t\ge {s}_{j}\\ 0,& \text{otherwise}\end{array}\right.$$
(4)
$$\gamma _{j} = \left( {s_{j} - m - \sum\limits_{{l < j}} {\gamma _{l} } } \right)\left( {1 - \frac{{k + \sum\limits_{{l < j}} {\delta _{l} } }}{{k + \sum _{{l \le j}} \delta _{l} }}} \right)$$
(5)
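A minimal sketch of the piecewise-linear trend of Eqs. (3)–(5), with illustrative values for the changepoints and rate adjustments; note that for the linear trend the continuity offsets reduce to \(\gamma_j = -s_j\delta_j\) (Eq. (5) gives the analogous adjustment for the saturating trend):

```python
def linear_trend(t, k, m, changepoints, deltas):
    """Eq. (3): g(t) = (k + a(t)^T delta) * t + (m + a(t)^T gamma),
    with gamma_j = -s_j * delta_j so g stays continuous at each changepoint s_j."""
    a = [1.0 if t >= s_j else 0.0 for s_j in changepoints]            # Eq. (4)
    rate = k + sum(a_j * d_j for a_j, d_j in zip(a, deltas))          # adjusted slope
    gammas = [-s_j * d_j for s_j, d_j in zip(changepoints, deltas)]   # continuity offsets
    offset = m + sum(a_j * g_j for a_j, g_j in zip(a, gammas))        # adjusted intercept
    return rate * t + offset

# Illustrative parameters: base slope 1.0, slope drops by 0.5 after t = 10
g = [linear_trend(t, k=1.0, m=0.0, changepoints=[10.0], deltas=[-0.5])
     for t in range(21)]
```

With these values the trend grows with slope 1 up to the changepoint and with slope 0.5 afterwards, without any jump at \(t = 10\).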

Seasonality trends are modelled as Fourier series (Eq. (6)), where \(P\) is the regular period and \(N\) the order of the Fourier series. The periodic changes are then modelled as Eq. (7) shows.

$$X\left(t\right)=\sum_{n=1}^{N}\left({a}_{n}\mathit{cos}\left(\frac{2 \pi n t}{P}\right)+{b}_{n}\mathit{sin}\left(\frac{2\pi nt}{P}\right)\right)$$
(6)
$$s\left(t\right)=X\left(t\right)\beta$$
(7)
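Eq. (6) can be sketched as follows; the coefficients \(a_n, b_n\) below are illustrative, whereas in the model they are the components of \(\beta\) fitted from the data:

```python
import math

def fourier_seasonality(t, period, a_coef, b_coef):
    """Eq. (6): truncated Fourier series of order N = len(a_coef)."""
    return sum(
        a_n * math.cos(2 * math.pi * n * t / period) +
        b_n * math.sin(2 * math.pi * n * t / period)
        for n, (a_n, b_n) in enumerate(zip(a_coef, b_coef), start=1)
    )

# Weekly seasonality (P = 7 days), order N = 2, illustrative coefficients
s = [fourier_seasonality(t, period=7, a_coef=[3.0, 1.0], b_coef=[2.0, 0.5])
     for t in range(14)]
```

Because every term has period \(P/n\), the resulting component repeats exactly every \(P\) days, which is the property used to model weekly passenger-flow patterns.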

Holidays are modelled using a matrix of regressors where, for each holiday \(i\), \({D}_{i}\) is the set of its past and future dates, as shown in Eq. (8). The effect of holidays is modelled in a similar way to the periodic changes, following Eq. (9).

$$Z\left(t\right)=\left[1\left(t\in {D}_{1}\right),\dots ,1\left(t\in {D}_{L}\right)\right]$$
(8)
$$h\left(t\right)=Z(t)\kappa$$
(9)
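The holiday component of Eqs. (8)–(9) can be sketched as follows; the holiday dates and effect sizes \(\kappa\) below are hypothetical values for illustration:

```python
import datetime

def holiday_component(date, holiday_dates, kappa):
    """Eqs. (8)-(9): h(t) = Z(t) * kappa, where Z(t) is a row of 0/1 indicators."""
    z = [1.0 if date in dates else 0.0 for dates in holiday_dates]  # Z(t), Eq. (8)
    return sum(z_i * k_i for z_i, k_i in zip(z, kappa))             # h(t),  Eq. (9)

# Two hypothetical holidays, each with past and future dates (the sets D_i)
new_year = {datetime.date(2023, 1, 1), datetime.date(2024, 1, 1)}
easter = {datetime.date(2023, 4, 9), datetime.date(2024, 3, 31)}

# Negative kappa values: fewer passengers enter the station on these holidays
h_holiday = holiday_component(datetime.date(2024, 1, 1), [new_year, easter],
                              kappa=[-120.0, -80.0])
h_ordinary = holiday_component(datetime.date(2024, 6, 3), [new_year, easter],
                               kappa=[-120.0, -80.0])
```

On an ordinary day all the indicators are zero, so the holiday component vanishes and only the trend and seasonality remain.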

A prior distribution representing our beliefs before the application of the algorithm was selected. The distribution selected for \(\delta\) is a Laplace distribution, to obtain a sparse prior. The remaining priors are chosen as Normal distributions, due to computational simplicity and the central limit theorem, whereby the sum of a large number of independent and identically distributed random variables tends to follow a normal distribution. Nevertheless, during the tuning process different distributions, such as Student’s t or exponential, were tested, and no significant changes in the results were found.

The prior distributions are represented in Eqs. (10)–(14).

$$k\sim Normal\left(0,{\sigma }_{k}^{2}\right)$$
(10)
$$m\sim Normal\left(0,{\sigma }_{m}^{2}\right)$$
(11)
$$\beta \sim Normal\left(0,{\sigma }_{\beta }^{2}\right)$$
(12)
$$\kappa \sim Normal\left(0,{\sigma }_{\kappa }^{2}\right)$$
(13)
$$\delta \sim Laplace\left(0,\tau \right)$$
(14)

The approach has been developed using the Prophet Python library [81], an open-source library for univariate time series forecasting that estimates trends over time, considering fluctuations due to seasonality. In this study, as described in Sect. 4, seasonality represents periodic changes relative to the week and the day, due to the considered time horizon and the available data set.

Figure 2 shows, in blue colour, the data related to the passengers entering a given station of the line (collected through the ticketing system) that were used to train the algorithm, whereas the forecasted values are depicted in orange. The model is fitted using Stan’s L-BFGS [82] to find a maximum a posteriori estimate.

Fig. 2
figure 2

The passenger flow prediction at a given station

Given the average passenger flow trend in the different line sections and knowing the asset position, a criticality value \({\pi p}_{i}^{\tau }\) is assigned to each asset \(i\) and time horizon \(\tau\), according to three criticality thresholds, as depicted in Fig. 7. In addition, a static criticality term \({\pi }_{i}\) is given by considering the type of assets and the impact caused by its failure, in terms of service interruption.

3.2 Asset status evaluation and suggestion of interventions

The asset status evaluation is based on support vector machines, a technique based on statistical learning theory with its roots in the structural risk minimisation (SRM) principle. It represents a well-known, state-of-the-art machine learning tool which has been widely used in the last few decades. The one-class SVM algorithm, as defined in Schölkopf et al. [83, 84], is an extension of the original SVM; this technique can be used in an unsupervised setting for anomaly detection. Basically, the OCSVM algorithm separates all the data points from the origin (in the feature space) and maximises the distance from this hyperplane to the origin. This results in a binary function which captures the regions of the input space where the probability density of the data lives. Thus, the function returns \(+1\) in a “small” region (capturing the training data points) and \(-1\) elsewhere.

The algorithm, trained with positive examples only (i.e. data points from the target class), allows only a small part of the dataset to lie on the other side of the decision boundary (the outliers). The quadratic programming minimisation function is slightly different from the original one presented by Cortes and Vapnik [85] and has the form:

$$\underset{w, {\mathit{ }\xi }_{i},\mathit{ }\rho }{{\text{min}}}\frac{1}{2}{\Vert w\Vert }^{2}+\frac{1}{\nu n} \sum_{i=1}^{n}{\xi }_{i}-\rho$$
(15)
$$\text{subject to:}$$
$$\left( {w \cdot \phi \left( {x_{i} } \right)} \right) \ge \rho - \xi_{i} \quad \forall i = 1, \ldots , n$$
(16)
$$\xi_{i} \ge 0\quad \forall i = 1, \ldots , n$$
(17)

In this formulation, the solution is characterised by the parameter \(\nu\), which affects the smoothness of the decision boundary. The role of this parameter is two-fold: on the one hand, it sets an upper bound on the fraction of identified anomalies (training examples regarded as out-of-class); on the other hand, it is a lower bound on the fraction of training examples used as support vectors. Due to the importance of this parameter, this approach is often referred to as \(\nu\)-SVM.

By using Lagrange techniques and using a kernel function for the dot-product calculations, the decision function becomes:

$$f\left(x\right)={\text{sgn}}\left(w\cdot \phi \left(x\right)-\rho \right)={\text{sgn}}\left(\sum_{i=1}^{n}{\alpha }_{i}K\left(x,{x}_{i}\right)-\rho \right)$$
(18)

Thus, this method creates a hyperplane characterised by \(w\) and \(\rho\) that has maximal distance from the origin in the feature space and separates all the data points from the origin.

It was of interest to study whether OCSVM can perform well in the proposed case study. Since OCSVM should be trained on healthy data only, only data points whose features fell within predefined ranges of values were considered for training. A hyper-parameter tuning step was performed to find the values that best fit the dataset, and the best kernel found is the Gaussian Radial Basis Function (RBF) reported in Eq. (19):

$$K\left(x,{x}^{\prime}\right)={\text{exp}}\left(-\frac{{\Vert x-{x}^{\prime}\Vert }^{2}}{2{\sigma }^{2}}\right)$$
(19)

where \(\sigma \in {\mathbb{R}}\) is the kernel bandwidth parameter and \(\| x-x^{\prime}\|\) is the dissimilarity measure.

Therefore, after setting the kernel, the hyper-parameters to be tuned are the smoothing parameter \(\nu\) and the kernel parameter \(\gamma\), defined as \(1/(2{\sigma }^{2})\).
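A minimal sketch of this setup using scikit-learn's `OneClassSVM`; the training data and the `nu` and `gamma` values below are illustrative assumptions, not the values tuned for the case study:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)
# hypothetical "healthy" asset signatures: two features clustered near (0, 0)
X_train = rng.normal(0.0, 0.5, size=(200, 2))

# nu bounds the fraction of training points treated as outliers;
# gamma = 1 / (2 sigma^2) is the RBF kernel parameter
model = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.5).fit(X_train)

# predict() returns +1 for points inside the learned region, -1 outside
healthy = model.predict(np.array([[0.1, -0.2]]))
faulty = model.predict(np.array([[4.0, 4.0]]))
```

In practice, `nu` and `gamma` would be selected by the grid search over the hyper-parameter space described below.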

The result of the training is the creation of a separating hyperplane, which can be visualised in Fig. 3, where points that fall inside the boundary are considered healthy, whereas points that fall outside are considered faulty.

Fig. 3
figure 3

Hyper-plane generated by the RBF

The OCSVM model classifies the assets as “functioning” or “degraded”. An asset is functioning if it behaves normally and degraded if it presents anomalies in its behaviour. When the asset status is degraded, the asset is further classified into different levels of degradation through a threshold-based model that considers the number of anomalies detected for each asset and assigns a level of degradation based on that number.

The threshold-based model defines a specific threshold for each asset, as reported in Table 4. The threshold is estimated according to the asset behaviour and represents the percentage of anomalies above which a critical level of degradation is assigned to the asset.
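The threshold logic can be sketched as follows; the intermediate cut points for the low and medium levels are illustrative assumptions, since the paper specifies only the asset-specific critical threshold:

```python
def degradation_level(anomaly_pct, critical_threshold):
    """Map the percentage of detected anomalies for an asset to one of the
    four status levels used in the paper. Only `critical_threshold` is
    asset-specific (Table 4); the 1/3 and 2/3 cut points are hypothetical."""
    if anomaly_pct == 0:
        return "functioning"
    if anomaly_pct >= critical_threshold:
        return "high degradation"
    if anomaly_pct >= 2 * critical_threshold / 3:
        return "medium degradation"
    return "low degradation"
```

For an asset with a critical threshold of 30%, a 35% anomaly rate would map to high degradation and a 5% rate to low degradation.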

Table 4 Anomaly detection model’s parameters and performance

The space of the hyper-parameters tested for the OCSVM module is \(\nu \in \left\{0.01, 0.02, 0.05, 0.1\right\}\) and \(\gamma\) in a logarithmically distributed space of 60 instances ranging from \({10}^{-6}\) to \({10}^{3}\); thresholds have been tested in a range from 1% to 99%. An example of validation results for a subset of 10 assets is presented in Table 4.

The considered indicators of model performance are:

  • anomaly detection precision: \(\frac{\text{True anomalies}}{\text{Total occurred failures}}\);

  • true anomaly rate: \(\frac{\text{True anomalies}}{\text{Total identified anomalies}}\);

  • precision or positive predictive value \(\text{(}\mathrm{PPV)}=\frac{TP}{TP+FP}\);

  • specificity or true negative rate \(\text{(}\mathrm{TNR)}=\frac{TN}{FP+TN}\).

where:

  • true anomaly is an anomaly event which occurs at most 1.5 months before a reported failure;

  • false anomaly is an anomaly event which is not followed by a reported failure in the following 1.5 months;

  • true positive (TP) is the sum of anomalous patterns correctly predicted as anomalies;

  • false positive (FP) is the sum of normal patterns predicted as anomalies;

  • true negative (TN) is the sum of normal patterns correctly predicted as normal;

  • false negative (FN) is the sum of anomalous patterns identified as normal.
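The four indicators can be computed directly from these counts; the numbers in the usage example are illustrative, not taken from Table 4:

```python
def detection_metrics(tp, fp, tn, fn, true_anoms, total_anoms, total_failures):
    """Performance indicators as defined above. tp/fp/tn/fn are pattern
    counts; true_anoms, total_anoms and total_failures are event counts."""
    return {
        # true anomalies over all occurred failures
        "anomaly_detection_precision": true_anoms / total_failures,
        # true anomalies over all identified anomalies
        "true_anomaly_rate": true_anoms / total_anoms,
        "ppv": tp / (tp + fp),   # precision / positive predictive value
        "tnr": tn / (fp + tn),   # specificity / true negative rate
    }

# hypothetical counts for a single asset
m = detection_metrics(tp=3, fp=1, tn=4, fn=2,
                      true_anoms=3, total_anoms=4, total_failures=6)
```

Note that `fn` is reported for completeness but does not enter the four indicators listed above.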

The data used to evaluate the assets status are:

  • ATS logs: collection of event and alarm logs from the Automatic Train Supervision (ATS) system. This large set of logs allows extracting information about alarms and events related to every asset, as well as information about train movements;

  • asset parameters: meaningful asset parameters collected from the field;

  • maintenance data: collection of corrective and preventive maintenance activities for all the assets.

Based on the level of degradation of an asset, estimated according to the detected anomalies that contribute to the definition of the asset status, a specific maintenance intervention should be done within a suggested due date \({DD}_{i}^{\tau }\), for the considered time horizon \(\tau\). The output of the asset status evaluation is the list of the monitored assets with their status, the related predictive maintenance interventions to be done, and their due dates, as shown in Table 5.

Table 5 Asset status evaluation—output example

3.3 Asset ordering and intervention prioritisation

In this section, the predictive maintenance interventions, identified by the OCSVM model, are prioritised. To order the assets and the related predictive maintenance interventions, an ad-hoc prioritisation method is developed with the objective of mathematically representing the infrastructure manager’s prioritisation criteria and including in the approach the outputs from the data-driven models for the evaluation of asset status and asset criticality.

Therefore, the ordering algorithm uses as inputs the data received from the previous steps; in particular, the due date \({DD}_{i}^{\tau }\) of the maintenance intervention on asset \(i\) and the asset criticality \(\left({\pi }_{i}+{\pi p}_{i}^{\tau }\right).\)

The algorithm is based on an iterative approach aimed at finding the best sequence of assets to be maintained in a given time period \(\tau\). The main idea of the algorithm consists of computing, for each asset \(i,\) the cost \({d}_{i}^{k,r}\) for maintaining that asset in a given position \(p\left(k,r,i\right)\) within the sequence computed at iteration \(k\) of run \(r\). The term \({d}_{i}^{k,r}\) represents the weighted sum of the main infrastructure manager’s targets or KPIs for a given asset \(i\): \({d}_{1i}^{k,r}\) is the KPI related to the status of the asset \(i\) derived from the OCSVM model, \({d}_{2i}^{k,r}\) is the KPI related to the criticality of the asset \(i\) derived from the MCMC model, and \({d}_{3i}^{k,r}\) is the KPI related to the distance to be covered to execute the maintenance interventions according to the position of the asset \(i\) along the line.

In detail, considering a given time horizon τ and assuming, therefore, τ as constant, \({d}_{i}^{k,r}\) is expressed by the following equation:

$$d_{i}^{k,r} = \frac{1}{{b^{r} }}\left[ {\alpha_{1} d_{1i}^{k,r} + \alpha_{2} d_{2i}^{k,r} + \alpha_{3} d_{3i}^{k,r} } \right]$$
(20)

where:

$$d_{1i}^{k,r} = max\left\{ {0,\left( {t_{i}^{k,r} - DD_{i}^{\tau } } \right)} \right\}\quad \quad \forall i,\forall k,\forall r$$
(21)
$$t_{i}^{k,r} = \left( {\mathop \sum \limits_{j \prec i} S_{j - 1,j} + e_{j} } \right) + { }S_{j,i} + e_{i} \quad \quad \forall i,j,\forall k,\forall r$$
(22)
$$d_{2i}^{k,r} = \frac{{\pi_{i} + \pi p_{i}^{\tau } { }}}{{n + 1 - p\left( {k,r,i} \right)}}\quad \quad \forall i,\forall k,\forall r$$
(23)
$$d_{3i}^{k,r} = S_{{j_{{\left( {k - 1} \right),r}} ,i}} + S_{{i,h_{{\left( {k - 1} \right),r}} }} \quad \quad \forall i,j,\forall k,\forall r$$
(24)
$$b^{r} = \left\{ {\begin{array}{*{20}c} {k, if \, k \le \tilde{k}} \\ {k^{0.8}, if \, k > \tilde{k}} \\ \end{array} } \right.\quad \quad \forall r$$
(25)

The cost terms are represented by Eq. (21), Eq. (23) and Eq. (24).

Equation (21) represents the cost of executing the maintenance intervention after the due date, defined according to the asset status; Eq. (23) represents the cost of postponing in the sequence the maintenance of an asset with a high criticality; Eq. (24) is the cost of executing consecutively the maintenance of assets located far from each other. Equation (22) evaluates the time instant of maintenance execution for each asset in the sequence, while Eq. (25) defines the value of the parameter \({b}^{r}\), which changes after \(\widetilde{k}\) iterations. The relevant notation is summarised in Table 6. It is worth noting that only the costs affected by the order in which the maintenance interventions are executed are considered, and that the infrastructure manager can impose different weights \({\alpha }_{l}\) in the objective function according to the importance of the different KPIs.

Table 6 Notation of the ordering algorithm

The ordering algorithm is described in Table 7. At the beginning of each run \(r\), a feasible initial solution is built by randomly generating a sequence of assets. The iterative steps of the algorithm then follow: at each iteration \(k\), the parameters \({d}_{1i}^{k,r}, {d}_{2i}^{k,r}, {d}_{3i}^{k,r}\) are computed for each asset \(i\), and the assets \(i=1\dots N\) are sorted in descending order of the maintenance cost \({d}_{i}^{k,r}\) to generate a new sequence. The iterations stop when \({d}^{k,r}\) is lower than a given threshold \(\delta\) or when the maximum number of iterations \(K\) is reached. At the end of the \(R\) runs, the best \(m\) sequences are shown to the operator with the related KPI values.
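A simplified, self-contained sketch of this sort-and-iterate mechanism, loosely following Eqs. (20)–(25): travel and execution times are collapsed into sequence-position units, and the asset fields (`due`, `crit`, `pos`) are hypothetical, so this illustrates the idea rather than reproducing the exact algorithm of Table 7:

```python
import random

def order_assets(assets, alphas=(1.0, 1.0, 1.0), K=50, k_tilde=10, delta=1e-3):
    """assets: list of dicts with hypothetical fields 'due' (due date in
    position units), 'crit' (pi_i + pi p_i), 'pos' (location along the line).
    Returns a sequence of asset indices (first = maintain first)."""
    n = len(assets)
    seq = random.sample(range(n), n)            # random feasible start
    prev_total = float("inf")
    for k in range(1, K + 1):
        b = k if k <= k_tilde else k ** 0.8     # Eq. (25)
        slot = {i: p + 1 for p, i in enumerate(seq)}   # 1-based positions
        costs = {}
        for i in range(n):
            d1 = max(0.0, slot[i] - assets[i]["due"])           # tardiness, Eq. (21)
            d2 = assets[i]["crit"] / (n + 1 - slot[i])          # criticality, Eq. (23)
            prev_i = seq[slot[i] - 2] if slot[i] > 1 else i     # predecessor in sequence
            d3 = abs(assets[i]["pos"] - assets[prev_i]["pos"])  # distance, Eq. (24)
            costs[i] = (alphas[0] * d1 + alphas[1] * d2 + alphas[2] * d3) / b  # Eq. (20)
        seq = sorted(range(n), key=lambda i: costs[i], reverse=True)
        total = sum(costs.values())
        if abs(prev_total - total) < delta:     # convergence check (delta)
            break
        prev_total = total
    return seq

random.seed(7)
assets = [
    {"due": 2, "crit": 5.0, "pos": 1.0},
    {"due": 1, "crit": 9.0, "pos": 3.0},
    {"due": 4, "crit": 2.0, "pos": 8.0},
    {"due": 3, "crit": 7.0, "pos": 5.0},
]
sequence = order_assets(assets)
```

In the full algorithm, this loop is wrapped in \(R\) independent runs, each with a different random initial sequence, and the best \(m\) resulting sequences are retained.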

Table 7 Algorithm

Since \({d}_{1i}^{k,r}, {d}_{2i}^{k,r}, {d}_{3i}^{k,r}\) are related to a specific asset \(i\), the final KPI values associated with each sequence \(m\) that are shown to the operator are the total values:

$${d}_{1}^{k,r}={\sum }_{i}{d}_{1i}^{k,r}$$
(26)
$${d}_{2}^{k,r}={\sum }_{i}{d}_{2i}^{k,r}$$
(27)
$${d}_{3}^{k,r}={\sum }_{i}{d}_{3i}^{k,r}$$
(28)

The output of the prioritisation algorithm is reported in Fig. 9 of Sect. 4, which shows the related KPI values for the two best sequences.

It is worth noting that the prioritisation considers the predictive maintenance work orders generated by the OCSVM anomaly detection model; however, a corrective maintenance intervention can be integrated at the top of the ordered list by assigning it the highest level of priority.

To test its effectiveness and performance, the proposed approach has been applied to a real-world case study. Its details and the results are discussed in Sect. 4.

4 Case study and results

This section describes the results of applying the approach described in Sect. 3 to a real-world case study: the metro line M5 of the Italian city of Milan (Fig. 4), a fully automated line composed of 19 stations, three of which are transfer stations to lines M1, M2 and M3.

Fig. 4
figure 4

The considered metro line

The total number of the considered assets is 850, belonging to these categories: track circuits, switches, platform doors, signalling equipment rooms, wayside antennas. These assets were selected according to their relevance for the infrastructure manager and the availability of data.

The asset status evaluation is performed using the ATS logs, the assets parameters and the maintenance data.

The ATS logs include a large number of records for each day, which mainly refer to two different categories: events and alarms. The typical log record is composed of three main parts: timestamp, involved system, and event description.

In Table 8, the list of alarms with their description is shown for each asset.

Table 8 Critical events and alarms

The maintenance data include both the preventive maintenance interventions, scheduled based on assets manuals, and the corrective maintenance interventions, scheduled after the occurrence of a failure during the normal train operations. In Table 9, an example of the maintenance data is shown.

Table 9 Maintenance data example

Following the methodology explained in Sect. 3.1, the real data sets of June 2020 of passengers entering and exiting each station are modelled, considering weekly and daily trends.

Before forecasting future values, in-sample predictions are made by dividing the dataset into training and testing data, using 80% of the data for the training set and 20% for the test set.

The information from the training data is used to train the model to forecast the testing data. The differences between the testing data and the forecasted testing data are used to calculate errors.
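The chronological 80/20 split and the error computation can be sketched as follows; the naive repeat-last-day baseline stands in for the actual forecasting model and is an illustrative assumption:

```python
import numpy as np

def train_test_errors(y, forecast_fn, train_frac=0.8):
    """Chronological split: the first 80% trains the model, the last 20%
    is forecast; forecast_fn maps (training values, horizon) -> predictions."""
    split = int(len(y) * train_frac)
    train, test = y[:split], y[split:]
    pred = forecast_fn(train, len(test))
    abs_err = np.abs(test - pred)
    return abs_err.mean(), abs_err.mean() / test.mean() * 100  # MAE, % error

# hypothetical baseline: repeat the last observed day over the horizon
daily_naive = lambda train, horizon: np.resize(train[-24:], horizon)

y = np.tile(np.arange(24.0) + 1.0, 10)  # 10 identical synthetic days (hourly)
mae, pct_err = train_test_errors(y, daily_naive)
```

With perfectly periodic synthetic data the baseline reproduces the test set exactly; on real passenger counts the same function yields the absolute and percentage errors reported below.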

The obtained final results are:

  • train-test prediction: graph showing the training and testing data evolution with respect to time as well as the forecasted values of the testing dataset with their correspondent absolute error (Fig. 5);

  • forecasted values for particular days: different dates have been chosen to be forecasted for the different stations;

  • different seasonalities and trends obtained: plot of the previously defined model components with their uncertainty (Fig. 6). The increase in uncertainty around 3:00 am is due to the closing hours of the metro station; the closing hours (from midnight until 6 am) are not taken into account to fit the model and make predictions.

Fig. 5
figure 5

Train-test prediction for passengers entering a given station

Fig. 6
figure 6

Seasonalities and trends obtained using the model

Finally, the estimation of the average passenger flow at the different line stations is reported in Fig. 7, which shows the most critical sections of the line according to the passenger flow level. From this diagram, the criticality \({\pi p}_{i}^{\tau }\) is evaluated for each asset \(i\) according to its position along the line, considering three criticality thresholds, represented in green, orange and red in Fig. 7, corresponding to high, medium and low criticality.

Fig. 7
figure 7

Passenger flow estimation at different stations (criticality thresholds in yellow, orange and red)

Cross-validation results for eight different stations are shown, as an example, in Table 10. For each station, the mean error values are reported.

Table 10 Cross-validation results

The mean errors are in all cases lower than 8 passengers. The total mean error is 18.56%, which is an acceptable value for the considered application of supporting maintenance decision-making.

The OCSVM model, instead, identifies the list of degraded assets among the 850 considered assets, with the indication of their current degradation and due date for maintenance execution. In Fig. 8, an example of asset status evaluation is depicted for a specific asset, the track circuit. Four different levels of degradation are considered to represent the asset condition. In particular, track circuits are represented in the track layout, coloured based on their status:

  • green colour indicates a functioning condition;

  • yellow colour indicates a low degraded condition;

  • orange colour corresponds to a medium degraded condition;

  • red colour corresponds to a high degraded condition.

Fig. 8
figure 8

Example of asset status evaluation for track circuits

The results show an average anomaly detection precision of 67% and a true anomaly rate of 76% for track circuits, reaching 75% and 80%, respectively, when considering all the assets.

The availability of data about corrective maintenance activities makes it possible to estimate whether, with the application of the proposed approach, corrective interventions could have been avoided. To avoid a corrective maintenance intervention, the anomaly detection model must detect the anomaly sufficiently in advance and with a sufficient level of precision.

The OCSVM identified 50% of track circuits’ anomalies sufficiently in advance to avoid a corrective intervention. Considering all the types of monitored assets, the percentage of avoided corrective interventions reaches 54%.

Therefore, even if the model identifies the anomaly with a high precision, it is not always possible to avoid the failure and the corrective intervention.

Other machine learning algorithms and statistical models are currently under evaluation to compare their performance, such as the simple non-regressive informed machine learning model applied in [86] for track circuits.

As mentioned, the asset status and criticality results are then used as input by the prioritisation algorithm.

A comparison is performed between the scenario in which maintenance planning is carried out without considering passenger data and the scenario in which the criticality estimation is based on the number of passengers. The results show a reduction of around 37% in the number of passengers affected by service interruptions in comparison to the scenario without passenger prediction.

Figure 9 shows the comparison of two prioritisation options (solution 1 and solution 2) according to the three considered KPIs: \({d}_{1}\)(asset status), \({d}_{2}\) (criticality), and \({d}_{3}\) (covered distance). Solution 1 performs better in terms of asset status (- 43% of the cost) but presents a worse performance in terms of asset criticality, while the cost related to the covered distance is comparable in the two solutions.

Fig. 9
figure 9

Comparison of two prioritisation solutions

It is worth noting that the decision support suggests the two best options, and the infrastructure manager can choose the preferred one according to their specific constraints or needs.

A sensitivity analysis has been conducted to test the robustness of the solution. The results show that the solution is robust since the choice of the weights \({\alpha }_{l}\) does not significantly affect the costs.

As an example, Fig. 10 depicts the relative variation of the costs \({d}_{1}\), \({d}_{2}\) and \({d}_{3}\) with the criticality weight \({\alpha }_{2}\), considering as reference values for \({d}_{1}\), \({d}_{2}\) and \({d}_{3}\) those corresponding to \({\alpha }_{2}=1\).

Fig. 10
figure 10

Costs relative variation with the criticality weight \({\alpha }_{2}\)

Finally, to clarify how the passenger flow affects the solution, two scenarios of transport demand are compared:

  • a reference scenario, with the nominal transport demand;

  • a passenger flow peak scenario, which has, in addition to the nominal demand, a predicted increase of passenger flow at station Monumentale M5.

In Fig. 11, the dynamic criticality \({\pi p}_{i}^{\tau }\) for the assets at the different stations of the line is reported. In detail, four discrete dynamic criticality values are assigned for high, medium, low and very low passenger flow.

Fig. 11
figure 11

Passenger flow criticality in the scenario with a passenger flow peak at station Monumentale M5

The dynamic criticality corresponding to the nominal transport demand is reported in blue, and the increase in the dynamic criticality, due to the passenger flow peak at the station Monumentale M5, is depicted in red. These values are assigned to each asset \(i\) according to the station of the line where the asset is located.

The prioritisation, in the reference scenario and in the peak demand scenario, for a subset of 20 assets is reported in Table 11. With respect to the solution obtained in the reference scenario, the decision support system suggests maintaining the asset located at the Monumentale M5 station (highlighted in bold in Table 11) earlier: it is moved from position 20 to position 9 in the prioritised list of maintenance activities. As a consequence, the ordering of the other assets in the proposed prioritisation changes as well, to optimise also the KPI values \({d}_{1}\) (asset status) and \({d}_{3}\) (covered distance).

Table 11 Comparison of maintenance activities’ order in the reference and passenger flow peak scenarios

In particular, the asset status cost \({d}_{1}\) increases by 4.9%, while the covered distance cost \({d}_{3}\) increases by 23% with respect to the reference scenario results. This shows the capability of the model to deal with passenger flow peaks while keeping a good performance in terms of asset status, with a small increase of the covered distance. The variation of the covered distance is usually acceptable since it is deemed less relevant for the infrastructure manager in comparison to the asset status.

5 Conclusions

This paper proposes a novel data-driven prioritisation framework to prioritise maintenance interventions on railway lines taking into account the asset status and criticality. More in detail, a dynamic criticality term related to the service condition in the relevant time period is considered, which is updated on the basis of the passenger flow trend over time at the different stations. The proposed three-step approach includes the analysis of passenger data to evaluate the failure impact on the service, the analysis of alarms and anomalies to evaluate the asset status, and the suggestion of maintenance interventions. The application to the maintenance of the metro line M5 in the Italian city of Milan shows the usefulness of the proposed approach to support infrastructure managers and maintenance operators in making decisions regarding the priority of maintenance activities, reducing the risk of critical failures and service interruptions, and paving the way towards the adoption of prescriptive maintenance strategies.

Based on the asset status and criticality, a list of predictive interventions on track circuits, switches, platform doors, signalling equipment rooms and wayside antennas is calculated. In this way, these interventions are planned more efficiently and have a lower impact on service quality.

The results show a good precision in the detection of the anomaly with an anomaly detection precision of 75% and a true anomaly rate of 80%.

The percentage of avoided corrective interventions identified through the data-driven model is around 54%; these are the corrective interventions that are detected sufficiently in advance and replaced by predictive interventions that can be planned more efficiently.

The reduction of the number of passengers affected by service interruptions is around 37% in comparison to the scenario without passenger prediction.

This paper does not focus on the identification of the best data-driven models for each considered data set and the comparison with other existing methods. For this reason, future developments will consist of testing and comparing the performance of various machine learning algorithms and statistics models and improving the accuracy and the prediction horizon of the forecasting model of asset status. The aim is to predict the failure more in advance, avoiding a higher percentage of corrective interventions.

The presented system represents the backbone of the intelligent asset management system that was developed, implemented, and validated by the IN2SMART2 project.