1 Introduction

Typical railway transport systems (RTSs) comprise complex interconnected components that are often expected to operate at significantly high levels of reliability to meet the ever-growing expectations of passengers, operators, and regulators [1,2,3]. However, despite the introduction of generic reliability, availability, maintainability, and safety (RAMS) guidance standards such as EN50126 for enhancing the interoperability, reliability, and safety of railway systems, trains continue to suffer costly delays, cancellations, and technical failures. Although maintenance continues to be the natural remedy for identifying and rectifying such failures, it often exerts a considerable strain on operational budgets [1, 2]. For instance, Liden [4] reported that the estimated annual combined RTS maintenance budget of the European Union member countries typically is between 12.8 and 21.3 billion Great British Pound (GBP£) per 300,000 km of rail track. Despite significant progress in incorporating reliability, risk, and maintenance in RTSs, most of these applications are used discretely without the benefits of an integrated approach, where the limitations of one technique are compensated by the strength of another [3, 5,6,7,8,9,10,11,12,13,14,15]. Such individualised reliability, risk and maintenance techniques also contribute to the issues and conflicts in decision-making between RTS designers, operators, and suppliers [1, 2]. Thus, ineffective decision-making regarding maintenance improvement has negative consequence on system availability and throughput irrespective of the inherent design reliability characteristics [6, 7, 14, 15].

While these discrete risk- and reliability-centred approaches, including failure mode effects and criticality analysis (FMECA), fault tree analysis (FTA), event tree analysis (ETA), and reliability block diagram (RBD), provide useful theoretical constructs towards the understanding of the causal relationships between failures, they can be unidirectional, as they primarily focus on failure identification and do not directly contribute to identifying appropriate maintenance strategies. The objective of this paper is to introduce a new hybrid model for RTSs that integrates reliability, risk, and maintenance strategy selection and can be adopted by the RTS designers, operators, and suppliers in the three main life cycle phases (design, operation, and maintenance).

2 Literature Review

Studies conducted by Stephen and Labib [13] on hybrid model for learning from failures and Yunusa-Kaltungo et al. [14] on the investigation of critical failures using combinations of root case analysis techniques emphasise that the isolated application of individual approaches limits the ability to apply the strengths of some tools to compensate for the limitations of others. From a practical standpoint, the application of conventional risk and reliability assessment tools such as FMECA and RBD analysis for optimising RTS operations have been explored by several researchers, including analysis of domino effect in process industry using ETA by Alileche et al. [10], ETA for flood protection by Rosqvist et al. [6] and linking risk analysis to safety management by Trbojevic [7]. A fundamental strength of these tools is their simplicity, user-friendliness, and, most importantly, versatility in terms of applicability to various types of industry. Examples of the discretised application of these tools to RTSs include the development of a systematic approach for assessing hazards and human failures in train control systems by Renjith et al. [16], determination and validation of the barriers to failure prevention in RTSs using FTA, Haddon’s ten energy-based injury prevention approaches by Li et al. [17], and risk evaluation of railway rolling stock failures using FMECA technique by Dinmohammadi et al. [3]. Similarly, decision-making grid (DMG) for maintenance selection and strategy allocation has been applied by researchers’ in various disciplines, and the accuracy depends on the quality of data utilised for the horizontal boundaries of the decision map [18,19,20]. The Analytical hierarchy process (AHP) has been applied discretely to analyse complex projects and dynamic problems using multicriteria decision analysis [21, 22]. The perceived weakness of AHP lies within the consistency of results under different questions, even if the goal or target remains the same as demonstrated by Kamal et al. [23] in their study on the application of AHP in project management. However, several of these shortcomings have been adequately addressed by Saaty [24] where AHP was used to provide a flexible, systematic, and repeatable evaluation process for selecting optimal alternatives amidst multiple criteria.

Nystrom and Soderholm [25] attempted to address these shortfalls by proposing an RTS maintenance decisionmaking framework that uses AHP for prioritising competing maintenance initiatives. Similar integrated approaches using fuzzy FTA analysis with AHP have been proposed to address reliability issues in the railway industry [26, 27]. In Huang et al. [26], maximum probabilities of railway system traffic failure were identified using predefined alternative fuzzy sets, while Song et al. [27] identified the maximum probability of railway system traffic failures using predefined alternative fuzzy sets. For addressing maintenance issues, the rolling stock sector is adopting proactive techniques such as reliability-centred maintenance (RCM); however, in most cases such measures are implemented after the design stage, thereby increasing the implementation cost [28]. Following the realisation that some of these tools can extend to approaches beyond failure identification, there has been further exploration towards improving RTS safety and overall performance [28,29,30,31]. The RCM-based studies focussed on the identification and ranking of high-impact failure modes (FMs) associated with RTS components to improve maintenance decision-making. Although these RCM-based studies have improved the ability of railway industry maintenance and operations managers to direct scarce resources to areas in which they might be most needed, holistic asset management frameworks should be capable of modelling the fundamental events that trigger such FMs in addition to the causal relationships between them.

Furthermore, hybrid frameworks proposed by Stephen and Labib [13] and Sargent and Hall [32] in the context of operations engineering can be fundamentally grouped into two classes—hybrid models and hybrid modelling—based on the use of models and output procedures, and their corresponding applications, respectively. The focus of this paper is primarily on hybrid models in which the outputs from one tool/technique systematically form inputs of another. Additionally, studies such as Stephen and Labib [13] on a hybrid model for failure analysis, Labib and Read [33] on learning from failures, and Sargent and Hall [32] regarding the historical view of hybrid simulation and analytic models have attempted to identify the key drivers for the application of hybrid models to industrial research. Based on the premise that the management of industrial failures and selection of cost-effective maintenance strategies together account for a significant proportion of overall downtime [34], our primary objective was to create an approach that simplifies the process of managing these issues by capitalising on the strengths of existing tools to ease their deployment and acceptance by industry. Other attempts aimed at applying hybrid models consisting of two or more techniques include Yunusa-Kaltungo et al. [14] for investigating the critical failures using root cause analysis with FTA and RBD, Zubair et al. [35] on nuclear accident precursors using AHP and Bayesian network models, the proposal of Ishizaka and Labib [36] for a hybrid integrated approach using FTA and AHP to analyse disaster prevention, and the work of Appoh et al. [37] on the hybrid dynamic probability model for complex train failure analysis using Bayesian network and Petri nets. A significant number of existing hybrid models focus primarily on learning from failure and on accident investigation. While such approaches are central to continuous improvement in all fields, including reliability and asset management, the obtained output is generally a set of action plans that do not necessarily iterate the decision-making stage of the process. Additionally, such hybrid approaches are often restricted to two or possibly three methods, while our approach involves the seamless application of multiple tools while explaining the usefulness of each.

Our review of the literature on integrated reliability, risk, and maintenance selection techniques revealed a lack of in-depth studies on the interdependency between these strategies at the design phase to ensure the holistic dependability assessment of rolling stock systems. Therefore, our emphasis in this study was not to introduce entirely new tools but instead to propose a novel approach that integrates discrete reliability, risk, and maintenance tools into a single optimised holistic framework. In addition to enabling synergy among individual tools, the proposed integrated framework can potentially help reduce the incidence of diagnosed faults in complex rolling stock systems and improve the decision-making element of maintenance downtime through the conventional serial application of different techniques. Thus, this paper adds two fundamental contributions to the existing research on hybrid models for reliability, risk, and maintenance optimisation for RTS: first, it provides an improved approach for evaluating reliability and risk using FMECA and FTA, and simultaneously assigns maintenance strategies through the use of a decision-making grid (DMG) as part of a single framework for assessing subsystems early in their design phase; second, the model provides an opportunity to identify actions to improve subsystems allocated to important regions of the maintenance decision map via AHP pairwise comparison features that take both competing factors and constraints in the organisation into account.

The remainder of this paper is organised as follows. Section 3 provides a description of the proposed hybrid model demonstrating the integration of its upstream reliability and risk assessment methods with its downstream maintenance decision-making method. In Sect. 4, we present a practical demonstration of the functioning of the model using real-life PBU data provided by a UK train operator. Finally, Sect. 5 provides the conclusion and discusses potential future work.

3 Proposed RTS Hybrid Model

A flowchart of the proposed framework is shown in Fig. 1. The upstream segment relates to system decomposition, risk, and reliability assessments through FMECA and FTA techniques, while the downstream segment focuses on decision-making using DMG for the initial allocation of maintenance strategies and further analysis using AHP to assign improvement actions to subsystems in the worst region of the decision map. Both upstream and downstream elements could be applied in isolation, depending on the technical requirements of the system under study. In specific cases, when only reliability and risk requirements need to be considered, the upstream element is applied. Where prior allocated asset management strategies for the subsystems are deemed desirable within the first phase of the downstream element (i.e., maintenance allocation), then further analysis to identify improvement actions for the subsystems may not be required. The model can be applied by the RTS vehicle designers, original equipment manufacturers and suppliers, and train operators as a continuous improvement model.

Fig. 1
figure 1

Conceptual flowchart of the proposed RTS hybrid model

3.1 Input Data for the Hybrid Model

Existing RTS subsystems, which require an upgrade or modification, can use historical data from the maintenance management system (MMS) as a foundation for analysis, as shown in Fig. 1. For a new subsystem without MMS data, information obtained from similar existing systems and technical design specifications can serve as the basis for the analysis. FMs and failure rates data obtained from the FMECA analysis will serve as the input data to the FTA. Design information, failure rate, and schematics will also provide additional input data for the qualitative and quantitative FTA. The input data for the DMG, such as failure frequency and delay minutes, can be obtained from the existing MMS data. Besides, purchasing information, and repair and planning data can equally serve as the input data to the DMG. For novel subsystems without historical data, data from a similarly configured subsystem and intended mission profile information can form the DMG input data. Finally, the data, project goals, improvement alternatives, and criteria for the AHP will be established by the RTS design project team as part of the brainstorming exercise, elicitation, or facilitation processes. The input data for developing the DMG criteria can come from the MMS data. The improvement options or alternatives can be established by key stakeholders in the RTS project team.

3.2 Processing Procedures for the Proposed Hybrid Model

The proposed hybrid modelling process can be categorised into four main steps:

  • Step 1: Subsystem definition and failure modes classification

This stage involves establishing component functions, functional failures, and corresponding FMs and their effects on the entire subsystem under study [3] as shown in step 1(a). In this study, the emphasis is on the risk priority number (RPN). The objective of this step is to define and identify all critical FMs based on RPN as part of criticality analysis. Here, we denote a FM \(k\) within a subsystem \(j\) by \({\mathrm{FM}}_{kj}\). Each \({\mathrm{FM}}_{kj}\) is assigned an RPN based on estimation of its severity (\({S}_{kj}\)), occurrence (\({O}_{kj}\)), and detectability (\({D}_{kj}\)), as follows:

$${\mathrm{RPN}}_{kj}={S}_{kj}\times {O}_{kj}\times {D}_{kj} k=1, \dots ,m, j=1, \dots ,{n}_{k}$$
(1)

where \(k\) represents the first failure mode from \(k=1\) to last failure mode \(m\) and \(j\) represents the first subsystem from \(j=1\) to the last subsystem \(j={n}_{k}.\)

RPN is assigned to each FM identified as part of the FMECA process. For mission-critical systems in which detectability (\({D}_{kj})\) forms an integral component of severity (\({S}_{kj}\)), the \({\mathrm{RPN}}_{kj}\) in Eq. (1) can be rewritten as [38]

$${\mathrm{RPN}}_{kj}= \sum_{j=1}^{{n}_{k}}\left({S}_{kj}\times {O}_{kj}\right).$$
(2)

Thus, \({O}_{kj}\) of the \({\mathrm{RPN}}_{kj}\) is evaluated by selecting the frequency rating in column 1 using the range of predicted failure rate in column 2, as shown in Table 1. Similarly, the severity level \({S}_{kj}\) is selected from Table 2 in row 2 based on the consequence of the failure. \({\mathrm{RPN}}_{kj}\) is then estimated using Eq. (2) where \({D}_{kj}\) is considered as part of \({S}_{kj}\) for mission-critical systems.

Table 1 Failure mode frequency rating derived from EN 50126 [39]
Table 2 Failure mode severity rating derived from EN 50126 [39]

The next stage in step 1(b) of Fig. 1 is to determine the overall subsystem risk level based on the overall component failure mode. The overall railway rolling stock subsystem risk level (\({T}_{kj}\)) can be obtained by evaluating overall component failure frequency level (\({F}_{kj}\)) in the vertical axis on a scale of one (very unlikely) to six (frequent) and overall severity level (\({H}_{kj}\)) in the horizontal axis for the subsystem on a scale of one (insignificant) to four (catastrophic), using Eq. (3) and the ranking in Table 3 [39]:

$${T}_{kj}=\prod_{j=1}^{{w}_{k}}\left({F}_{kj}\times {H}_{kj}\right) k=1,\dots , m; j=1,\dots ,{w}_{k}$$
(3)

where \(k\) represents the first overall component’s frequency and severity from \(k=1\) to the last overall component’s frequency and severity \(m\), and \(j\) represents the number of overall components within the subsystems from the first overall component \(j=1\) to the last overall component \(j={n}_{k}.\) Note that the risk matrix may represent different scales for different organisations based on the frequency of asset failures and their severities. The overall component failure rate is assessed based on the individual failure mode failure frequency. Similarly, the overall severity level is assessed on the basis of the combined impact of the individual failure mode severity.

Table 3 Risk evaluation matrix [39]
  • Step 2: Determination of failure causal relationships and overall reliability

Step 2 of the hybrid model deals with the estimation of failure rate and reliability. The most critical \({\mathrm{FM}}_{kj}\) are used as inputs to the FTA, which illustrate the logical relationship between the top failure event (overall failure rate) and provides an alternative means of investigating failures [38]. Assuming that all failure events are statistically independent, the probability of the top event (TE) failure in the FTA (i.e., \(P\left(TE\right)\)) is given by [4042];

$$P\left(TE\right)=\prod_{i=1}^{n}{P}_{i}$$
(4)
$$P\left(TE\right)=\sum_{i=1}^{n}{P}_{i}$$
(5)

In (4) and (5), each basic event from \(i\) to \(n\) is modelled using OR-gate and AND-gate, respectively. Despite the known versatility of FTA, it is quite common for it to be merged with RBD to produce a simplified graphical representation of a system and directly obtain its success probability [33, 41]. For a series configuration with statistical independence of events, where a failure of any component within the subsystem can result in downtime, the output of an OR-gate corresponds to a series system with a series of independent basic event probabilities \({P}_{i}\) for events \(i\) to \(t\). Here, the probability of failure \({P}_{F}\) can be estimated as

$${P}_{F}=1-\prod_{i=1}^{t}\left(1-{P}_{i}\right) i=1,\dots ,t$$
(6)

The mean failure rate λ and repair rate μ for a series RTS subsystem in which basic events from \(i\) to \(t\), assumed to have statistical independence, can be defined, respectively, as

$$\lambda =\sum_{i=1}^{t}{\lambda }_{i}$$
(7)
$$\mu =\sum_{i=1}^{t}{\lambda }_{i}\frac{\prod_{i=1}^{t}{\mu }_{i}}{\prod_{i=1}^{t}\left({\lambda }_{i}+{\mu }_{i}\right)-{\prod }_{i=1}^{t}{\mu }_{i}}$$
(8)

The output event of an OR-gate for a series-connected subsystem can then be estimated as [40]

$${P}_{F}=\frac{\lambda }{\lambda +\mu }=1-\prod_{i=1}^{t}\frac{{\mu }_{i}}{{\lambda }_{i}+{\mu }_{i}}$$
(9)
$${P}_{F}=\prod_{i=1}^{t}{P}_{i}$$
(10)

Similarly, the repair rate μ and failure rate λ of a parallel configuration subsystem can be used to estimate the output event of an AND-gate assuming statistical independence of events:

$$\mu =\sum_{i=1}^{t}{\mu }_{i}$$
(11)
$$\lambda =\sum_{i=1}^{t}{\mu }_{i}\frac{\prod_{i=1}^{t}{\lambda }_{i}}{\prod_{i=1}^{t}\left({\lambda }_{i}+{\mu }_{i}\right)-\prod_{i=1}^{t}{\lambda }_{i}}$$
(12)
$${P}_{F}=\frac{\lambda }{\lambda +\mu }=\prod_{i=1}^{t}\frac{{\lambda }_{i}}{{\lambda }_{i}+{\mu }_{i}} .$$
(13)

With Eqs. (6)–(13), the overall failure rate and thus reliability of the subsystem can be estimated, as shown in step 2 of Fig. 1. The basic symbols for FTA are shown in Table 4.

Table 4 Basic symbols for FTA
  • Step 3: Maintenance allocation DMG

The third stage of the hybrid model involves the selection of appropriate maintenance strategies using the DMG. This process is undertaken considering reliability and risk estimates as well as operational factors, financial information, and train information using data from MMS [step 3(a) Fig. 1)]. This stage involves the application of clustering analysis over a distance interval that allows for equal and robust criteria measurements for the two selected rankings of factors that most impact the RTS subsystem [43]. For maximum \({X}_{\mathrm{max}}\) and minimum values \({X}_{\mathrm{min}}\), the ranges of the three criteria for the two ranking factors as indicated in step 3(b) are estimated as follows:

$${\text{High}}_{{{\text{criterion}}}} = \left[ {\left( {X_{\max } } \right), \left( {X_{\max } - \left[ {\frac{{X_{\max } - X_{\min } }}{3}} \right]} \right)} \right]$$
(14)

where \({\text{High}}_{{{\text{criterion}}}}\) ranges from the small value of \(\left( {X_{\max } - \left[ {\frac{{X_{\max } - X_{\min } }}{3}} \right]} \right)\) to the large value of \(\left( {X_{\max } } \right)\)

$${\text{Medium}}_{{{\text{criterion}}}} = \left[ { \left( {X_{\max } - \left[ {\frac{{X_{\max } - X_{\min } }}{3}} \right] } \right),\left( {X_{\max } - 2\left[ {\frac{{X_{\max } - X_{\min } }}{3}} \right]} \right)} \right]$$
(15)

where \({\text{Medium}}_{{{\text{criterion}}}}\) ranges from the small value of \(\left( {X_{\max } - 2\left[ {\frac{{X_{\max } - X_{\min } }}{3}} \right]} \right)\) to the large value of \(\left( {X_{\max } - \left[ {\frac{{X_{\max } - X_{\min } }}{3}} \right]} \right)\)

$${\text{Low}}_{{{\text{criterion}}}} = \left[ {\left( {X_{\max } - 2\left[ {\frac{{X_{\max } - X_{\min } }}{3}} \right] } \right),\left( {X_{\min } } \right)} \right]$$
(16)

where \({\text{Low}}_{{{\text{criterion}}}}\) ranges from the small value of \(\left( {X_{\min } } \right)\) to the large value of \(\left( {X_{\max } - 2\left[ {\frac{{X_{\max } - X_{\min } }}{3}} \right]} \right)\).

The output of the DMG is a tri-quadrant decision map, in which the respective segments represent the defined asset management strategies for the organisation. Each organisation can define unique asset management strategies relevant to the operation of the RTS subsystem. Here, the following strategies were adopted for the decision map; breakdown maintenance (BM), planned preventive maintenance (PPM), condition-based maintenance (CBM), skill level upgrade (SLU), and design out maintenance (DOM), as shown in Fig. 2 [18, 44].

Fig. 2
figure 2

Typical DMG showing different maintenance strategies and their boundaries

This map is usually constructed using a combination of data measured in relation to the key performance drivers relating to the railway operation—in this case, failure frequency and delay minutes originating from MMS. The three criteria will form the tri-quadrant axes for the DMG decision map, as shown in Fig. 2 and as illustrated in the model [step 3(c) Fig. 1]. The boundary measurements for the homogenous criteria drivers are equally partitioned to enable holistic capturing of the extremes in the measurement data, as shown in Eqs. (14)–(16) and indicated in step 3(d) of the proposed model.

  • Step 4: Maintenance improvement by AHP

The fourth and final step in the hybrid model involves the selection of improvement options through multi-criteria decision-making method (MCDM) where improvement options are selected for the worst regions, i.e., subsystems located on the far-right corner of the decision map, such as DOM in Fig. 2. The following steps explain AHP as a method for MCDM [45,46,47]:

  1. i.

    Establish the goal, that is the first layer, for maintenance improvement with the RTS project team and stakeholders as indicated in step 4(a).

  2. ii.

    Define and organise the criteria \(m\) as shown in step 4(b) by clustering them under different hierarchy levels using key performance indicators defined by the RTS project team or from the MMS data.

  3. iii.

    Identify the maintenance improvement options or alternatives \(n\). The RTS project team or key stakeholders will score the alternative \(n\) against the criteria \(m\). A ratio \({w}_{i}/{w}_{j}\) which is the weight of alternative \(i\) to \(j\) is assigned to criteria \(m\) to reflect the relative importance of the decision. Thus, the decision maker, normally the RTS project manager, will create judgement matrix \(\left(m\times m\right)\) with a dimension of \((n\times n)\) alternative for pairwise comparison \({a}_{ij}\) which is an approximation of the ratio \({w}_{i}/{w}_{j}\). The value assigned to \({a}_{ij}\) is typically in the interval \([1/9, 9].\) The estimated weight vector \(w\) is found by solving the following eigenvector problem:

    $${A}_{W= }{\lambda }_{\mathrm{max}}W$$
    (17)

    where \({\lambda }_{\mathrm{max}}\) is the principal eigenvalue for the pairwise comparison matrix A,

    \(A=\left[\begin{array}{ccccc}\cdots & {C}_{1}& {C}_{2}& \dots & {C}_{n}\\ {C}_{1}& {w}_{1}/{w}_{1}& {w}_{1}/{w}_{2}& \dots & {w}_{1}/{w}_{n}\\ {C}_{2}& {w}_{2}/{w}_{1}& \dots & \dots & \dots \\ {C}_{n}& {w}_{n}/{w}_{1}& {w}_{n}/{w}_{2}& \dots & {w}_{n}/{w}_{n}\end{array}\right]\) for \({C}_{1},\dots ,{C}_{n}\) where \(n \ge 2\) criteria.

  4. iv.

    Next, the consistency index (CI) is determined by computing \({A}_{W}\) and approximating the minimum eigenvalue, \({\lambda }_{\mathrm{max}}\) as \(\left({\lambda }_{\mathrm{max}}-n\right)/\left(n-1\right)\), where \(n\) is the matrix size as indicated in step 4(c). The consistency ratio (CR), which is a test for reliability of consistency, for a given reciprocal matrix can be obtained by estimating the ratio of CI to the average random consistency index (RI) as shown in Table 5. The pairwise comparisons in a judgement matrix are considered adequate if the corresponding CR is less than 10% [45,46,47]. It is a feedback to the decision makers to capture logical and reasonable preferences when making judgements. Table 6 illustrates the fundamental scale of relative importance.

    Table 5 Random consistency index (RI) [42]
    Table 6 Scale of relative importance numbers [45,46,47,48, 55]
  5. v.

    The last step of the AHP process, as shown in step 4(d) of Fig. 1, is to conduct group decision-making to select an alternative option to improve the maintenance of the RTS subsystems. To ensure a coherent approach to decision-making and agreement within the group regarding the decision in the context of AHP, particularly in an RTS organisation that comprises various disciplines and experts, Shannon entropy \(H\) (where \(H\) can be interpreted as a measure of evenness of priorities among the criteria for individual decision makers) is proposed [49]. Shannon alpha and beta entropies for \(N\) criteria and \(K\) decision makers represent the mean Shannon entropy of group decision makers [49,50,51]:

    $${H}_{\alpha }={w}_{1}\sum_{i=1}^{N}{p}_{i1}ln+{w}_{2}\sum_{i=1}^{N}{p}_{i2}+\dots$$
    (18)

    where \({p}_{i}\) denotes the calculated priorities for criteria \(i=1\) to \(N\). Assuming equal weights \({w}_{1}={w}_{2}={w}_{k}=1/K\), then the effective number of criteria for Eq. (18) is represented by \(D_{\alpha } = \exp H_{\alpha }\). We can estimate the Shannon gamma diversity \({H}_{\gamma }\) for the group aggregated priorities as follows:

    $$H_{\gamma } = \mathop \sum \limits_{i = 1}^{K} (w_{1} p_{i1} + w_{2} p_{i2} + \cdots )\ln \left( {w_{1} p_{i1} + w_{2} p_{i2} + \cdots } \right)$$
    (19)

    Assuming equal weights \({w}_{1}={w}_{2}={w}_{k}=1/K\), the effective number of criteria (true gamma diversity of order one) is \(D_{\gamma } = \exp H_{\gamma }\). The difference between \({H}_{\gamma }\) and \({H}_{\alpha }\) is the beta diversity \({H}_{\beta }\), which is equivalent to the true beta diversity of order one as \({D}_{\beta }={D}_{\gamma }/{D}_{\alpha }\). True beta has a maximum diversity equivalent to \(N\); the minimum is one, which means there is no variation between the decision makers. To measure the consensus indicator for group decision makers, a new homogeneity index \(M\), which is a reciprocal of \({D}_{\beta }\), is introduced as \(M={D}_{\alpha }/{D}_{\gamma }\). This can be transformed into a relative index of homogeneity to measure the consensus indicator in the range from zero to unity [49,50,51]:

    $$S = \left( {1/D_{\beta } - D_{\alpha \min } /D_{\gamma \max } } \right)/\left( {1 - D_{\alpha \min } /D_{\gamma \max } } \right)$$
    (20)

If \(D_{\alpha \min } = 1\) and \(D_{\gamma \max } = N\), Eq. (20) can be transformed as follows:

$$S = \left( {1/D_{\beta } - 1/N} \right)\left( {1 - 1/N} \right)$$
(21)

Thus, the relative index of homogeneity \(S\) can be considered as a consensus indicator. When the priorities of all the decision makers are completely distinct, it is zero, and it is one (unity) when the priorities of all the participants are identical. In addition, when decision makers give full preference to one criterion, the alpha entropy is minimum. In that case, the outcome from the pairwise comparisons of \(N\) criteria is equivalent to \(M/(N+M-1)\) for the selected criterion with a remainder of \(\left(N-1\right)\) priorities equal to \(1/(N+M-1)\). Hence, the minimum and maximum alpha entropies for N criteria and K decision makers can be calculated as [49,50,51]:

$$H_{\alpha \min }^{*} = - \frac{M}{N + M - 1}\ln \left( {\frac{M}{N + M - 1}} \right) - \frac{N - 1}{{N + M - 1}}\ln \frac{1}{N + M - 1}$$
(22)
$$H_{\gamma \max }^{*} = \left( {N - K} \right)\left( { - \frac{1}{N + M - 1}\ln \frac{1}{N + M - 1}} \right) - \left( {\frac{K + M - 1}{{N + M - 1}}} \right)In\left( {\left( \frac{1}{k} \right)\left( {\frac{K + M - 1}{{N + M - 1}}} \right)} \right)$$
(23)

The new AHP consensus indicator for effective group decision-making is estimated by transforming Eqs. (22) and (23) into a form similar to that of Eq. (20) by using \({D}_{\alpha }\) and \({D}_{\gamma }\) to keep the indicator in the range from 0 to 1 as follows:

$$S^{*} = \left( {1/D_{\beta } - D_{\alpha \min }^{*} /D_{\gamma \max }^{*} } \right)/\left( {1 - D_{\alpha \min }^{*} /D_{\gamma \max }^{*} } \right)$$
(24)

The AHP consensus indicator \({S}^{*}\) can be interpreted as shown in Table 7.

Table 7 AHP consensus indicator

Steps \(i\) to \(v\) are repeated until a requisite improvement alternative is obtained and implemented. In this study, using an online AHP software, three senior management team members from the train operator (senior project manager, procurement manager, operations manager) and three senior management team members from the PBU manufacturer (assurance manager, mechanical design manager, electrical design manager) constituted the group decision makers [52].

In this section, we presented a systematic guide on how to use the proposed hybrid model to assess the overall risk and reliability and allocate appropriate maintenance strategies at different life cycle stages of the rolling stock subsystem. This approach, particularly the downstream segment, is not based on a one-stop principle. Rather, the framework will achieve maximum benefit if it is adopted as a means of producing continuous improvement in which the overall goal is for components to move as close as possible to the low-low region of the decision map. In the next section, we present a case study in which the proposed framework is implemented using real-life RTS data.

4 Case Study

To demonstrate the applicability and sequence of implementation of the proposed hybrid framework, we consider a case study based on an ongoing project for the design and delivery of new pneumatic brake units (PBUs) to replace the unreliable PBUs in an in-service electrical multiple-unit (EMU) rolling stock (RS) operation for a train operator in the UK. As a measurement of failure rate, the desired operator PBU reliability requirement is \(1.607\times {10}^{-5}\) per hour equivalent to one failure in 62,227 h with a maximum allowable overall risk that should be tolerable, thus, as low as reasonably practicable. The current PBU failure rate was estimated as \(5.558\times {10}^{-5}\) per hour, indicating that there was a failure every 17,992 h of train operation. The existing PBU reliability led to approximate delay impact minutes and service-affecting failures of 1002 min and 203 failures over 3 years. This was far below the operational requirements of one failure in every 62,227 h (i.e.,\(1.607\times {10}^{-5}\) \(\mathrm{per hour})\) of train operation. Delay impact minutes occur when the train fails to recover during the first 3 minutes of technical failure in passenger service, and the associated failure after 3 minutes is referred to as the service-affecting failure. As part of the design process, it is also necessary to establish and allocate all the appropriate maintenance strategies for the PBU components to ensure optimum customer operational requirements that avoid unnecessary train delays due to PBU failures.

4.1 Case Study Background

The RS operation employs dual-voltage EMUs that use 25-kV AC overhead and 750-V DC power supplies from the third rail. An analysis of the 3-year historical data obtained from the MMS revealed that the PBUs represent one of the worst-performing subsystems in terms of service reliability and delay minutes, as shown in Fig. 3. During this period, the maintenance and delay impact minute costs to the operator owing to PBU failure were £3,153,689.34 and £1,056,000, respectively, and exceeded the operator’s penalty cost budgets by 315% and 956%, respectively. The critical components considered for the PBU module comprises a non-return valve (NRV) assembly, a main air compressor (MAC), a flexible delivery hose (FDH), a brake control unit (BCU), and an air filter (AF). The BCU is further divided into a relay valve (RV), check valve (CV), magnet valve (MV), and pressure governor (PG). The PBU stores compressed air in a dedicated reservoir that is protected from the main air reservoir pipe (MRP) pressure losses by a non-return valve (NRV). Fig. 4 shows a high-level schematic of a PBU and its associated components (note not all components are shown). The total annual distance of 150,000 km at an average speed of 24.5 km/h (with a top speed of 60 km/h) for the train is considered.

Fig. 3
figure 3

EMU train subsystem performance based on delay minutes and failure frequency

Fig. 4
figure 4

A simple schematic diagram of a PBU subsystem

4.2 Analysis, Discussion, and Implementation

The process commenced with the collection of historical and operational data from the train operator’s railway rolling stock MMS database as depicted in Fig. 1. The FMECA for a subsystem was first conducted to identify the PBU subsystem FMs along with their corresponding RPNs, failure rates, and their effects on the subsystem as shown in step 1(a) of the model depicted in Fig. 1. The failure rate for each FM was estimated as per Table 8, which was then used to extrapolate the frequency of occurrence \({(O}_{kj})\) for each FM using Table 1. Appendix A sets out the basis for the quantitative FM and the overall failure rate estimations. Furthermore, severity rating, \({S}_{kj},\) for each FM was extrapolated from Table 2. The \({\mathrm{RPN}}_{kj}\) was estimated for each component of the PBU using Eq. (2) based on the established severity and frequency rating values as per Table 8. MAC FMs (overheat compressor and filter passing contaminated air) including FDH (minor atmospheric leakage) and BCU magnetic valve error (no opening) failure modes were noted to have significantly elevated RPNs of eight as depicted in Table 8. Although MAC and FDH FMs have the same RPN, the overall risk and consequence of each component failure is not the same. Next, the overall PBU risk (Tkj) was estimated using the railway risk matrix as per Table 3, following Eq. (3) and by extrapolating the component levels for \({F}_{kj}\) and \({H}_{kj}\). As shown in Table 9, the overall component failure frequency level \({F}_{kj}\) (estimated based on the overall failure rates described in Appendix B) for the MAC and FDH were evaluated as level 3 (remote), followed by AF with level 2 (improbable), and NRV and BCU with level 1 (very unlikely). Similarly, the overall component severity levels \({H}_{kj}\) for subsystems MAC, NRV, FDH, and AF were all evaluated as level 2 (marginal), in which the failure of a given subsystem can lead to the functional reduction of the PBU. In contrast, \({H}_{kj}\) of the BCU was determined to be level 3 (critical) and can potentially cause a complete functional loss of the entire PBU. Thus, at an overall subsystem failure frequency level 3 (remote) and a failure severity level 2 (marginal), the overall PBU risk level \({T}_{kj}\) could be classified as tolerable, indicating that the PBU design has a tolerable risk level that meets the operator’s requirements. The summary of RPN, failure rates, and risk results, as illustrated in steps 1(a) and 1(b) (Fig. 1), are shown in Tables 8 and 9.

Table 8 PBU component failure modes and estimated failure rates
Table 9 Summary of risk assessments for PBU components and overall subsystem

Next, the logical relationship between the top and basic events including the prediction of overall failure rate and reliability by FTA was conducted as per step 2 of Fig. 1. As shown in Fig. 5, the TE, Full-service braking error of the train, can occur whenever any of the five subsystems fail. The FTA was constructed based on the FMECA information from Table 8. The failure rates (for each failure mode) shown in Table 8 were estimated using historical data from the MMS as part of the analysis in step 1 (Appendix B). Thus, the TE is connected by OR-gates to the lower failure events \(({M}_{1}, {M}_{2}, {M}_{3}, {M}_{4}\), and \({M}_{5}).\) The MAC function loss \({M}_{1}\) has two failure causes \(({F}_{1}, {F}_{2})\) and cannot operate if either of the basic events \(({F}_{1} or {F}_{2})\) occurs; therefore, it is also connected to these events via an OR-gate. Similarly, the NRV, FDH, and AF function losses \({M}_{2}\), \({M}_{3}\), and \({M}_{4}\), respectively, are connected to their respective failure causes by OR-gates. The BCU function loss \({M}_{5}\) has four intermediate events \({(M}_{6}, {M}_{7}, {M}_{8}, {M}_{9})\) and fails to function only if all four intermediate events fail simultaneously; therefore, it is connected to these events through an AND-gate. The RV error \({M}_{6}\) occurs if any of its failure causes \(({F}_{9} or {F}_{10})\) occur and, therefore, has an OR-gate connection, as do the CV, MV, and PG errors, \({M}_{7}\), \({M}_{8},\) and \({M}_{9},\) respectively. From Eqs. (4)–(13) and the rules of Boolean algebra [36, 37], a total of 24 minimal cut sets (MCSs) were identified (including eight single-point and 16 quadruple-point failures), and an overall failure rate estimate of \(1.385\times {10}^{-5} \mathrm{per hour}\) (which adequately exceeds the initially prescribed operator requirement of \(1.607\times {10}^{-5}\) \(\mathrm{per hour}\)) was derived as indicated in step 2 of Fig. 1. The overall FTA diagram is shown in Fig. 5. With the expected reliability of one failure in every 72,202 hours of operation compared to the customer’s original requirement of one every 62,227 h (\(1.607\times {10}^{-5}\) \(\mathrm{per hour}\)), a further 16.03% increase in reliability was demonstrated. Moreover, the current failure rate of \(5.558\times {10}^{-5}\) per hour indicates a significant reliability increase of 301%. The increase in reliability can be demonstrated as an equivalent improvement in the delay minutes (from 1003 to 101 min) and a reduction in service-affecting failures (from 203 to 51 failures). Assuming that labour, spare parts, and logistic delay costs remain relatively constant, the forecasted failure rate indicates cost savings of approximately 16.03% over the desired customer reliability target and 301% over the existing unreliable PBU. Even under worst-case economic scenarios whereby high inflation rates are applied to labour, spare parts, and logistic delay costs, the new and enhanced reliability will still yield optimum cost savings over the existing PBU.

Fig. 5
figure 5

Pneumatic brake unit fault tree diagram

Using Eqs. (14)–(16), the intervals for both criteria (failure frequency and delay impact minutes from the MMS data) were determined as shown in step 3(a) of Fig. 1. For the failure frequency criterion, estimated high, medium, and low intervals of {30, 44}, {16, 30}, and {3, 16}, respectively, were obtained; for delay minutes, the corresponding estimated intervals were {64, 79}, {49, 64}, and {5, 49}, respectively. Thus, the frequency criterion range was estimated to be between 3 and 44, while the delay minute range was determined to be between 5 and 49. Figure 6 shows an extract of the decision map for selected components as illustrated in steps 3(b) and 3(c) of Fig. 1. It is seen that AF, NRV, PG, and CV are located within the BM quadrant owing to their low-low combinations; MV and RV fall within the PPM region owing to their medium-low combinations, while BCU has a high-high combination, indicating that the DOM strategy is most appropriate following step 3(d) (Fig. 1) of the proposed model. Considering the criticality of this asset to the client, and the fact that the BCU lies in the DOM region, a further proposal to redesign and improve maintenance routines of the BCU by the RTS project management team using the prescribed AHP decision-making approach for comparison was considered.

Fig. 6
figure 6

PBU components allocated to the decision-making grid

For the final step considering the BCU maintenance improvement as shown in step 4 of Fig. 1, the first three hierarchies were identified: BCU maintenance improvement (goal) as shown in step 4(a) of Fig. 1; availability and cost as criteria from the project team (level two); software upgrade (SU), reduced maintenance periodicity (RMP), and additional brake redundancy (level three) as shown in step 4(b) of Fig. 1. The data for the AHP were obtained through a rigorous group decision process where aggregated weights for criteria and alternatives were established as shown in Appendix B (Tables 11, 12, 13, and 14) using Tables 5, 6, and 7. The synthesised pairwise comparison for the six decision makers was conducted between the two main criteria in terms of priority trade-off using Eqs. (17)–(23), where fleet availability was maximised (69.6%) compared to the minimised cost (30.4%) with CR less than 10% and a very high group consensus at 81.76%, as shown in Appendix B (Table 11). However, further breakdown of the criteria weight aggregation from Appendix B (Tables 12 and 13) shows that all the decision makers regarded availability as the most dominant criteria, except the procurement manager who prioritised cost (66.7%) instead of availability (33.3%). Similarly, the pairwise comparison was conducted for the improvement alternatives using the priority list in Table 6 against the two consolidated criteria as shown in step 4(c) of Fig. 1. The overall result indicated that the SU was the best choice among the decision makers for the BCU maintenance improvement with an overall consolidated global priority of 47.3% compared to ABR with 39.8% and the least viable RMP with 12.7% with CR less than 10% and very high group consensus at 91.4%, as shown in Appendix B (Table 14) using Table 7 as shown in step 4(d) of Fig. 1. Additionally, the results revealed that the SU as consensus choice had a considerable impact on BCU maintenance improvement strategy. The detailed analysis results, including breakdown of the group participants’ decision-making aggregated results and matrices based on each consolidated criterion and alternative by the RTS project team with respect to the BCU improvement, are shown in Appendix B (Tables 11, 12, 13, 14, 15, 16, 17 and 18) (Fig. 7).

Fig. 7
figure 7

Improvement selection hierarchy for the BCU

The results indicate the proposed hybrid model can identify RTS subsystem overall risk level, reliability aspects, and further BCU improvements that boosted the respective asset maintenance strategies toward the desirable low-low region of the decision map. As noted above, the AHP results revealed that SU was the most favoured option and therefore is currently being implemented by the organisation across a fleet of trains. Upon completion of the modification and improvement needed to move the BCU to the desired region of the decision map, maintenance plans and asset strategies can then be stored in the MMS database for access and implementation by engineers and technicians. When new or significant historical data or changes in the use and operation of the train are made available, the model can then be iterated again to improve the reliability, risk, and maintenance of the subsystem as part of the continuous improvement process illustrated in steps 1–4 of Fig. 1. This hybrid approach is not an end in itself; optimal benefits can only be achieved by instituting it as a means of continuous improvement. Furthermore, the overall goal should be constant migration to and retention of all components within or as close as possible to the low-low region of the decision map. Thus, it can be constituted as a model capable of operating across the three main life cycles of RTS.

5 Concluding Remarks

This paper introduced a new hybrid model that considers the upstream elements of reliability and risk assessment as well as the downstream element of maintenance decision-making techniques to improve the performance and maintenance strategy of an RTS PBU subsystem. Unlike previous maintenance improvement strategies that were purely theoretical and therefore subject to bias, in this study, each of the identified improvement actions (i.e., SU, ABR, and RMP) were presented to a team of project stakeholders that represented both the client and supplier to enable effective group decision-making using AHP mathematical techniques by the six senior management team members. The proposed hybrid model offers the following significant advantages relative to previous hybrid models:

  • The model is practical in that it provides optimal cost savings while ensuring RTS reliability and enables risk analysis with respect to the simultaneous allocation of appropriate asset management strategies for systems and subsystems at the design phase of the product life cycle.

  • The model offers considerable benefits in terms of enforcing alignment among different project teams in coordinating their efforts at the early stages of product development. This helps to balance the competing factors of asset performance and risk and maintenance requirements within a single framework, especially in an RTS organisation.

  • More importantly, the model allows for further improvement of subsystems (such as DOM) allocated to the high-high region of the decision map by using the features of multi-criteria decision-making of AHP to select alternatives improvement against criteria to enable the identification of the best enhancement strategy at an optimal cost.

  • Finally, the new hybrid model provides robustness, versatility, and compelling synthesis of practical engineering approaches and academic rigour in evaluating risk, reliability, and maintenance requirements as a single entity at an opportune phase (design) including other life cycle phases for an RTS asset. In this manner, the proposed method provides a robust alternative to RCM and other hybrid reliability and maintenance models.

Although the model provides several advantages, the availability of reliable and quality data, especially for novel systems, and the effects of dynamic interaction between subsystems of a complex system may serve as a limitation for the holistic application of the proposed model. Therefore, further study is recommended for a dynamic hybrid model that considers additional factors such as multiple failures with changing operational conditions for applications in complex systems such as RTS.