1 Background

Subways are one of the main modes of urban transport, as they are closely tied to passengers' daily travel [1]. However, they carry a high risk of loss or destruction of assets and human life. The main causes of derailment and collision are rail crack incidents (RCI) [2]. Rail cracks can easily lead to shelling defects on the surface of rails, causing track irregularity, derailments and collisions [3]. In Europe, for example, hundreds of broken rails are caused by rail cracks, and the annual cost of repairing rail damage is estimated at 300 million Euros [4]. Although many precautions are taken to ensure a reliable and punctual subway service in Hong Kong, the number of subway incidents is increasing, with RCI being the fastest-growing cause of Mass Transit Railway (MTR) delay incidents, having risen by 200 % from 2008 to 2010 [5]. In particular, more RCI occurred in the first 2 months of 2011 than in the whole of 2008.

A rail crack is defined by the International Union of Railways (UIC) as a rail which has one or more gaps of no set pattern, apparent or not, the progression of which could lead to a rapid rail breakage, irrespective of the parts of the profile concerned [5]. There are several causes of RCI. Rail crack initiation life is very sensitive to hydrostatic stress, which becomes larger when the wheel load and friction coefficient increase [7]. Axle load, crack location, crack size and rail metallography have also been studied by fracture mechanics to analyse their effects on fatigue crack growth [7]. A rail crack growth model has been established, the effects of nine operational environment factors compared, and three factors (thermal tension, track curvature and residual stress) identified as having the most impact [8].

Human errors (HE) play a major part in RCI. For example, welds are the most vulnerable component in the rail [5] and can easily become defective through HE made by designers, manufacturers, operators or maintainers (DMOM). Examples are design deficiencies caused by the designer, defective weld joints caused by the manufacturer, excessive speed or loads caused by the operator, and rail corrosion caused by poor inspection and maintenance. Previous studies indicate that these skill-based HE can occur at any time [9]. In their respective working contexts, the DMOM are often involved in a sequence of events leading to an incident or accident [10], with poor inspection and maintenance being only the final act of a long and complex chain of organisational and systemic errors.

Although there has been a large amount of research on the identification of causal factors in the field of rail crack management and prevention, most has been based on laboratory experiments and field tests from a technical perspective. However, although abnormal or unsafe states of material and machinery are the immediate factors for RCI, the DMOM HE mentioned above are the root causes of the incidents. Assessing the impact of HE is difficult with traditional technical approaches, which focus more on providing an identification or prediction tool based on laboratory experiments and field tests of special cases [11, 12]. As a result, no empirical studies have yet been conducted to assess the impact of HE on RCI from a management perspective.

Doing this requires a complex model to represent the relationships between the human participants involved, as each is dynamically affected by the others both directly and indirectly. Fault tree analysis (FTA) can be used for this purpose: it provides an understanding of the logic leading to an unwanted event through a top-down deductive failure analysis, in which Boolean logic combines a series of lower-level events to analyse the undesired state of a system. Although the method has been used to identify the main parties in the maritime transport system and their critical activities [10], FTA’s weakness is that it cannot describe the causal relationships among participants or make inferences concerning the probability of events occurring.

Bayesian belief networks (BBN), on the other hand, have been successfully used to integrate the analysis of human and hardware failure in studying the possible association of HE with fire incidents in subway operation [13] and in evaluating the effects of organisational factors in railway operation on signal-passed-at-danger incidents [14]. Both these applications of BBN in railway risk management demonstrate an efficient way of understanding how HE and organisational failure contribute to railway incidents. In doing this, BBN is able to identify possible configurations of events leading to an incident and to clarify the interactions of the factors involved [10]. BBN represents a formalism in the risk analysis domain due to its ability to deal with probabilistic data and to model the interdependencies of events by the use of arrows and conditional probabilities [15]. It is also one of the simplest approaches in sensitivity analysis and works well even when the number of factors is relatively small [10].

Hence, the primary purpose of this paper is to assess the impact of HE from DMOM on RCI and provide a means of identifying their sources. The BBN approach is used to aid this process and is described in the following section prior to an illustrative application in the form of a case study of 14 recent MTR RCI in Hong Kong.

2 Research Methodology—The Bayesian Network Approach

The term “Bayesian network” (BN) was coined by Pearl in 1985 [16] and denotes a directed acyclic graphical model, or belief network. In this probabilistic graphical model, a set of random variables and their conditional dependencies is represented by a directed acyclic graph (DAG). A BN can thus represent probabilistic relationships: for example, the network can be used to compute the probabilities of the presence of various faults once the symptoms involved are given. The nodes represent variables, and nodes that are not connected represent variables that are conditionally independent of each other. The edges represent conditional dependencies, each node being associated with a probability function that takes as input a particular set of values for the node’s parent variables and gives the probability of the variable represented by the node. The corresponding states are reflected by a conditional probability table (CPT), as exemplified in Fig. 1.

Fig. 1 Typical Bayesian network and conditional probability table

In the discrete case, Bayes’ theorem relates the conditional and marginal probabilities of events X and Y, provided the probability of X does not equal zero:

$$ P(Y|X) = \frac{P(X|Y)\, \cdot \,P(Y)}{P(X)}. $$
(1)

In Bayes’ theorem, each probability has a conventional name:

  • P(X) is the prior probability of occurrence of X which, if the node has no parent nodes, is provided by statistical analysis of historical data, expert assessment or a predictive model based on past data;

  • P(Y) is the marginal probability ignoring the states of X, estimated by Bayesian theory;

  • P(Y|X) is the distribution of occurrence of Y given the occurrence of X; and

  • P(X|Y) is the distribution of occurrence of X given the occurrence of Y; the value of P(X|Y) can be translated into “given a rail crack incident, what is the likelihood that it has occurred due to HE of the designer, manufacturer, operator or maintainer?”
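As a minimal sketch of how the theorem is applied, the following Python fragment computes P(X|Y) for a binary cause X and a binary result Y. The function name `posterior` and the three input probabilities are hypothetical and chosen only for illustration; they are not taken from the case study.

```python
def posterior(prior_x: float, p_y_given_x: float, p_y_given_not_x: float) -> float:
    """Return P(X|Y) from P(X), P(Y|X) and P(Y|not X) using Bayes' theorem."""
    # Marginal P(Y) by the law of total probability over X and not-X.
    p_y = p_y_given_x * prior_x + p_y_given_not_x * (1.0 - prior_x)
    if p_y == 0.0:
        raise ValueError("P(Y) is zero, so the posterior is undefined")
    return p_y_given_x * prior_x / p_y


# Hypothetical inputs (assumptions, not case-study data):
# P(X) = 0.2, P(Y|X) = 0.9, P(Y|not X) = 0.3.
p_x_given_y = posterior(prior_x=0.2, p_y_given_x=0.9, p_y_given_not_x=0.3)
print(round(p_x_given_y, 4))  # 0.4286
```

With these illustrative inputs, observing Y raises the probability of X from 0.2 to about 0.43, which is the kind of update from prior to posterior that the importance analysis relies on.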

3 Case Study: Hong Kong MTR Rail Crack Incidents

The BN approach is applied to a set of RCI that occurred on Hong Kong’s MTR system over the 2008 to 2011 period. The aim is to (1) identify the participants among the DMOM contributing most to the RCI through human error, and (2) provide effective strategies for the risk management of RCI by sensitivity analysis. Although the sample is small because rail cracks are unusual incidents on the MTR, the human error risk analysis of this type of incident is still of interest to government, managers and passengers.

The case study is carried out in three steps comprising incident analysis, qualitative model formulation and sensitivity analysis [17]. The background of the collected incidents is introduced in the incident analysis step; a qualitative model is formulated in the second step, based on the functions identified in the DMOM analysis and an analysis of the relationships among these functions; and an importance model and a sensitivity analysis model are presented in the last step.

3.1 General Description of the RCI

The primary causes of the RCI are summarised in Table 1. All the data used in the research were collected from rail investigation reports on the website of the Hong Kong Legislative Council and cover 14 RCI. Of these, two occurred in 2008, six in 2010, and three in 1 month in 2011. Only 30 % of the RCI were found during general maintenance check-ups.

Table 1 Root causes of RCI

It is also noted that, although RCI on the MTR that lead to isolated transverse fractures are less likely to cause train derailments, transverse fractures cause many other costs in inspections, train delays, remedial treatments, pre-treatments, derailments and loss of business confidence and customer support [18].

3.2 Qualitative Model Formulation and the BN Model

At this stage, a qualitative model formulation is first established based on a literature review and consultation with MTR operators, engineers and maintenance staff. Researchers have focused mainly on identifying the technical causal factors related to RCI, such as axle load, vehicle speed and traffic density [6]. However, an accident or incident is a consequence of a sequence of HE and associated unsafe behaviour [19]. The technical causal factors identified and classified are used to match the HE of the DMOM with the technical failure in the RCI report [6].

Jeong’s functions allow a better understanding and clarification of the duties of the four DMOM participants. Analysis of the MTR RCI shows that most incidents are caused not by a single HE but by a causal sequence [20]. For example, if corrosion is the direct reason for rail degradation, the problem may be one that the designer should have considered, in which case the designer’s ignorance of the problem is the indirect reason for the corrosion. Daily maintenance inspection also determines whether or not an RCI occurs.

As there are many indirect reasons for RCI, translating all the cause–effect relationships involved into a BBN would require a large number of incident cases to analyse. Therefore, in view of the small number of samples collected, both direct and indirect reasons are grouped into four broad categories denoting the different participants involved, namely Designer (Des), Manufacturer (Man), Operator (Ope) and Maintainer (Mai), in order to establish a qualitative model. Hence, there are four causal nodes in the BN model. In addition, two symptom nodes are needed to denote whether the stress concentration is material defect-related (MatSC) or not (Non-MatSC). A qualitative model established with FullBNT in MATLAB is shown in Fig. 2 (the computer program is provided in the Appendix), in which the relationships between variables are represented by the arcs in the BN.

Fig. 2 Qualitative model formulation and the Bayesian network model
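The node set described above can be written down as a parent map and checked for acyclicity, as in the following hedged Python sketch. It is not the FullBNT program from the Appendix. Only arcs that the text itself implies are listed (Des and Man into Ope, Ope into Mai, and the two symptom nodes into RCI); the arcs from the causal nodes into MatSC and Non-MatSC are drawn in Fig. 2, cannot be recovered from the text, and are therefore left empty here as an explicit assumption. The identifiers `PARENTS` and `topological_order` are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Child -> set of parents. Only arcs implied by the text are included;
# the parents of the symptom nodes (shown in Fig. 2) are deliberately omitted.
PARENTS = {
    "Des": set(),
    "Man": set(),
    "Ope": {"Des", "Man"},        # Ope-HE is induced by Des-HE or Man-HE
    "Mai": {"Ope"},               # Mai-HE is conditioned on Ope-HE
    "NonMatSC": set(),            # assumption: parents not recoverable here
    "MatSC": set(),               # assumption: parents not recoverable here
    "RCI": {"NonMatSC", "MatSC"}, # RCI is conditioned on both symptom nodes
}


def topological_order(parents):
    """Return the nodes so that every parent precedes its children.

    Raises graphlib.CycleError if the parent map is not a DAG.
    """
    return list(TopologicalSorter(parents).static_order())


order = topological_order(PARENTS)
print(order)
```

A valid ordering exists only if the graph has no directed cycle, which is the defining property of a BN structure; the same ordering is what a toolbox needs in order to evaluate parents before children.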

3.3 Importance and Sensitivity Analysis

3.3.1 Prior and Conditional Probabilities

There are two ways of assessing the prior and conditional probabilities: objective-based and subjective-based. Which should be used depends on whether the probability distribution of the occurrence of the factors can be obtained from the data. The objective-based prior probabilities magnify the uncertainty of the occurrence of the events. Therefore, as the RCI cases being analysed are collected from the Legislative Council in Hong Kong, the prior and conditional probability analyses are conducted based on the subjective method.

The prior probability is defined as the frequency of occurrence of the cause and symptom events within the collected samples, so that P(X) equals the number of HE divided by the number of RCI. There are two possible values for each event (H = occurs, N = does not occur). The prior probabilities of the HE of designers (Des-HE) and manufacturers (Man-HE) are the occurrence frequencies of Des-HE and Man-HE before the evidence is taken into account. The prior probability distribution is a necessary input in calculating the marginal probability and posterior probability. As shown in Table 2, the prior probability of node Des-HE is lower than that of node Man-HE. Because there are too few cases, the same cases had to be used for both the prior probability estimation and the BN model analysis.

Table 2 Prior probabilities of nodes Des-HE and Man-HE

The conditional probability is the probability of event X given the occurrence of another event Y, and is written in the form P(X|Y). The HE of the operator (Ope-HE) is induced not only by the knowledge and skill of the operators but also by Des-HE or Man-HE. According to the relationships in the collected cases, the conditional probability of Ope-HE, P(Ope-HE|Des-HE), is equal to the joint probability of Ope-HE and Des-HE, P(Des-HE,Ope-HE), divided by the probability of Des-HE. The conditional probability of the HE of the maintainer (Mai-HE) given the occurrence of Ope-HE, P(Mai-HE|Ope-HE), is calculated by the same approach (see Table 3).

Table 3 Conditional probabilities of nodes Ope-HE and Mai-HE
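The frequency-based estimation described above (joint frequency divided by the frequency of the conditioning event) can be sketched in Python as follows. The incident reports themselves are not reproduced here, so `RECORDS` is synthetic data, constructed as an assumption so that 14 cases give conditional probabilities of about 0.333 and 0.182 for Ope-HE; the function name `conditional` is likewise hypothetical.

```python
from collections import Counter

# Synthetic (Des-HE, Ope-HE) states for 14 incidents; "H" = occurs, "N" = does not.
# Assumption: counts chosen only to be consistent with values of about 0.333 and 0.182.
RECORDS = [("H", "H")] * 1 + [("H", "N")] * 2 + [("N", "H")] * 2 + [("N", "N")] * 9


def conditional(records, child_state, parent_state):
    """Estimate P(child = child_state | parent = parent_state) as joint / marginal frequency."""
    n = len(records)
    joint = Counter(records)
    p_joint = joint[(parent_state, child_state)] / n
    p_parent = sum(c for (p, _), c in joint.items() if p == parent_state) / n
    if p_parent == 0:
        raise ValueError("the conditioning state was never observed")
    return p_joint / p_parent


# Prior of Des-HE: number of cases with Des-HE = H divided by the number of RCI.
prior_des = sum(1 for des, _ in RECORDS if des == "H") / len(RECORDS)

print(round(prior_des, 4))                       # 0.2143
print(round(conditional(RECORDS, "H", "H"), 4))  # 0.3333
print(round(conditional(RECORDS, "H", "N"), 4))  # 0.1818
```

With so few cases, a single additional incident changes these estimates noticeably, which is the practical consequence of the small sample noted in the text.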

P(Ope-HE = H|Des-HE = H) = 0.333 means that the occurrence probability of Ope-HE is 0.333 when Des-HE occurs. When Des-HE does not occur, the occurrence probability of Ope-HE is 0.182. Therefore, we conclude that the occurrence of Ope-HE is induced not only by Des-HE but also by other events. Mai-HE, in turn, always occurs once Ope-HE occurs, because P(Mai-HE = H|Ope-HE = H) equals 100 %. However, we cannot say that Mai-HE is caused only by Ope-HE, because it also has a probability of 0.455 when Ope-HE does not occur. The same interpretation can be used for Non-MatSC, MatSC and RCI (see Tables 4 and 5).

Table 4 Conditional probabilities of Non-MatSC
Table 5 Conditional probabilities of MatSC and RCI

3.3.2 Importance Analysis Based on Bayesian Inference

In order to identify which participant has the most effect on the occurrence of RCI, an importance analysis is conducted by Bayesian inference. The marginal probability of the HE of the operator (Ope-HE) is

$$ P\left( {Ope - HE = H} \right) = P\left( {Man - HE = H;Ope - HE = H} \right) \, + \,P\left( {Man - HE = N;Ope - HE = H} \right). $$
(2)

This is used to calculate the marginal probabilities of causal nodes before the Bayesian inference. The joint probabilities are given by

$$ P\left( {Man - HE;Ope - HE} \right) = P(Man - HE)\, \cdot \,P\left( {Ope - HE|Man - HE} \right) $$
(3)

and hence,

$$ P\left( {Ope - HE = H} \right) = \mathop \sum \nolimits_{Man - HE} P(Man - HE)\, \cdot \,P\left( {Ope - HE = H|Man - HE} \right) . $$
(4)
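The marginalisation step can be illustrated with the conditional probabilities quoted in Sect. 3.3.1 for the pairs Des-HE/Ope-HE and Ope-HE/Mai-HE. The Python sketch below rests on stated assumptions: the rounded values 0.333, 0.182 and 0.455 are read as the exact fractions 1/3, 2/11 and 5/11, the prior P(Des-HE = H) is taken as 3/14 (inferred from those fractions and the 14 cases, since Table 2 is not reproduced here), and each node is treated as having a single parent. The function name `marginal_h` is hypothetical.

```python
from fractions import Fraction as F

# Assumed inputs (see the lead-in): exact fractions behind the rounded values in the text.
P_DES_H = F(3, 14)                                  # assumed prior of Des-HE = H
P_OPE_H_GIVEN_DES = {"H": F(1, 3), "N": F(2, 11)}   # P(Ope-HE = H | Des-HE)
P_MAI_H_GIVEN_OPE = {"H": F(1), "N": F(5, 11)}      # P(Mai-HE = H | Ope-HE)


def marginal_h(p_parent_h, p_child_h_given_parent):
    """P(child = H) = sum over parent states of P(parent) * P(child = H | parent)."""
    p_parent = {"H": p_parent_h, "N": 1 - p_parent_h}
    return sum(p_parent[s] * p_child_h_given_parent[s] for s in ("H", "N"))


p_ope_h = marginal_h(P_DES_H, P_OPE_H_GIVEN_DES)
p_mai_h = marginal_h(p_ope_h, P_MAI_H_GIVEN_OPE)

print(p_ope_h)                   # 3/14
print(round(float(p_mai_h), 4))  # 0.5714
```

Under these assumptions the marginal probability of Mai-HE comes out as 8/14, which agrees with the 57.14 % reported for Mai-HE in the text.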

The same kind of analysis can also be carried out for the marginal probabilities of Mai-HE, Non-MatSC, MatSC and RCI as shown in Table 6, which provides the initial risk information involved.

Table 6 Marginal probabilities

Table 6 shows Mai-HE to have the highest probability of occurrence (57.14 %), followed by Man-HE (50 %), while Des-HE and Ope-HE are much lower, with the risk of defects falling between the two groups. While this suggests that Mai-HE and Man-HE might contribute more in leading to an incident, this is not necessarily the case. The contribution is instead determined by the importance of the causal event, defined as the contribution of the event to the incident as represented by the posterior probability in BBN [21], where the posterior probability P(X|Y) means “given a result event Y, what is the likelihood that it is induced by causal event X?” The posterior probability of Des-HE, given the occurrence of a rail crack incident (RCI), is calculated by

$$ I\left( {Des - HE} \right) = P\left( {Des - HE |RCI} \right) = \frac{P(Des - HE;RCI)}{P(RCI)}, $$
(5)

where P(RCI) is the marginal probability of RCI, P(Des-HE; RCI) is the joint probability that Des-HE and RCI occur together, and P(Des-HE; RCI)/P(RCI) is the posterior probability of Des-HE given that an RCI occurred. We define P(Des-HE|RCI) as the importance of Des-HE on the basis of its influence on RCI. The calculation results, obtained with FullBNT in MATLAB (the computer program is in the Appendix), are shown in Table 7.
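The posterior in Eq. (5) can be obtained by enumerating the joint distribution, as in the following Python sketch. Because the conditional probability tables for MatSC, Non-MatSC and RCI are not reproduced here, the sketch uses Mai-HE as a stand-in result node in a simplified chain Des-HE → Ope-HE → Mai-HE. That chain, the exact fractions and the prior 3/14 for Des-HE are assumptions made for illustration, and the names `joint` and `importance` are hypothetical; this is not the authors' FullBNT program.

```python
from fractions import Fraction as F
from itertools import product

# Assumed chain Des-HE -> Ope-HE -> Mai-HE with exact fractions behind the
# rounded conditional probabilities quoted in the text; the prior is inferred.
P_DES = {"H": F(3, 14), "N": F(11, 14)}
P_OPE_GIVEN_DES = {"H": {"H": F(1, 3), "N": F(2, 3)},
                   "N": {"H": F(2, 11), "N": F(9, 11)}}
P_MAI_GIVEN_OPE = {"H": {"H": F(1), "N": F(0)},
                   "N": {"H": F(5, 11), "N": F(6, 11)}}


def joint(des, ope, mai):
    """Joint probability from the chain rule of the assumed network."""
    return P_DES[des] * P_OPE_GIVEN_DES[des][ope] * P_MAI_GIVEN_OPE[ope][mai]


def importance(cause_index, cause_state="H", effect_state="H"):
    """P(cause = cause_state | Mai-HE = effect_state); cause_index 0 = Des-HE, 1 = Ope-HE."""
    numerator = denominator = F(0)
    for des, ope in product("HN", repeat=2):
        p = joint(des, ope, effect_state)
        denominator += p                      # accumulates P(effect)
        if (des, ope)[cause_index] == cause_state:
            numerator += p                    # accumulates P(cause, effect)
    return numerator / denominator


print(importance(0))  # 21/88
print(importance(1))  # 3/8
```

In this toy chain the two causes have the same marginal probability (3/14) yet different posteriors given the result, which is the kind of distinction the importance measure is meant to expose.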

Table 7 Importance of causal events

Here, the importance of Non-MatSC is higher than that of MatSC. In other words, non-material defect-caused stress makes a higher contribution to RCI than material defect-caused stress. In terms of the causal events, Mai-HE and Man-HE have similar probabilities of occurrence, but Mai-HE is more important than Man-HE due to its higher contribution to RCI. Similarly, when comparing the HE of the designer and operator, although their probability of occurrence is the same, the importance analysis indicates that the human error of the operator has more impact on RCI than that of the designer.

As Table 7 shows, the HE of the maintainer makes the greatest contribution to RCI among the DMOM. This result coincides with what happens in practice, as maintenance inspection is the last step prior to the occurrence of RCI. The second most important contributor to RCI, human error of the manufacturer, arises from material defects caused by welding and rail manufacture. Although the importance of Man-HE is lower than that of Mai-HE, it still plays a large role in contributing to RCI. Des-HE, in contrast, makes the smallest contribution.

3.3.3 Sensitivity Analysis

A sensitivity analysis of the RCI is carried out on four causal factors (Des-HE, Man-HE, Ope-HE and Mai-HE) to gauge the robustness of the results and understand how changes in the causal factors influence the probability of occurrence of RCI. A variance-based method of probabilistic sensitivity analysis is used. The approach is

$$ {\text{S}}\left( {\varOmega_{i} } \right) = \frac{{\Delta P\left( {RCI} \right)}}{{\Delta P\left( {\varOmega_{i} } \right)}}\, = \,\frac{{\Delta P\left( {No{\text{n}} - {\text{MatSC}},{\text{MatSC}},{\text{RCI}}} \right)}}{{\Delta P\left( {\varOmega_{i} } \right)}}\,\, = \,\frac{{\Delta P(\varOmega_{1,i} ,\overline{{\varOmega_{1,i} }} )\, \cdot \,P(\varOmega_{2,i} ,\overline{{\varOmega_{2,i} }} )}}{{\Delta P\left( {\varOmega_{i} } \right)}}\, \cdot \,A, $$
(6)

where

$$ A = P(\varOmega_{1,i} ,\overline{{\varOmega_{1,i} }} |Non - MatSC)\, \cdot \,P(\varOmega_{1,i} ,\overline{{\varOmega_{1,i} }} |MatSC)\, \cdot \,P(RCI|Non - MatSC,MatSC) $$
(7)

is a constant, \( \varOmega_{1,i} \) is the set of non-material caused stress, \( \overline{{\varOmega_{1,i} }} \) is the complementary set of \( \varOmega_{1,i} \), \( \varOmega_{2,i} \) is the set of material caused stress, \( \overline{{\varOmega_{2,i} }} \) is the complementary set of \( \varOmega_{2,i} \) and \( {\text{S}}\left( {\varOmega_{i} } \right) \) is a relative indicator representing the sensitivity of RCI to the probability of HE from the four participants. This framework provides 4 × 4 experiments with four causal factors in four states. The \( {\text{S}}\left( {\varOmega_{i} } \right) \) results are shown in Table 8 for ±20 and ±10 % of the initial \( P_{0} \left( {\varOmega_{i} } \right) \) value.
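The perturbation scheme (changing an initial prior by ±10 and ±20 % and taking the ratio of the resulting change in the target probability to the change in the prior) can be sketched in Python as below. As the conditional probability tables for the symptom nodes and RCI are not reproduced here, the sketch uses P(Mai-HE = H) in a simplified chain Des-HE → Ope-HE → Mai-HE as a stand-in target. The chain, the exact fractions, the baseline 3/14 and the names `p_target` and `sensitivity` are all assumptions for illustration.

```python
from fractions import Fraction as F

P0_DES = F(3, 14)  # assumed baseline prior of Des-HE = H


def p_target(p_des):
    """P(Mai-HE = H) in the assumed chain Des-HE -> Ope-HE -> Mai-HE."""
    p_ope = p_des * F(1, 3) + (1 - p_des) * F(2, 11)  # marginalise over Des-HE
    return p_ope * F(1) + (1 - p_ope) * F(5, 11)      # marginalise over Ope-HE


def sensitivity(p0, relative_change):
    """Finite-difference ratio: change in P(target) divided by change in the prior."""
    p1 = p0 * (1 + relative_change)
    return (p_target(p1) - p_target(p0)) / (p1 - p0)


for change in (F(-20, 100), F(-10, 100), F(10, 100), F(20, 100)):
    print(float(change), sensitivity(P0_DES, change))
```

In this stand-in chain the target probability is linear in the prior, so the ratio is the same (10/121) for all four perturbations; a different pattern would require the full network and its tables.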

Table 8 Sensitivity of RCI to changes in \( P_{0} \left( {\varOmega_{i} } \right) \) values

This shows that RCI are most sensitive to the probability of Des-HE, with the sensitivity becoming sharper as the probability of Des-HE increases. It also indicates that Des-HE, although having the smallest contribution, has the greatest marginal utility on RCI. Design is the first stage in the life of a rail so that any defects occurring in this stage affect the rail state of the following three stages, involving additional work by the manufacturer, operator and maintainer. Therefore, the greatest marginal utility coincides with the case in practice. The same result applies to Man-HE but with less sensitivity than Des-HE. Figures 3 and 4 summarise the results.

Fig. 3 Sensitivity of RCI to Des-HE

Fig. 4 Sensitivity of RCI to Man-HE

Unlike for Des-HE and Man-HE, the \( {\text{S}}\left( {\varOmega_{i} } \right) \) of Ope-HE and Mai-HE decreases as \( {\text{P}}_{0} \left( {\varOmega_{i} } \right) \) increases, because an increase in the probability of Ope-HE and Mai-HE cannot induce more RCI. As Figs. 5 and 6 show, Mai-HE is more sensitive than Ope-HE. In fact, the contribution of Mai-HE to RCI is the largest among the DMOM, which indicates that the greatest benefits will be obtained by reducing human maintenance error.

Fig. 5 Sensitivity of RCI to Ope-HE

Fig. 6 Sensitivity of RCI to Mai-HE

4 Conclusions

RCI increase as the subway network becomes more complex and important to community life. In seeking to deliver more effective strategies for risk mitigation, therefore, it is most important to identify the major influencing factors involved. This paper proposes a new method of doing this through the use of BN in developing better risk identification models of RCI. This is particularly useful when the HE of different participants is a crucial issue, as it can deal efficiently with small samples and clarify the causal relationships between the associated latent and observed variables/factors. A case study demonstrates the use of the method for all the RCI occurring in Hong Kong’s MTR system for the period 2008 to 2011, including the HE of the four participants: designers, manufacturers, operators and maintainers.

The results confirm that, firstly, the maintenance stage is crucial for RCI risk reduction, as mistakes at this stage contribute over 70 % to RCI. Secondly, factors with a higher probability of occurrence contribute more in leading to the incident. Thirdly, RCI is most sensitive to the designers’ probability of HE. Because design is the first stage in the life of the rail, any defects that occur in this stage can induce subsequent mistakes in the following stages. Fourthly, in contrast with the operation and maintenance stages, efforts to improve the design and manufacture stages have the greatest marginal utility.

Importantly, the identification of the maintainer’s ability in the case study as the most important factor influencing the probability of RCI implies a priority need to strengthen the maintenance management of the MTR system, and suggests that improving the inspection ability of the maintainer is likely to be an effective strategy for RCI risk mitigation.

However, this study also has its limitations. First, the qualitative model framework is not sufficiently exhaustive to reflect the real sequence of causal relationships of RCI. Because RCI are unusual incidents on the MTR, the number of cases is too small to conduct a basic causal factor analysis. Second, the importance analysis is conducted using a single event (such as a rail crack), and so does not take into account any other events. Third, the sensitivity analysis simply observes the quantitative variation of RCI risk in terms of four causal factors, and does not consider the economic impact of RCI or the cost of improvement. Further studies are needed to address these deficiencies. Despite all this, the research framework and methodology are quite general and clearly suitable for use as a support tool for risk management and decision-making processes in a wide variety of applications beyond RCI.