6.1 Introduction

Reliability of systems is of crucial importance in all aspects of human life. Systems are understood to be groupings of components in a specific structure, with the functioning of the system depending on the functioning of the components and on the system structure. Uncertainty about the functioning of components therefore leads to uncertainty about system reliability. To study how system reliability depends on the reliability of components, several leading methods from the engineering literature are briefly introduced in Sect. 6.2. In situations of uncertainty about reliability in engineering, appropriate statistical methods are required to deal with the specific nature of the data. This chapter provides an overview of such basic methods. Section 6.3 introduces the key statistical concepts, and basic statistical models are presented in Sect. 6.4. Throughout, the emphasis is on explanation of the likelihood function, which is at the heart of most statistical inference approaches commonly used in reliability applications. Unknown model parameters can conveniently be estimated by maximisation of the likelihood function, with corresponding theory to assess the uncertainty of the estimates. Bayesian methods for statistical inference are based on the likelihood function as well; hence, understanding the likelihood function is crucial for the study of system reliability under uncertainty. Stochastic process models are also crucial for describing reliability that varies over time, for example, to reflect the effects of maintenance on a system’s reliability. A short introduction to such models is presented in Sect. 6.5, again with an emphasis on deriving the likelihood function to enable statistical inference. As this chapter brings together generic introductory material, no specific references are included throughout the text. Instead, the chapter ends with a brief list of useful resources, pointing to a number of important books which are highly recommended for further reading, and some brief comments about journals in the field.

6.2 System Reliability Methods

When modelling systems there are a number of tools available, ranging from combinatorial methods, including reliability block diagrams, fault tree analysis and event tree analysis, to more complex methods that cater for a greater range of system characteristics, including dependencies, e.g. Markov methods or Monte Carlo simulation. Note that when modelling systems involving repair, the repair times typically do not follow the Exponential distribution; the Lognormal or Weibull distributions may be more suitable. Furthermore, including maintenance teams in a model can introduce dependencies via prioritisation strategies. In such instances, more complex methods are required. A brief introduction to some of these modelling methods is provided in this section.

6.2.1 Fault Tree Analysis

One of the most common quantification techniques is fault tree analysis. This method provides a diagrammatic description of the various causes of a specified system failure in terms of the failures of its components. The choice of the system failure mode often follows from a failure mode and effects analysis (FMEA). The standard analysis assumes that component failures are independent. Logical gates are used to link together events (intermediate events shown as rectangles and basic events, representing component failure events, as circles), where the most common gates are the AND and OR gates, shown in Fig. 6.1, with an example tree shown in Fig. 6.2. Evaluation of the tree using Boolean algebra yields the minimal cut sets, denoted by \(C_i\), which are combinations of component failures that are necessary and sufficient to cause failure of the system. Application of kinetic tree theory and the inclusion-exclusion principle (Eq. 6.1) enables the system unavailability (\(Q_{sys}\)) performance measure to be calculated, where \(P(C_i)\) is the probability of failure of minimal cut set \(C_i\).

Fig. 6.1 Common fault tree gate types

Fig. 6.2 Example fault tree structure

$$\begin{aligned} Q_{sys} &= \sum _{i=1}^{N_c} P(C_i) - \sum _{i=2}^{N_c} \sum _{j=1}^{i-1} P(C_i\cap C_j) + \sum _{i=3}^{N_c} \sum _{j=2}^{i-1} \sum _{k=1}^{j-1} P(C_i\cap C_j\cap C_k) - \ldots \\ &\quad + (-1)^{N_c+1} P(C_1\cap C_2\cap \ldots \cap C_{N_c}). \end{aligned}$$
(6.1)

If the unavailability of the system does not meet performance acceptability criteria, then the system must be redesigned. An indication of where to make changes in the system can be obtained by generating component importance measures, which quantify the contribution that each component makes to system failure. One such measure is the Fussell–Vesely measure of importance (\(I_{FV_i}\)), defined as the probability of the union of the minimal cut sets containing component i, given that the system has failed, as shown in Eq. 6.2.

$$\begin{aligned} I_{FV_i} = \frac{P\left( \bigcup \{C_j | i \in C_j\}\right) }{Q_{sys}} \end{aligned}$$
(6.2)
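As an illustrative sketch (not part of any standard fault tree package), the following Python fragment evaluates Eq. 6.1 and Eq. 6.2 for a small hypothetical set of minimal cut sets, assuming independent component failures with given unavailabilities; the exact inclusion-exclusion sum is only feasible for a modest number of cut sets, and commercial tools typically truncate the expansion or use bounds instead.

```python
from itertools import combinations
from math import prod

def cutset_prob(cutsets, q):
    """P(intersection of the given cut sets), assuming independent components:
    the product of the unavailabilities of all distinct components involved."""
    comps = set().union(*cutsets)
    return prod(q[c] for c in comps)

def union_prob(cutsets, q):
    """P(at least one of the cut sets fails), via inclusion-exclusion (Eq. 6.1).
    Exact, so only feasible for a modest number of cut sets."""
    total = 0.0
    for k in range(1, len(cutsets) + 1):
        sign = (-1) ** (k + 1)
        for combo in combinations(cutsets, k):
            total += sign * cutset_prob(combo, q)
    return total

def fussell_vesely(i, cutsets, q, q_sys):
    """Fussell-Vesely importance of component i (Eq. 6.2)."""
    containing = [c for c in cutsets if i in c]
    return union_prob(containing, q) / q_sys

# Hypothetical example: three minimal cut sets over components A, B, C, D
q = {"A": 0.01, "B": 0.02, "C": 0.05, "D": 0.03}
cutsets = [frozenset("AB"), frozenset("AC"), frozenset("D")]
q_sys = union_prob(cutsets, q)
print(q_sys, {c: fussell_vesely(c, cutsets, q, q_sys) for c in q})
```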

6.2.2 Fault Tree Extensions: Common Cause Failures

Typically, fault trees have only a limited capability to cater for dependencies within systems. One example is that of common cause failures. Safety systems, for example, often feature redundancy, incorporated such that they provide a high likelihood of protection. However, redundant sub-systems or components may not always fail independently. A single common cause can affect all redundant channels at the same time; examples include ageing (all channels made at the same time from the same materials), system environment (e.g. pressure or stress related) and personnel (e.g. maintenance incorrectly carried out by the same person). There are several methods to analyse such occurrences, including the beta factor, limiting factor, boundary and alpha methods. The beta factor method assumes that the common cause effects can be represented as a proportion of the failure probability of a single channel of the redundant configuration. Hence, it assumes that the total failure probability (\(Q_T\)) of each component is divided into two contributions: (i) the probability of independent failure, \(Q_I\), and (ii) the probability of common cause failure, \(Q_{CCF}\). The parameter \(\beta \) is defined as the ratio of the common cause failure probability to the total failure probability, as shown in Eq. 6.3.

$$\begin{aligned} \beta = \frac{Q_{CCF}}{Q_{CCF}+Q_{I}} = \frac{Q_{CCF}}{Q_{T}}. \end{aligned}$$
(6.3)
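As a minimal numerical sketch of the beta factor split in Eq. 6.3, the following hypothetical example divides a channel’s total failure probability into its independent and common cause contributions and uses these for a one-out-of-two redundant pair, assuming the common cause event fails both channels simultaneously; the chosen values of \(Q_T\) and \(\beta\) are illustrative only.

```python
def beta_factor_split(q_total, beta):
    """Split a channel's total failure probability into independent and common
    cause contributions, Q_T = Q_I + Q_CCF with beta = Q_CCF / Q_T (Eq. 6.3)."""
    q_ccf = beta * q_total
    q_i = q_total - q_ccf
    return q_i, q_ccf

# Hypothetical 1-out-of-2 redundant pair: both channels must fail for system failure.
# Independent failures must coincide, while a single common cause fails both channels.
q_i, q_ccf = beta_factor_split(q_total=1e-3, beta=0.1)
q_pair = q_i**2 + q_ccf   # approximate unavailability of the redundant pair
print(q_i, q_ccf, q_pair)
```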

There are further extensions to traditional fault tree analysis that cater for a greater range of dependencies, e.g. dynamic fault trees. As the nature of the dependency becomes more complex, other modelling methods become more suitable. One type of dependency is standby redundancy, where the probability of failure of the redundant component may change once it starts to function and experience load; it therefore depends on the operating component, so the failures of the two are not statistically independent. Another form is multiple component failure modes, where a component can exist in more than one failure state; since these states are mutually exclusive, they cannot be treated as independent basic events in the fault tree (not failing in the open mode does not mean that the component works successfully). In such instances, Markov modelling methods may be desirable. Given the state-space explosion that can occur when using Markov approaches, if only small subsections of the system exhibit the dependency it may be possible to analyse these subsections with Markov methods and embed the results in the fault tree approach.

6.2.3 Phased Mission Analysis

As systems become more complex, they can be required to undertake a number of tasks, typically in sequence. An example is an aircraft flight, where the aircraft is required to taxi from the stand, take off, climb to the required altitude, cruise, descend, land and taxi to the destination stand. The collection of tasks can be referred to as a mission, where each task is denoted by a phase which has an associated time period. Mission success requires successful completion of all phases, and there may be different consequences resulting from failure in each phase. For each phase, an appropriate modelling technique can be employed to assess its reliability or availability. When the system’s phases are non-repairable, fault tree analysis can be used to assess mission and phase success. For repairable scenarios, Markov or Petri net methods can be used. The parameter of interest is the mission reliability. It is not appropriate to analyse the reliability of each phase and multiply these together to get the mission reliability because (i) the phases are not independent (i.e. failure in one phase may influence failure in another phase); (ii) the assumption that all components are working at the start of each phase is not correct and (iii) the system can fail on a phase change. For the non-repairable case, the initial step is to construct a mission fault tree, whose general form is shown in Fig. 6.3. The top event is connected by a mutually exclusive OR gate: the input events (failure in a particular phase, given success in the preceding phases) are mutually exclusive, and any one of them causes the output event (mission failure). When considering failures in phase 2 onwards, the fault tree needs to encode that the mission has been successful up to that point, namely, that the system has functioned in the preceding phases. For this reason, a NOT gate is introduced, i.e. under the intermediate event ‘Functions in Phase 1’. Alongside this, the failure of a component in earlier phases also needs to be taken into account, hence the component failure is represented as shown in Fig. 6.4. To perform the analysis, both qualitatively and quantitatively, additional algebra is required, as shown in Fig. 6.5. \(C_j\) corresponds to the failure of component C in phase j, and the bar above \(C_j\) corresponds to the working state. Considering the success of previous phases \(i=1,\ldots ,j-1\) for failure in phase j makes the analysis non-coherent, yielding prime implicant sets from a qualitative analysis (necessary and sufficient combinations of events, both successes and failures). The problem can (and usually does) become prohibitively expensive to solve exactly, so approximations are required; these are usually based on coherent approximations of the non-coherent phases (i.e. conversion of prime implicants to minimal cut sets). Approximate quantification formulae (e.g. the rare event approximation or the minimal cut set upper bound) can then be used.

Fig. 6.3 Mission failure fault tree

Fig. 6.4 Revised component representation

When analysing repairable systems, there are two requirements for mission success: (1) the system must satisfy the success requirements throughout each phase period and (2) at the phase change times the system must occupy a state which is successful for both phases involved. The second point implies that failures on phase transition are taken into account when calculating mission reliability. The analysis of repairable systems is very similar in terms of generating the mission model and phase models; this can be achieved using Markov or Petri net methods. An illustrative Petri net example is shown in Fig. 6.6, where the circles represent places and the rectangular boxes represent transitions. Mission and phase reliabilities can be obtained from analysis of the model.

Fig. 6.5 Additional algebra

Fig. 6.6 Petri Net representation of a phased mission with three phases

6.3 Basic Statistical Concepts and Methods for Reliability Data

Consider a random quantity \(T>0\), often referred to as a ‘failure time’ in reliability theory, although it can denote any ‘time to event’. Relevant notation for characteristics of its probability distribution includes the cumulative distribution function (CDF) \(F(t)=P(T\le t)\), the survival function \(S(t)=P(T>t)=1-F(t)\) (also called the reliability function and denoted by R(t)), the probability density function (PDF) \(f(t) = F'(t)=-S'(t)\) and the hazard rate \(h(t)=f(t)/S(t)\). The hazard rate can be interpreted by conditioning on survival up to time t: for small \(\delta t >0\), \(h(t)\delta t \approx P(T\le t+\delta t \, | \, T>t)\). Harder to interpret, but also of use, is the cumulative hazard function (CHF) \(H(t)=\int _0^t h(x)dx\). Assuming \(S(0)=1\), we get \(H(t) = \int _0^t \frac{f(x)}{S(x)}dx = -\ln S(t)\), so \(S(t)=\exp \{-H(t)\} = \exp \{-\int _0^t h(x)dx\}\).
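These relations can be checked numerically. The following short Python sketch, using an arbitrary Gamma lifetime distribution purely as a stand-in, computes the hazard rate as \(f(t)/S(t)\), integrates it to obtain the CHF, and verifies that \(S(t)=\exp \{-H(t)\}\).

```python
import numpy as np
from scipy import stats, integrate

# Pick any lifetime distribution; a Gamma distribution is used here as an arbitrary example.
dist = stats.gamma(a=2.0, scale=1.5)

def hazard(t):
    # h(t) = f(t) / S(t)
    return dist.pdf(t) / dist.sf(t)

t = 4.0
# Cumulative hazard H(t) = integral of h(x) dx over (0, t]
H, _ = integrate.quad(hazard, 0.0, t)
print(np.exp(-H), dist.sf(t))  # S(t) = exp(-H(t)): both values should agree
```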

A constant hazard rate, \(h(t)=\lambda >0\) for all \(t>0\), gives \(S(t)=e^{-\lambda t}\), the Exponential distribution. This can be interpreted as modelling ‘no ageing’, that is, if an item functions, then its remaining time until failure is independent of its age. This property is unique to the Exponential distribution. An increasing hazard rate models ‘wear-out’, roughly speaking this implies that an older unit has shorter expected residual life, and decreasing hazard rate models ‘wear-in’, implying that an older unit has greater expected residual life. It is often suggested that ‘wear-out’ is appropriate to model time to failure of many mechanical units, whereas electronic units’ times to failures may be modelled by ‘wear-in’. In a human-life analogy, we can perhaps think about ‘wear-in’ as modelling time to death at very young age (‘infant mortality’) and ‘wear-out’ as modelling time to death at older age, with a period in between where death is mostly ‘really random’, e.g. caused by accidents. This ‘human-life analogy’ should only be used for general insight, and is included here as engineers often claim that ‘typical hazard rates’ for components over their entire possible lifetime are decreasing early on, then remain about constant for a reasonable period, and then become increasing (‘bath-tub shaped’).

A popular parametric probability distribution for T is defined by the hazard rate \(h(t) = \alpha \beta (\alpha t)^{\beta -1}\), for \(\alpha , \beta >0\). This leads to \(S(t) = \exp \left\{ -(\alpha t)^{\beta }\right\} \), and is called a Weibull distribution with scale parameter \(\alpha \) and shape parameter \(\beta \). This distribution is often used in reliability, due to the simple form of its hazard rate. With \(\beta =1\) it reduces to the Exponential distribution, while, for example, \(\beta =2\) models ‘linear wear-out’ (‘twice as old, twice as bad’).
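As a small illustration, the sketch below implements the Weibull hazard rate and survival function in the parametrisation above, and cross-checks the survival function against scipy’s weibull_min distribution (with shape \(c=\beta\) and scale \(1/\alpha\)); the parameter values are arbitrary.

```python
import numpy as np
from scipy import stats

def weibull_hazard(t, alpha, beta):
    """h(t) = alpha * beta * (alpha * t)**(beta - 1), as in Sect. 6.3."""
    return alpha * beta * (alpha * t) ** (beta - 1)

def weibull_survival(t, alpha, beta):
    """S(t) = exp{-(alpha * t)**beta}."""
    return np.exp(-(alpha * t) ** beta)

t = np.linspace(0.1, 5.0, 5)
alpha = 0.5
for beta in (0.5, 1.0, 2.0):   # 'wear-in', constant hazard, 'linear wear-out'
    # Cross-check against scipy's parametrisation (shape c=beta, scale=1/alpha)
    ref = stats.weibull_min(c=beta, scale=1.0 / alpha).sf(t)
    assert np.allclose(weibull_survival(t, alpha, beta), ref)
    print(beta, weibull_hazard(t, alpha, beta))
```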

An interesting aspect of reliability data is that they are often affected by censoring, in particular so-called right-censoring. This means that, instead of actually observing a time at which a failure occurs, the information in the data is that an item survived a certain period of time without failing. Clearly, such information must be taken into account, as neglecting it would lead to underestimation of expected failure times.

Two main statistical methodologies use the likelihood function, namely, Bayesian methods and maximum likelihood estimation. Hence, derivation of the likelihood function is an important topic in reliability inference. Let \(t_1,\ldots ,t_n\) be observed failure times, and \(c_1,\ldots ,c_m\) right-censored observations. For inference on a parameter \(\theta \) of an assumed parametric model, the likelihood function based on these data is \(L(\theta | t_1,\ldots ,t_n; c_1,\ldots ,c_m) = \prod _{j=1}^n f(t_j|\theta ) \prod _{i=1}^m S(c_i|\theta )\). This requires the assumption that the censoring mechanism is independent of the failure time distribution; if that is not the case, the dependence would need to be modelled. It is also possible to consider the likelihood over all possible probability distributions, so not restricting attention to a chosen parametric model. In this case, the maximum likelihood estimator is the so-called Product-Limit (PL) estimator, presented by Kaplan and Meier in 1958. The theory of counting processes also provides a powerful framework for nonparametric analysis of failure time data, based on stochastic processes and martingale theory. A well-known result within this theory is the Nelson–Aalen estimator for the CHF, which can be regarded as an alternative to the PL estimator.
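To make the construction of this likelihood concrete, the following sketch computes the negative log-likelihood for the Weibull model of Sect. 6.3 with hypothetical failure and right-censoring times, and maximises it numerically; the data and starting values are illustrative only.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data: observed failure times and right-censored survival times.
failures = np.array([2.3, 4.1, 5.0, 7.8, 9.2])
censored = np.array([3.0, 6.5, 10.0])

def neg_log_lik(params):
    """Negative log-likelihood for the Weibull model of Sect. 6.3,
    L = prod f(t_j) * prod S(c_i), with log-parameters to keep alpha, beta > 0."""
    alpha, beta = np.exp(params)
    log_f = np.log(alpha * beta) + (beta - 1) * np.log(alpha * failures) - (alpha * failures) ** beta
    log_S = -(alpha * censored) ** beta
    return -(log_f.sum() + log_S.sum())

fit = minimize(neg_log_lik, x0=np.log([0.1, 1.0]), method="Nelder-Mead")
alpha_hat, beta_hat = np.exp(fit.x)
print(alpha_hat, beta_hat)
```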

6.4 Statistical Models for Reliability Data

This is a very wide topic, and we can only mention a few important models. We first consider regression models for reliability data. Regression models are generally popular in statistics and are also very useful in reliability applications, where often Weibull models are used, with the survival function depending on a vector of covariates x and given by \(S(t;x) = \exp \left\{ -\left( \frac{t}{\alpha _x} \right) ^{\eta _x} \right\} \) (note that here the scale parameter divides t, a slightly different parametrisation from that in Sect. 6.3). Simple forms are often used for the shape and scale parameters as functions of x, e.g. the loglinear model for \(\alpha _x\), specified via \(\ln \alpha _x = x^T\beta \), with \(\beta \) a vector of parameters, and similar models for \(\eta _x\). The statistical methodology is then very similar to general regression methods, and is implemented in statistical software packages. Such models need to be fully specified, so they are less flexible than nonparametric methods, but they allow information in the form of covariates to be taken into account.
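A minimal sketch of such a Weibull regression model is given below, assuming a loglinear model for the scale parameter and a common shape parameter \(\eta\); the covariates, times and censoring indicators are hypothetical, and in practice one would normally rely on an established package rather than hand-coded optimisation.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data: covariates (incl. intercept), failure/censoring times, and
# censoring indicators (delta = 1 for an observed failure, 0 for right-censored).
x = np.array([[1.0, 0.2], [1.0, 0.5], [1.0, 0.9], [1.0, 1.3], [1.0, 1.8]])
t = np.array([8.1, 6.3, 4.9, 3.2, 2.5])
delta = np.array([1, 1, 0, 1, 0])

def neg_log_lik(params):
    """Weibull regression with loglinear scale, ln(alpha_x) = x^T beta, and a
    common shape eta; contributions are f(t;x) for failures, S(t;x) for censorings."""
    beta, log_eta = params[:-1], params[-1]
    eta = np.exp(log_eta)
    alpha_x = np.exp(x @ beta)
    z = (t / alpha_x) ** eta
    log_f = np.log(eta) - eta * np.log(alpha_x) + (eta - 1) * np.log(t) - z
    log_S = -z
    return -np.sum(delta * log_f + (1 - delta) * log_S)

fit = minimize(neg_log_lik, x0=np.zeros(x.shape[1] + 1), method="Nelder-Mead")
print(fit.x[:-1], np.exp(fit.x[-1]))  # beta estimates and eta estimate
```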

Semi-parametric models enable covariates to be taken into account, but do so without fully specifying a parametric model, keeping more flexibility. Usually, a parametric form for the effect of the covariates on a nonparametric ‘baseline model’ is assumed. Most famous are the Proportional Hazards (PH) models, presented by Cox in 1972. Here, the hazard rate for covariates x is defined by \(h(t;x) = h_0(t)\psi _x\), with \(h_0(t)\) the baseline hazard rate (normally left unspecified, so nonparametric), and \(\psi _x\) some positive function of x, independent of time t (normally a fully parametric form is assumed for \(\psi _x\)). The name of such models results from the fact that \(\frac{h(t;x_1)}{h(t;x_2)} = \frac{\psi _{x_1}}{\psi _{x_2}}\), independent of t, so the hazard rates corresponding to different covariates are in constant proportion. For these models, \(\ln S(t;x) = - \int _0^t h(u;x)du = -\psi _x \int _0^t h_0(u)du = \psi _x \ln S_0(t)\), so \(S(t;x) = \left[ S_0(t) \right] ^{\psi _x}\). These models are used most for survival data in medical applications, but are also common and useful in reliability. As there are no assumptions on the form of the baseline hazard rate, they provide a valuable method to compare the effect of the covariates. An often used PH model is the linear PH model, with \(\psi _x=\exp \{x^T\beta \}\), with \(\beta \) a vector of parameters. We now describe the analysis of this particular model, with no further assumptions on \(h_0(t)\). The goal is to estimate \(\beta \) and \(S_0(t)\), the baseline survival function related to \(h_0(t)\). This is far from trivial, as it is not immediately clear how the likelihood function can be derived: it is neither uniquely defined by a fully specified parametric model, nor completely free as was the case for the fully nonparametric approach (leading to the PL estimate). Hence, we need to use a different concept.

Suppose we have data on n items, consisting of r distinct event times, \(t_{(1)}<t_{(2)}<\ldots <t_{(r)}\) (the case of ties among the event times is a bit more complicated), and \(n-r\) censoring times. Let \(R_i\) be the risk set at \(t_{(i)}\), so all items known to be still functioning just prior to \(t_{(i)}\). We can now estimate \(\beta \) via maximisation of the ‘likelihood function’:

$$\begin{aligned} L(\beta ) = \prod _{i=1}^r \frac{ \exp \left\{ x_{(i)}^T\beta \right\} }{\sum _{l\in R_i} \exp \left\{ x_l^T\beta \right\} } \end{aligned}$$
(6.4)

with \(x_{(i)}\) the vector of covariates associated with the item observed to fail at \(t_{(i)}\), etc. There have been many justifications for \(L(\beta )\); the nicest is the original one by Cox, which is as follows. Consider the risk set \(R_i\) at \(t_{(i)}\). The conditional probability that the item corresponding to \(x_{(i)}\) is the one to fail at time \(t_{(i)}\), given that there is a failure at \(t_{(i)}\), is equal to

$$ \frac{h(t_{(i)};x_{(i)})}{\sum _{l\in R_i} h(t_{(i)};x_l)} = \frac{ \exp \left\{ x_{(i)}^T\beta \right\} }{\sum _{l\in R_i} \exp \left\{ x_l^T\beta \right\} }. $$

Now \(L(\beta )\) is formed by taking the product of these terms over all failure times, giving a ‘likelihood’ which is conditional on the event times; it is usually known as the partial likelihood (also referred to as a conditional or marginal likelihood). Note that the actual event times \(t_{(i)}\) are not used in Eq. 6.4, just their ordering together with the values of the covariates. This relates to the fact that we do not have any knowledge or assumptions about \(h_0(t)\). Large sample theory is available for \(L(\beta )\), allowing estimation and hypothesis testing similarly to standard maximum likelihood methods.
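The following sketch maximises the partial likelihood of Eq. 6.4 directly, for hypothetical data with a single covariate and assuming no tied failure times; dedicated survival analysis software would normally be used instead.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data: times, event indicators (1 = failure, 0 = right-censored),
# and a single covariate per item.
times = np.array([2.0, 3.1, 4.5, 5.0, 6.2, 7.7])
event = np.array([1, 0, 1, 1, 0, 1])
x = np.array([[0.5], [1.2], [0.3], [2.0], [0.8], [1.5]])

def neg_log_partial_lik(beta):
    """Negative log of the partial likelihood (Eq. 6.4), assuming no tied failure times."""
    eta = x @ beta  # linear predictors x^T beta
    total = 0.0
    for i in np.where(event == 1)[0]:
        risk_set = times >= times[i]  # items still functioning just prior to t_(i)
        total += eta[i] - np.log(np.sum(np.exp(eta[risk_set])))
    return -total

fit = minimize(neg_log_partial_lik, x0=np.zeros(x.shape[1]), method="Nelder-Mead")
print(fit.x)  # estimate of beta
```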

Next, one must consider estimation of the survival function. Once the estimate of \(\beta \), denoted by \(\hat{\beta }\), has been derived, it is possible to obtain a nonparametric estimate of the baseline survival function. Let

$$ \hat{S}_0(t) = \prod _{j: t_{(j)}\le t} \hat{\alpha }_j, $$

where the \(\hat{\alpha }_j\)’s are derived via

$$ \alpha _j^{\lambda _{(j)}} = 1 - \frac{\lambda _{(j)}}{\sum _{l\in R_j} \lambda _l}, $$

with

$$ \lambda _l = \exp \left\{ x_l^T\beta \right\} $$

for item l, where \(\lambda _{(j)}\) denotes the value for the item failing at \(t_{(j)}\),

and taking \(\beta =\hat{\beta }\). This actually gives the maximum likelihood estimate for the survival function, under the assumption that \(\beta \) is indeed the given estimate \(\hat{\beta }\).
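Since there is a single failure at each \(t_{(j)}\), the displayed equation can be solved explicitly for \(\alpha _j\). The following sketch, continuing the hypothetical example above, implements the resulting estimator; the function and variable names are illustrative.

```python
import numpy as np

def baseline_survival(times, event, x, beta_hat):
    """Nonparametric baseline survival estimate for the linear PH model, assuming no
    tied failure times so that alpha_j = (1 - lambda_(j) / sum_{l in R_j} lambda_l)^(1/lambda_(j))."""
    lam = np.exp(x @ beta_hat)                        # lambda_l = exp{x_l^T beta_hat}
    fail_idx = np.where(event == 1)[0]
    fail_idx = fail_idx[np.argsort(times[fail_idx])]  # order the distinct failure times
    t_fail, alpha = [], []
    for j in fail_idx:
        at_risk = times >= times[j]
        a_j = (1.0 - lam[j] / lam[at_risk].sum()) ** (1.0 / lam[j])
        t_fail.append(times[j])
        alpha.append(a_j)
    def S0(t):
        # product of alpha_j over failure times t_(j) <= t
        return np.prod([a for tj, a in zip(t_fail, alpha) if tj <= t])
    return S0

# Continuing the hypothetical partial likelihood example above:
# S0 = baseline_survival(times, event, x, fit.x); print(S0(5.0))
```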

6.5 Stochastic Processes in Reliability—Models and Inference

Suppose we have a system which fails at certain times, and during the period of observation there may be actions which affect its failure behaviour, e.g. minimal repairs to allow the system to continue its function, replacement of some components, or other improvements of the system. Let the random quantities \(T_1<T_2<T_3<\ldots \) be the times of failure of the system, and let \(X_i=T_i-T_{i-1}\) (with \(T_0=0\)) be the time between failures \(i-1\) and i. These \(X_i\), and in particular trends in them, are often of main interest in the analysis of system failures, e.g. to discover whether or not a system is becoming more or less prone to fail over time. Hence, the major concern is often detection and estimation of trends in the \(X_i\). Therefore, we cannot just assume these \(X_i\) to be (conditionally) independent and identically distributed (iid), as is often assumed for standard statistical inference on such random quantities. Instead, we need to consider the process in more detail. A suitable characteristic for such a process is the so-called ‘rate of occurrence of failure’ (ROCOF). Let N(t) be the number of failures in the period (0, t], then the ROCOF is

$$ v(t) = \frac{d}{dt} E[N(t)]. $$

An increasing (decreasing) ROCOF models a system that gets worse (better) over time. Of course, all sorts of combinations can also be modelled, e.g. first a period of decreasing ROCOF, followed by increasing ROCOF, to model early failures after which a system improves, followed by a period in which the system wears out. Note that the ROCOF is not the same as the hazard rate (the definitions are clearly different!), although intuitively they might be similar. If we consider a standard Poisson process, with iid times between failures being Exponentially distributed, then the ROCOF and hazard rate happen to be identical.

An estimator for v(t) can be derived by defining a partition of the time period of interest, counting the number of failures in each interval of this partition, and dividing each count by the length of the corresponding interval. However, it is more appealing to use likelihood theory for statistical inference, which we explain next for nonhomogeneous Poisson processes (NHPPs), for which the ROCOF is a central characteristic, often used explicitly to define such processes.
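A minimal sketch of this crude estimator, for hypothetical failure times of a repairable system, is as follows.

```python
import numpy as np

def rocof_estimate(failure_times, bin_edges):
    """Crude nonparametric ROCOF estimate: count the failures in each interval of the
    partition and divide by the interval length, giving failures per unit time."""
    counts, _ = np.histogram(failure_times, bins=bin_edges)
    widths = np.diff(bin_edges)
    return counts / widths

# Hypothetical failure times of a repairable system observed over (0, 100]:
t_fail = np.array([8.0, 21.0, 29.0, 44.0, 51.0, 57.0, 63.0, 70.0, 76.0, 81.0, 85.0, 92.0, 96.0])
print(rocof_estimate(t_fail, bin_edges=np.linspace(0.0, 100.0, 6)))
# An increasing sequence of estimates suggests a system becoming more prone to fail.
```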

NHPPs are relatively simple models that can describe many reliability scenarios, and for which likelihood-based statistical methodology is well developed and easy to apply. The crucial assumption in these models is that the numbers of failures in distinct time intervals are independent, given the process characteristics. An NHPP with ROCOF v(t) is most easily defined by the property that the number of failures in an interval \((t_1,t_2]\) is a Poisson distributed random quantity, with mean

$$ m(t_1,t_2)=\int _{t_1}^{t_2} v(t)dt. $$

This implies that the probability of 0 failures in the interval \((t_1,t_2]\) equals \(\exp \{-m(t_1,t_2)\}\), and the probability of exactly 1 failure in this interval equals \(m(t_1,t_2) \exp \{-m(t_1,t_2)\}\). Of course, if v(t) is constant we have the standard (homogeneous) Poisson process. For statistical inference, we wish to find the likelihood function corresponding to an NHPP model, given failure data of a system. Suppose we have observed the system over the time period [0, r], and have observed failures at times \(t_1<t_2<\ldots <t_n\le r\), assuming there were no tied observations (so only a single failure at each failure time; else things get slightly more complicated). The likelihood function is derived by a limiting argument similar to the one that leads to PDFs appearing in the likelihood for iid data. Let \(\delta t_i>0\), for \(i=1,\ldots ,n\), be very small. The observed process can then be described as consisting of: 0 failures in \((0,t_1)\), and 1 failure in \([t_1,t_1+\delta t_1)\), and 0 failures in \([t_1+\delta t_1,t_2)\), etc., until no failures in \([t_n+\delta t_n,r]\) (this last part is simply deleted if \(r=t_n\), that is, when observation of the process ends at the moment of the n-th failure). By the independence of failure counts in disjoint intervals, the corresponding likelihood is the product of the probabilities of these individual events, so

$$\begin{aligned} &\exp \{-m(0,t_1)\} \times m(t_1,t_1+\delta t_1)\exp \{-m(t_1,t_1+\delta t_1)\} \times \exp \{-m(t_1+\delta t_1,t_2)\} \times \ldots \\ &\qquad \ldots \times \exp \{-m(t_n+\delta t_n,r)\} \\ &= \left\{ \prod _{i=1}^n \left[ \int _{t_i}^{t_i+\delta t_i} v(t)\,dt \right] \right\} \times \exp \left[ - \int _0^r v(t)\,dt \right] . \end{aligned}$$

Now, use that for very small \(\delta t_i\), we have that

$$ \int _{t_i}^{t_i + \delta t_i} v(t)dt \approx v(t_i)\delta t_i. $$

Now divide through by \(\prod _{i=1}^n \delta t_i\), and let all \(\delta t_i \downarrow 0\) (this is exactly the same limiting argument that leads to the PDFs appearing in the likelihood function for iid data). This gives the likelihood function, for this model, based on these n failure times and observation over the period [0, r]:

$$ L = \left\{ \prod _{i=1}^n v(t_i) \right\} \exp \left[ - \int _0^r v(t)dt \right] . $$

For optimisation, it is easier to use the log-likelihood function, which is also needed for related statistical inference, and which is equal to:

$$ l = \sum _{i=1}^n \ln v(t_i) - \int _0^r v(t)dt. $$

It is possible to work with this likelihood non-parametrically, but often one assumes a parametric form for the ROCOF, making maximum likelihood estimation again conceptually straightforward (although it normally requires numerical optimisation). Two simple, often used parametric ROCOFs are

$$ v_1(t) = \exp (\beta _0 + \beta _1 t) $$

and

$$ v_2(t) = \gamma \eta t^{\eta -1}, $$

with \(\gamma , \eta >0\).
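As an illustration, the sketch below maximises the NHPP log-likelihood numerically for the power-law ROCOF \(v_2(t)\) and hypothetical failure data, and compares the result with the closed-form maximisers that are available for this particular form.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical failure times of a repairable system, observed over (0, r]:
t = np.array([5.2, 11.0, 17.8, 23.1, 27.5, 31.0, 34.2, 37.0, 39.5, 41.8])
r = 45.0
n = len(t)

def neg_log_lik(params):
    """Negative log-likelihood l = sum ln v(t_i) - integral_0^r v(t) dt for the power-law
    ROCOF v_2(t) = gamma * eta * t**(eta - 1) (log-parameters keep both positive)."""
    gamma, eta = np.exp(params)
    return -(n * np.log(gamma * eta) + (eta - 1) * np.sum(np.log(t)) - gamma * r ** eta)

fit = minimize(neg_log_lik, x0=np.log([0.1, 1.0]), method="Nelder-Mead")
gamma_hat, eta_hat = np.exp(fit.x)

# For this particular ROCOF the maximisation can also be done in closed form:
eta_cf = n / np.sum(np.log(r / t))
gamma_cf = n / r ** eta_cf
print(gamma_hat, eta_hat, gamma_cf, eta_cf)
```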

Many models that have been suggested for software reliability, during roughly the last three decades, are NHPPs which model the software testing process as a fault counting process. A famous model was proposed by Jelinski and Moranda in 1972, based on the following assumptions: (1) the software contains an unknown number of bugs, N; (2) at each failure, one bug is detected and corrected; (3) the ROCOF is proportional to the number of bugs present. So, they use an NHPP with failure times \(T_i, i=1,\ldots ,N\) and \(T_0=0\), defined by

$$ v(t)= (N-i+1)\lambda , \; \; \text{ for } t\in [T_{i-1},T_i), $$

for some constant \(\lambda \). Both N and \(\lambda \) are considered unknown and are estimated from data, where, of course, inference for N, or in particular for the number of remaining bugs, tends to be of most interest. Many authors have contributed to such theory by changing some of the model assumptions. For example, non-perfect repair of bugs has been considered, and even the possibility of such repair introducing new bugs (possibly a random number); for this last situation so-called ‘birth-death processes’ can be used. A non-constant \(\lambda \) has also been considered, e.g. with the idea that some bugs may tend to show up earlier than others. Bayesian methods for such models, and even software reliability models more naturally embedded in Bayesian theory, have also been suggested and studied. However, although there is an enormous amount of literature in this area, as indeed the mathematical opportunities appear to have no limit here, the practical relevance of such models seems to be rather limited and few interesting applications have been reported in software reliability. Recently, the important topic of testing the reliability of systems including software has received increasing attention, which is much needed to ensure reliable systems.
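Returning to the Jelinski–Moranda model above, maximum likelihood estimation of N and \(\lambda \) can be sketched as follows, using hypothetical inter-failure times and a simple profile-likelihood search over integer values of N; this is an illustrative sketch only, and the estimate of N can be poorly determined (or even unbounded) for some data sets.

```python
import numpy as np

# Hypothetical inter-failure times x_i observed during software testing:
x = np.array([7., 11., 8., 10., 15., 22., 20., 25., 28., 35.])
n = len(x)

def profile_log_lik(N):
    """Log-likelihood of the Jelinski-Moranda model, with lambda profiled out:
    for fixed N, the MLE is lambda_hat(N) = n / sum_i (N - i + 1) * x_i."""
    weights = N - np.arange(1, n + 1) + 1          # (N - i + 1) for i = 1, ..., n
    lam_hat = n / np.sum(weights * x)
    return np.sum(np.log(weights * lam_hat) - weights * lam_hat * x)

# Profile over a grid of candidate values for the initial number of bugs, N >= n.
candidates = np.arange(n, 200)
N_hat = candidates[np.argmax([profile_log_lik(N) for N in candidates])]
print(N_hat, N_hat - n)  # estimated initial and remaining number of bugs
```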

Useful sources

P.K. Andersen, O. Borgan, R.D. Gill and N. Keiding, Statistical Models Based on Counting Processes (Springer, 1993).

T. Aven and U. Jensen, Statistical Models in Reliability (Springer, 1999).

R.E. Barlow, Engineering Reliability (SIAM, 1998).

R.E. Barlow and F. Proschan, Mathematical Theory of Reliability (Wiley, 1965).

T. Bedford and R. Cooke, Probabilistic Risk Analysis: Foundations and Methods (Cambridge University Press, 2001).

P. Hougaard, Analysis of Multivariate Survival Data (Springer, 2000).

R.S. Kenett, F. Ruggeri and F.W. Faltin (Eds), Analytic Methods in Systems and Software Testing (Wiley, 2018).

J.F. Lawless, Statistical Models and Methods for Lifetime Data (Wiley, 1982).

R.D. Leitch, Reliability Analysis for Engineers (Oxford University Press, 1995).

H.F. Martz and R.A. Waller, Bayesian Reliability Analysis (Wiley, 1982).

W.Q. Meeker and L.A. Escobar, Statistical Methods for Reliability Data (Wiley, 1998).

N.D. Singpurwalla and S.P. Wilson, Statistical Methods in Software Engineering: Reliability and Risk (Springer, 1999).

Leading international journals in this field include Reliability Engineering and System Safety, IEEE Transactions on Reliability, Journal of Risk and Reliability, Quality and Reliability Engineering International. Statistical methods for reliability data are presented in a wide variety of theoretical and applied Statistics journals, theory and methods for decision support are often published in the Operations Research literature.