1 Introduction

Equipment reliability is subject to degradation and to influencing factors, which are referred to as covariates. The evaluation of system reliability has gained growing interest for quantifying the risk of degradation and failure, predicting performance, and making economic decisions in large-scale energy, transportation, and telecommunication infrastructures. Consider the example of a 750 kV electric power transformer with a typical capital cost of $2 million [1]: the failure of such equipment could cause extensive outages and blackouts and raise customer interruption costs. It is therefore imperative to monitor the equipment health condition and evaluate its reliability to avoid catastrophic consequences.

We consider two reliability indices: the survival function (SF) and the mean residual life (MRL). SF is the probability that the equipment will survive beyond a specified time. MRL summarizes the residual life distribution of the equipment in a single overall estimate. Several failure rate models are considered for calculating the two reliability indices. However, a constant failure rate model is often used in reliability analyses, which can produce erroneous reliability indices [2,3,4].

The failure rate calculation should take both service age and covariates into consideration. The proportional hazard model (PHM) was introduced by Cox in 1972 to shape the hazard rate used in engineering and medicine [5, 6]. In PHM, the hazard rate consists of baseline and link functions. The baseline function offers the basis for hazard rate and the link function quantifies the covariate effect. Sample covariates such as those pertaining to the lifecycle data [7], operation mode [8], vibration [9] and dissolved gas [10], which could be time-dependent, are considered according to the actual system situations. PHM offers advantages when applied to explanatory techniques [11]. Accordingly, the model adopts the information on dissolved gas analysis (DGA) as covariate which affects the failure rates of power transformers.

The MRL and conditional/unconditional SFs are calculated in [12] by a discrete Markov process and PHM for obtaining the additional insight on interactions between time-varying failure rates and reliability indices. Makis and Jardine [13] utilize a full parametric PHM and a time-homogeneous Markov chain to describe failure rates. Accordingly, the optimal expected average cost and replacement time are obtained. Reference [14] evaluates the equipment reliability with imperfect observations. The observations are collected periodically. The failure rate is modeled based on PHM which takes both the age and the health condition into consideration. The same model is adopted to identify the optimal inspection period and the replacement policy [15]. The parameter estimation problem is studied in [16]. A control-limit policy and parameter estimation are proposed in [17] where an optimal replacement policy is obtained to minimize the average replacement costs per unit time. The above models are based on the assumption that the condition information of covariates is inspected at discrete points and every state transition happens only at the end of inspection interval, exactly before the next inspection instant, to make the calculation tractable within every interval. These models are called discrete monitoring and discrete transition (DMDT) models in this paper.

In fact, the state transition can happen at any time and this assumption may not conform to the reality. Reference [18] evaluates the SF policy by applying PHM and the presumption that the state transition is continuous. Reference [19] also assumes the condition monitoring is continuous if the inspection interval is small. These models are called continuous monitoring and continuous transition (CMCT) models in this paper. However, the condition monitoring may not be continuous in practice and the assumption would not be in line with the actual operation.

In practice, the online condition monitoring of power transformers, such as dissolved gas analysis, is discrete (periodical) while the state transition could happen at any time [20, 21]. In this paper we propose the discrete monitoring and continuous transition (DMCT) model for our analyses. Parameter estimation is not addressed in this paper and can be found in [22]. The main contributions of this paper are summarized below:

1) The proposed model is based on more practical assumptions in which the condition is discretely inspected but the state transition is continuous.

2) Service age and DGA information are introduced to customize the failure rate by applying PHM. The state transition of DGA is described by a semi-Markov process.

3) Analytical formulas are derived to evaluate SF and MRL using the given situations. The effectiveness of the proposed formulas is shown in our numerical studies.

2 Model description

2.1 Determine health condition with DGA information

All power transformers generate gases to some extent during normal operation. However, incipient faults or degradation, such as overheating, partial discharge and arcing, lead to abnormal gas generation. A four-level criterion has been developed to classify the health condition of transformers [23] according to the gas concentrations. The gases include H2, CH4, C2H4, C2H6, CO, and CO2. The combined concentration of H2, CH4, C2H4, C2H6 and CO is known as the total dissolved combustible gases concentration (TDGC). Table 1 shows the classification of gas concentration conditions. Condition 1 is the best condition, and the condition gets worse as the number increases. The condition of the power transformer is determined by the worst condition among all the dissolved gases, irrespective of which gas is in that condition.

Table 1 Dissolved gas concentrations for condition classification based on IEEE C57.104

2.2 Failure rate model based on DGA information

The failure rate of power transformer is modelled by PHM and semi-Markov process. The failure rate is expressed as:

$$h(s,Z(s)) = h_{0} (s)\psi (Z(s)) = \frac{{\beta s^{\beta - 1} }}{{\alpha^{\beta } }}e^{\gamma Z(s)}$$
(1)

where \(h_{0} (s)\) is the baseline function to describe basic hazard rate; \(\psi (Z(s))\) is the link function to quantify the effect of covariates. The covariate \(Z(s)\) represents the condition of dissolved gas concentrations at time s. The degradation process is irreversible, which is the most common case that the degradation state cannot improve by itself. Without loss of generality, the gas condition \(Z(s)\) is assumed to fall into finite state space \(\{ 1,2, \ldots ,n\}\) where the condition deteriorates as the state number increases. The analytical formulas are also derived with n conditions. The state n is the worst and absorbing state. Upon a failure or scheduled maintenance, the component is maintained and restored to state 1 and the process is renewed. The model is shown in Fig. 1 which is described as follows:

Fig. 1 Degradation and failure process

1) \(T_{i} ,\;X_{i} ,\;i = 1,2, \ldots ,n\), denote the ith state transition moment and the sojourn time of state i, respectively.

2) \(t_{0}\) and \(t\) are the current time and a future time point, respectively.

3) The transition of health condition is assumed to follow a semi-Markov process. The transition is irreversible and the state increases by one whenever a transition occurs. That is, a pure birth process is considered.
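The pure birth process described in item 3) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the choice of i.i.d. Weibull sojourn times and the values n = 4, a = 11.2838 and b = 2 are assumptions made for the example, not part of the model definition.

```python
import bisect
import itertools
import random


def sample_transition_times(n, a, b, rng):
    """Return [T_1, ..., T_{n-1}] for a pure birth process with n states.

    Sojourn times X_1..X_{n-1} are drawn as i.i.d. Weibull(scale=a, shape=b),
    which is an illustrative assumption. State n is absorbing, so T_n is
    infinite and is left implicit.
    """
    sojourns = [rng.weibullvariate(a, b) for _ in range(n - 1)]
    return list(itertools.accumulate(sojourns))  # T_k = X_1 + ... + X_k


def state_at(t, transition_times):
    """Return the state k with T_{k-1} <= t < T_k, taking T_0 = 0."""
    return bisect.bisect_right(transition_times, t) + 1


rng = random.Random(0)
T = sample_transition_times(n=4, a=11.2838, b=2.0, rng=rng)
print(T, [state_at(t, T) for t in (0.0, 5.0, 20.0, 100.0)])
```

Because the state only ever increases by one, the whole path is determined by the sorted list of transition times, and the state at any time is a single binary search.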

A Markov process is memoryless, and Markov models can lead to serious errors under certain conditions. The health condition transition of a power transformer, however, is affected by its operation history and is therefore not memoryless. In our study, a semi-Markov process is introduced to describe the history-dependent degradation process of power transformers and to evaluate the reliability indices. In our case studies, the Markov and semi-Markov processes are contrasted and some conclusions are drawn. Other stochastic processes could also be introduced to model the condition transition. Based on the proposed model, we plan to compare the performance of different stochastic processes for evaluating the reliability of power transformers in future work.

Let \(Z_{i}\) denote the degradation state between \(T_{i - 1}\) and \(T_{i}\). In a pure birth process \(Z_{i} = i\). Since the state n is an absorbing state, we define \(X_{n} = \infty\) and \(T_{n} = \infty\). Clearly \(X_{i} = T_{i} - T_{i - 1}\) and its distribution is denoted as:

$$G(x_{i} ) = P(T_{i} - T_{i - 1} \le x_{i} |Z_{i} = i,Z_{i + 1} = i + 1)$$
(2)

where \(x_{i}\) is the independent variable in the distribution function of \(X_{i}\). The state sojourn times \(X_{1} ,X_{2} , \ldots ,X_{n}\) are conditionally independent in a semi-Markov process. That is, the Markovian property is satisfied at the transition points rather than over the entire process.

The joint probability density function (PDF) of \(X_{1} ,X_{2} , \ldots ,X_{n - 1}\) is denoted \(g_{x}\) and equals \(g(x_{1} ,x_{2} , \ldots ,x_{n - 1} ) = g_{1} g_{2} \ldots g_{n - 1}\), where \(g_{i} = g(x_{i} )\) is the probability density function of \(X_{i}\).

It should be noted that:

$$T_{k} = \sum\limits_{i = 1}^{k} {X_{i} } = X_{1} + X_{2} + \cdots + X_{k}$$
(3)

and

$$Z(t) = k \Leftrightarrow T_{k - 1} \le t < T_{k}$$
(4)

where \(Z(t)\) is the gas concentration state at time t. The conditional survival function \(R(t|t_{0} )\) is given by:

$$P(T \ge t|T > t_{0} ,Z(s),0 \le s \le t_{0} ) = \exp \left( { - \int_{{t_{0} }}^{t} {h(s,Z(s)){\text{d}}s} } \right)$$
(5)

where T is the failure time. Given \(t_{0}\) and \(Z(t_{0} ) = j\), the component or the system may stay at an arbitrary state from j to n at any future time t. For \(Z(t_{0} ) = j,\;Z(t) = k\), \(1 \le j \le k \le n\), we have:

$$\begin{aligned} R_{jk} (t|t_{0} ) = P(T > t|T > t_{0} ,Z(t_{0} ) = j,Z(t) = k) = \hfill \\ \exp \left( { - \int_{{t_{0} }}^{{T_{j} }} h (s,Z_{j} ){\text{d}}s - \int_{{T_{j} }}^{{T_{j + 1} }} h (s,Z_{j + 1} ){\text{d}}s - \cdots - \int_{{T_{k - 1} }}^{t} h (s,Z_{k} ){\text{d}}s} \right) \hfill \\ \end{aligned}$$
(6)
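Because \(Z(s)\) is constant between transitions, each integral in (6) has a closed form under the Weibull baseline and exponential link of (1). As a worked sketch, assuming the transition times \(T_{j} , \ldots ,T_{k - 1}\) are fixed, the integral over an interval \([u,v]\) spent in state i is:

$$\int_{u}^{v} h(s,i){\text{d}}s = e^{\gamma i} \frac{{v^{\beta } - u^{\beta } }}{{\alpha^{\beta } }}$$

Substituting into (6) for \(k > j\) gives:

$$R_{jk} (t|t_{0} ) = \exp \left\{ { - \frac{1}{{\alpha^{\beta } }}\left[ {e^{\gamma j} \left( {T_{j}^{\beta } - t_{0}^{\beta } } \right) + \sum\limits_{i = j + 1}^{k - 1} {e^{\gamma i} \left( {T_{i}^{\beta } - T_{i - 1}^{\beta } } \right)} + e^{\gamma k} \left( {t^{\beta } - T_{k - 1}^{\beta } } \right)} \right]} \right\}$$

and for \(k = j\) it reduces to \(R_{jj} (t|t_{0} ) = \exp \left\{ { - e^{\gamma j} (t^{\beta } - t_{0}^{\beta } )/\alpha^{\beta } } \right\}\). The sum is empty when \(k = j + 1\).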

If we view \((X_{1} ,X_{2} , \ldots ,X_{n - 1} )\) as an (n-1)-dimensional space, (6) is satisfied only in the sub-region \(D_{jk}\):

$$D_{jk} = \left\{ {(X_{1} ,X_{2} , \ldots ,X_{n - 1} )\left| {\begin{array}{*{20}c} {x_{1} + \ldots + x_{j - 1} \le t_{0} } \\ {x_{1} + \cdots + x_{j} > t_{0} } \\ {x_{1} + \ldots + x_{k - 1} \le t} \\ {x_{1} + \cdots + x_{k} > t} \\ {x_{i} \ge 0,1 \le i \le n - 1} \\ \end{array} } \right. \, } \right\}$$
(7)

In this paper, \(x_{i}\) is assumed to be larger than zero. For instance, when \(j = 1,\;k = 2\) in (7), \(D_{12} = \{ (X_{1} ,X_{2} , \ldots ,X_{n - 1} )|x_{1} > t_{0} ,\;x_{1} \le t < x_{1} + x_{2} \}\), which means the component is in state 1 at \(t_{0}\) and in state 2 at \(t\). In the region \(D_{12}\), \(R_{12} (t|t_{0} ) = \exp ( - \int_{{t_{0} }}^{{T_{1} }} h (s,Z_{1} ){\text{d}}s - \int_{{T_{1} }}^{t} h (s,Z_{2} ){\text{d}}s)\). Thus, when the state at any future time t varies from 1 to n, \(R(t|t_{0} )\) can be viewed as a piecewise function in the n-dimensional space \(\{ R(t|t_{0} ),X_{1} , \ldots ,X_{n - 1} \}\). Accordingly, \(R_{jk} (t|t_{0} )\) represents \(R(t|t_{0} )\) in the sub-region \(D_{jk}\). The boundary of each sub-region is determined by \(t_{0}\) and \(t\), and by the corresponding degradation states \(Z(t_{0} )\) and \(Z(t)\).

Generally, MRL is calculated by \(M(t_{0} ) = \int_{{t_{0} }}^{\infty } R (t|t_{0} ){\text{d}}t\). Since \(R(t|t_{0} )\) is a piecewise function from \(t_{0}\) to infinity, the conditional MRL, given \(t_{0}\) and \(Z(t_{0} ) = j\), can be expressed as:

$$\begin{aligned} m_{j} (t_{0} ) &= \int_{{t_{0} }}^{\infty } R (t|t_{0} ){\text{d}}t \\ &= \int_{{t_{0} }}^{{T_{j} }} {R_{jj} } (t|t_{0} ){\text{d}}t + \int_{{T_{j} }}^{{T_{j + 1} }} R_{j,j + 1} (t|t_{0} ){\text{d}}t + \ldots + \int_{{T_{n - 1} }}^{\infty } R_{jn} (t|t_{0} ){\text{d}}t \\ &= \sum\limits_{i = j}^{n} {M_{ji} } (t_{0} ) = M_{jj} (t_{0} ) + M_{j,j + 1} (t_{0} ) + \ldots + M_{jn} (t_{0} ) \\ \end{aligned}$$
(8)

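where \(M_{jj} (t_{0} ) = \int_{{t_{0} }}^{{T_{j} }} {R_{jj} } (t|t_{0} ){\text{d}}t\) and \(M_{ji} (t_{0} ) = \int_{{T_{i - 1} }}^{{T_{i} }} {R_{ji} } (t|t_{0} ){\text{d}}t\) for \(j < i \le n\), with \(T_{n} = \infty\).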
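For one fixed degradation path, (6) and (8) can be evaluated directly. The following is a minimal sketch under stated assumptions: the function names, the trapezoidal rule and the truncation horizon are choices made for this example, and the hazard uses the Weibull baseline with the link \(e^{\gamma Z}\) from (1). Averaging these path-wise values over the sojourn-time distribution is what the analytical formulas in the next section address.

```python
import math


def cumulative_hazard(u, v, z, alpha, beta, gamma):
    """Integral of h(s, z) over [u, v] for the baseline and link of (1)."""
    return math.exp(gamma * z) * ((v / alpha) ** beta - (u / alpha) ** beta)


def survival_given_path(t0, t, transition_times, alpha, beta, gamma):
    """R(t | t0) from (6) when the transition times T_1 < T_2 < ... are known."""
    total, left, state = 0.0, t0, 1
    for T_i in transition_times:
        if T_i <= t0:      # transition already happened before t0
            state += 1
            continue
        if T_i >= t:       # no further transition before t
            break
        total += cumulative_hazard(left, T_i, state, alpha, beta, gamma)
        left, state = T_i, state + 1
    total += cumulative_hazard(left, t, state, alpha, beta, gamma)
    return math.exp(-total)


def mrl_given_path(t0, transition_times, alpha, beta, gamma,
                   horizon=200.0, steps=20000):
    """M(t0) from (8): integral of R(t | t0) over [t0, t0 + horizon].

    The infinite upper limit is truncated at t0 + horizon (an assumption of
    this sketch) and the integral uses the trapezoidal rule.
    """
    dt = horizon / steps
    values = [
        survival_given_path(t0, t0 + i * dt, transition_times, alpha, beta, gamma)
        for i in range(steps + 1)
    ]
    return dt * (sum(values) - 0.5 * (values[0] + values[-1]))


# Example path: state 1 until time 6, state 2 until time 15, then state 3.
print(survival_given_path(4.0, 9.0, [6.0, 15.0], 10.0, 2.0, 1.0))
print(mrl_given_path(4.0, [6.0, 15.0], 10.0, 2.0, 1.0))
```

With \(\gamma = 0\) and no transitions the hazard reduces to a plain Weibull hazard, so the MRL at \(t_{0} = 0\) should equal the Weibull mean \(\alpha \Gamma (1 + 1/\beta )\), which gives a convenient sanity check.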

With (5), (6) and (8) in place, one barrier remains in evaluating SF and MRL effectively: explicit analytical expressions are needed.

3 Evaluating SF and MRL

The DGA is inspected at discrete points, and the inspection instants are equally spaced. Figure 2 shows the inspection points, state transition points and time points, where \(\Delta_{l}\) means the lth inspection point, S1 and S2 represent two different situations, respectively.

Fig. 2 Inspection points, transition points and time points

The formulas are presented for two situations: in S1, \(t_{0}\) is exactly an inspection point, and in S2, \(t_{0}\) lies between two inspection points. The position of \(t_{0}\) has a large influence on the expression and the shape of SF and MRL, and it also has practical significance. The case where \(t_{0}\) is exactly an inspection point corresponds to the situation where the DGA condition information has just been collected from an on-line or off-line test, while the case where \(t_{0}\) is located between two successive inspection points corresponds to a situation where the condition inspection by either an on-line or off-line test has been done before and the next inspection point has not been reached. Both S1 and S2 situations could occur in practice.

Note that SF and MRL are functions of the random variables \(X_{1} ,X_{2} , \ldots ,X_{n - 1}\) and are evaluated as multiple integrals over the variable space. The known conditions constitute the composite constraint surface of the integral region. In this section, the multiple integral is transformed into a repeated integral to derive the formulas of SF and MRL.

3.1 Survival function

The survival function is a piecewise function of the random variables, and it is evaluated by taking its expectation.

1) A new component

For a new component, we have \(t_{0} = 0,\;Z(t_{0} ) = 1\). According to the Law of Total Probability, \(R(t)\) can be expressed as:

$$\begin{aligned} R(t) = \sum\limits_{k = 1}^{n} P (T > t,Z(t) = k|Z(t_{0} ) = 1) \\ = \sum\limits_{k = 1}^{n} {\int_{{D_{1k} }} {R_{1k} } } (t)g(x_{1} , \ldots ,x_{k} ){\text{d}}x_{1} \ldots {\text{d}}x_{k} \\ \quad= r_{11} + r_{12} + \ldots + r_{1n} \\ \end{aligned}$$
(9)

where \(r_{1k} = \int_{{D_{1k} }} {R_{1k} } (t)g(x_{1} , \ldots ,x_{n - 1} ){\text{d}}x_{k} \ldots {\text{d}}x_{1}\), \(k = 1,2, \ldots ,n\), can be calculated as follows:

$$r_{1k} = \left\{ {\begin{array}{*{20}l} {\int_{0}^{t} {\int_{0}^{{t - x_{1} }} } \ldots \int_{0}^{{t - T_{k - 2} }} \int_{{t - T_{k - 1} }}^{\infty } {R_{1k} } (t)g_{x} {\text{d}}x_{k} \ldots {\text{d}}x_{2} {\text{d}}x_{1} } \hfill \\ {\qquad \qquad \qquad \qquad \qquad \;\;k = 1,2, \ldots ,n - 1} \hfill \\ {\int_{0}^{t} {\int_{0}^{{t - x_{1} }} } \ldots \int_{0}^{{t - T_{n - 2} }} \int_{0}^{{t - T_{n - 1} }} {R_{1n} } (t)g_{x} {\text{d}}x_{n - 1} \ldots {\text{d}}x_{1} } \hfill \\ {\qquad \qquad \qquad \qquad \;\;\;\;\;\;\;\;\;k = n} \hfill \\ \end{array} } \right.$$
(10)

The proof of (10) is presented in Appendix A.
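As an illustration of (9) and (10), the repeated integrals can be evaluated numerically for the smallest non-trivial case, \(n = 3\). The sketch below is an assumption-laden example rather than the procedure used for the reported results: it takes i.i.d. Weibull sojourn times and a link of the form \(e^{\gamma (Z - 1)}\) with the parameter values used in the numerical section, applies a simple midpoint rule, and compares the result with a plain Monte Carlo average over sampled paths. All function names are hypothetical.

```python
import math
import random

ALPHA, BETA, GAMMA = 10.0, 2.0, 1.0   # hazard parameters (assumed values)
A, B = 11.2838, 2.0                   # Weibull sojourn-time scale and shape


def H(u, v, z):
    """Cumulative hazard over [u, v] in state z, with link exp(GAMMA * (z - 1))."""
    return math.exp(GAMMA * (z - 1)) * ((v / ALPHA) ** BETA - (u / ALPHA) ** BETA)


def g(x):
    """Weibull PDF of a sojourn time."""
    return (B / A) * (x / A) ** (B - 1) * math.exp(-((x / A) ** B))


def S(x):
    """Weibull survival function P(X > x) of a sojourn time."""
    return math.exp(-((x / A) ** B))


def survival_new(t, m=200):
    """R(t) = r_11 + r_12 + r_13 from (9)-(10) for n = 3, by midpoint quadrature."""
    r11 = math.exp(-H(0, t, 1)) * S(t)   # x1 > t: still in state 1 at t
    r12 = r13 = 0.0
    d1 = t / m
    for i in range(m):
        x1 = (i + 0.5) * d1
        head = math.exp(-H(0, x1, 1)) * g(x1)
        # x2 > t - x1: state 2 at t; the inner integral over x2 is S(t - x1)
        r12 += head * math.exp(-H(x1, t, 2)) * S(t - x1) * d1
        # x2 <= t - x1: absorbing state 3 at t
        d2 = (t - x1) / m
        inner = 0.0
        for k in range(m):
            T2 = x1 + (k + 0.5) * d2
            inner += math.exp(-H(x1, T2, 2) - H(T2, t, 3)) * g(T2 - x1) * d2
        r13 += head * inner * d1
    return r11 + r12 + r13


def survival_new_mc(t, samples=20000, seed=0):
    """Monte Carlo average of the path-wise survival (6) over sampled sojourn times."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        T1 = rng.weibullvariate(A, B)
        T2 = T1 + rng.weibullvariate(A, B)
        if T1 >= t:
            total += math.exp(-H(0, t, 1))
        elif T2 >= t:
            total += math.exp(-H(0, T1, 1) - H(T1, t, 2))
        else:
            total += math.exp(-H(0, T1, 1) - H(T1, T2, 2) - H(T2, t, 3))
    return total / samples


print(survival_new(5.0), survival_new_mc(5.0))
```

For \(k = 1\) and \(k = 2\) the innermost integral over the last sojourn time does not involve \(R_{1k}\), so it collapses to the sojourn-time survival function; only \(r_{13}\) needs a genuine double integral here.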

2) An old component

An old component has survived and suffered from degradation by the time \(t_{0}\). Assuming the last inspection instant is \(\Delta_{m}\) and \(Z(\Delta_{m} ) = j\), we have \(T_{j} > \Delta_{m}\). State transition points are renewal points and have the Markovian property [3]; thus the state transition after \(T_{j - 1}\) is independent of the earlier history and can happen at any time.

The calculation of the survival function falls into two cases, designated old1 (situation S1) and old2 (situation S2).

Situation S1: \(t_{0}\) is exactly the inspection instant \(\Delta_{m}\). In this case, the known conditions are: ① \(T > t_{0}\); ② \(Z(t_{0} ) = j\); ③ \(X_{1} ,X_{2} , \ldots ,X_{j - 1}\). According to the Law of Total Probability, \(R(t|t_{0} )\) is equal to:

$$\begin{aligned} R(t|t_{0} ) &= \sum\limits_{k = j}^{n} P (T > t,Z(t) = k|T > t_{0} ,Z(t_{0} ) = j) \hfill \\ \;\;\;\;\;\;\;\;\;\;\; &= \sum\limits_{k = j}^{n} {\int_{{D_{jk} }} {R_{jk} } } (t|t_{0} )g_{{x|x_{j} }} {\text{d}}x_{j} \ldots {\text{d}}x_{n - 1} \hfill \\ \end{aligned}$$
(11)

Let \(r_{jk}^{\text{old1}}\) denote \(\int_{{D_{jk} }} {R_{jk} } (t|t_{0} )g_{{x|x_{j} }} {\text{d}}x_{j} \ldots {\text{d}}x_{n - 1}\) and \(r_{jk}^{\text{old1}}\) can be calculated by (12). The proof is shown in Appendix B.

$$r_{jk}^{\text{old1}} = \left\{ {\begin{array}{ll}\int_{{t_{0} - T_{j - 1} }}^{{t - T_{j - 1} }} \int_{0}^{{t - T_{j -1} - x_{j} }} \ldots \int_{0}^{{t - x_{1} - \ldots - x_{k - 2} }}\int_{{t - x_{1} - \ldots - x_{k - 1} }}^{\infty } {R_{jk} } \left({t|t_{0} } \right)g_{{x|x_{j} }} {\text{d}}x_{k} \ldots{\text{d}}x_{j}\hfill & \quad {j \le k < n} \hfill \\\int_{{t_{0} - T_{j - 1} }}^{{t - T_{j - 1} }} \int_{0}^{{t - T_{j -1} - x_{j} }} \ldots \int_{0}^{{t - x_{1} - \ldots - x_{n - 2} }}\int_{0}^{{t - x_{1} - \ldots - x_{n - 1} }} {R_{jn} \left( {t|t_{0}} \right)} g_{{x|x_{j} }} {\text{d}}x_{n - 1} \ldots{\text{d}}x_{j} \hfill & \quad {k = n} \hfill \\ \end{array}}\right.$$
(12)

Situation S2: \(t_{0}\) is between two successive inspection instants \(\Delta_{m}\) and \(\Delta_{m + 1}\).

In this case, \(Z(t_{0} )\) can take any value from \(Z(\Delta_{m} )\) to n since \(t_{0}\) is not an inspection point. The known conditions are: ① \(T > t_{0}\); ② \(Z(\Delta_{m} ) = j\); ③ \(X_{1} ,X_{2} , \ldots ,X_{j - 1}\). \(R(t|t_{0} )\) is given by:

$$\begin{aligned} R(t|t_{0} ) &\,= P(T > t|T > t_{0} ,Z(\Delta_{m} ) = j) \hfill \\ &= \sum\limits_{i = j}^{n} {\sum\limits_{k = i}^{n} P } (T > t,Z(t) = k,Z(t_{0} ) = i|T > t_{0} ,\quad Z(\Delta_{m} ) = j) \hfill \\ &= \sum\limits_{i = j}^{n} {\sum\limits_{k = i}^{n} {\int_{{D_{ik} }} {R_{ik} } } } (t|t_{0} )g_{{x|x_{j} }} {\text{d}}x_{j} \ldots {\text{d}}x_{n - 1} \hfill \\ \end{aligned}$$
(13)

Let \(r_{j,ik}^{\text{old2}}\) be \(\int_{{D_{ik} }} {R_{ik} } (t|t_{0} )g_{{x|x_{j} }} {\text{d}}x_{j} \ldots {\text{d}}x_{n - 1}\) which represents the survival probability in the sub-region \(Z(\Delta_{m} ) = j\), \(Z\left( {t_{0} } \right) = i\) and \(Z\left( t \right) = k\). \(r_{j,ik}^{\text{old2}}\) can be evaluated by (14). The proof is shown in Appendix C.

$$r_{j,ik}^{\text{old2}} = \left\{ \begin{aligned} \int_{{\Delta_{m} - T_{j - 1} }}^{{t_{0} - T_{j - 1} }} \int_{0}^{{t_{0} - T_{j - 1} - x_{j} }} \ldots \int_{{t_{0} - x_{1} - \ldots - x_{i - 1} }}^{{t - x_{1} - \ldots - x_{i - 1} }} \ldots \int_{0}^{{t - x_{1} - \ldots - x_{k - 2} }} \int_{{t - x_{1} - \ldots - x_{k - 1} }}^{\infty } \;\;R_{ik} \left( {t|t_{0} } \right)g_{{x|x_{j} }} {\text{d}}x_{k} \ldots {\text{d}}x_{j} \quad i \le k < n \hfill \\ \int_{{\Delta_{m} - T_{j - 1} }}^{{t_{0} - T_{j - 1} }} \int_{0}^{{t_{0} - T_{j - 1} - x_{j} }} \ldots \int_{{t_{0} - x_{1} - \ldots - x_{i - 1} }}^{{t - x_{1} - \ldots - x_{i - 1} }} \ldots \int_{0}^{{t - x_{1} - \ldots - x_{n - 2} }} \int_{0}^{{t - x_{1} - \ldots - x_{n - 1} }} \;R_{ik} \left( {t|t_{0} } \right)g_{{x|x_{j} }} {\text{d}}x_{n - 1} \ldots {\text{d}}x_{j} \quad k = n \hfill \\ \end{aligned} \right.$$
(14)

3.2 Mean residual life

We evaluate MRL for a new component and an old component, respectively. For an old component, MRL is also calculated in two cases according to whether t 0 is the inspection instant.

1) A new component

For a new component, \(t_{0} = 0,\;Z(t_{0} ) = 1\), according to (8) MRL can be expressed as:

$$\begin{aligned} M(t_{0} ) &= E[M(t_{0} |Z(t_{0} ) = 1)] = E[M_{11} + M_{12} + \ldots + M_{1n} ] \\ &= \int_{{t_{0} }}^{\infty } \int_{0}^{\infty } \ldots \int_{0}^{\infty } {(\sum\limits_{i = j}^{n} {M_{ji} } (t_{0} ))} g_{{x|x_{j} }} {\text{d}}x_{n - 1} \ldots {\text{d}}x_{j} \\ &= \int_{0}^{\infty } {M_{11} } (t_{0} )g_{1} {\text{d}}x_{1} + \int_{0}^{\infty } {\int_{0}^{\infty } {M_{12} } } (t_{0} )g_{1} g_{2} {\text{d}}x_{2} {\text{d}}x_{1} \\ \;\; &\quad+ \int_{0}^{\infty } \int_{0}^{\infty } \ldots \int_{0}^{\infty } {M_{1n} } (t_{0} )g_{x} {\text{d}}x_{n - 1} \ldots {\text{d}}x_{2} {\text{d}}x_{1} \\ \end{aligned}$$
(15)

2) An old component

For an old component, \(T > t_{0}\) and \(Z(\Delta_{m} ) = j\). Whether the health condition \(Z(t_{0} )\) is known depends on whether \(t_{0}\) is an inspection point.

Situation S1: \(t_{0}\) is exactly the inspection instant.

In this case, \(t_{0} = \Delta_{m}\) and \(Z(t_{0} ) = j\), and MRL is given by:

$$\begin{aligned} M(t_{0} ) &= E[M(t_{0} |Z(t_{0} ) = j)] = E[M_{jj} + M_{j,j + 1} + \ldots + M_{jn} ] \hfill \\ &= \int_{{t_{0} }}^{\infty } M_{jj} (t_{0} )g_{{j|x_{j} }} {\text{d}}x_{j} + \int_{{t_{0} }}^{\infty } \int_{0}^{\infty } M_{j,j + 1} (t_{0} )g_{{j|x_{j} }} g_{j + 1} {\text{d}}x_{j + 1} {\text{d}}x_{j} \hfill \\ \;\; &\quad+ \int_{{t_{0} }}^{\infty } \int_{0}^{\infty } \ldots \int_{0}^{\infty } {M_{jn} } (t_{0} )g_{{x|x_{j} }} {\text{d}}x_{n - 1} \ldots {\text{d}}x_{j + 1} {\text{d}}x_{j} \hfill \\ \end{aligned}$$
(16)

Situation S2: \(t_{0}\) is between two successive inspection instants \(\Delta_{m}\) and \(\Delta_{m + 1}\).

Given \(t_{0}\) and \(Z(\Delta_{m} ) = j\), MRL is calculated by:

$$M(t_{0} ) = \sum\limits_{i = j}^{n} \int_{{\begin{array}{*{20}l} {Z(\Delta_{m} ) = j} \hfill \\ {Z(t_{0} ) = i} \hfill \\ \end{array} }} {M(t_{0} |Z(t_{0} ) = i)g_{{x|x_{j} }} {\text{d}}x_{n - 1} \ldots {\text{d}}x_{j + 1} {\text{d}}x_{j} }$$
(17)

The integral region of \(m_{i} (t_{0} )\) is equal to:

$$\left\{ {(X_{1} ,X_{2} , \ldots ,X_{j} )\left| {\begin{array}{*{20}c} {x_{j} > \Delta_{m} - T_{j - 1} } \\ {x_{j} + x_{j + 1} + \ldots + x_{i - 1} < t_{0} - T_{j - 1} } \\ {x_{j} + x_{j + 1} + \ldots + x_{i} > t_{0} - T_{j - 1} } \\ \end{array} } \right.} \right\}$$
(18)

Similar to the proof of (14), project the region onto the lower dimension space repeatedly so that the multiple integral is transformed into repeated integral. The upper and lower limits can be obtained as in (14).

The steps for evaluating SF and MRL are summarized as: ① obtain \(t_{0} ,\;\Delta_{m} ,\;Z(\Delta_{m} ),\;T_{i} ,\;i < Z(\Delta_{m} )\) and determine \(g_{{x|x_{j} }}\) according to historical inspection data; ② evaluate the survival function by (9), (11) and (13); ③ evaluate the mean residual life by (15), (16) and (17).

4 Numerical examples

The parameter estimation is not addressed in this paper. The DGA information and failure rate parameters in [24] are adopted. The first numerical example is to compare the results obtained by developed formulas with those of the Monte Carlo technique. In [25], the same reliability indices, SF and MRL, are evaluated by the Monte Carlo technique. The minimal errors indicate the accuracy of developed formulas.

Assume that the baseline function has a Weibull distribution and the link function follows exponential form:

$$\left\{ \begin{aligned} h_{0} (s) = \tfrac{{\beta s^{\beta - 1} }}{{\alpha^{\beta } }}\;\;\;s \ge 0,\alpha = 10,\beta = 2 \hfill \\ \psi (Z(s)) = e^{{1 \times \left( {Z(s) - 1} \right)}} \hfill \\ \end{aligned} \right.$$
(19)

For simplicity, we assume \(n = 3\). In other words, the health condition is divided into 3 stages \(\{ 1,2,3\}\) and the sojourn times \(X_{1}\) and \(X_{2}\) are s-independent and identically distributed Weibull random variables. The PDF of \(X_{i}\) is given by:

$$g_{i} = g(x_{i} ) = \frac{{bx_{i}^{b - 1} }}{{a^{b} }}\exp \left[ { - \left( {\frac{{x_{i} }}{a}} \right)^{b} } \right]\;\;\;x_{i} > 0,i = 1,2, \ldots ,n$$
(20)

where \(a = 11.2838\) and \(b = 2\), so that \(EX_{i} = 10\).
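The stated mean follows from the standard Weibull moment formulas. As a short check, with \(\Gamma (1.5) = \sqrt \pi /2 \approx 0.88623\):

$$EX_{i} = a\Gamma \left( {1 + \frac{1}{b}} \right) = 11.2838 \times \Gamma (1.5) \approx 11.2838 \times 0.88623 \approx 10.000$$

The same formulas give the coefficient of variation of the sojourn time for \(b = 2\), which does not depend on the scale parameter:

$$\frac{{\sqrt {{\text{Var}}(X_{i} )} }}{{EX_{i} }} = \sqrt {\frac{{\Gamma (1 + 2/b)}}{{\Gamma (1 + 1/b)^{2} }} - 1} = \sqrt {\frac{4}{\pi } - 1} \approx 0.5227$$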

We sample \(X_{1} ,\;X_{2}\) by the Monte Carlo technique and calculate the SF and MRL by (6) and (8). The convergence condition is that the coefficient of variation is less than 0.05. For a new component \(t_{0} = 0,t = t_{0} + 5\); for an old component, we presume \(\Delta_{m} = t_{0} = 4\), \(Z(\Delta_{m} ) = 1\), \(t = t_{0} + 5\) in case 1 and \(\Delta_{m} = 4\), \(t_{0} = 5\), \(Z(\Delta_{m} ) = 1\), \(t = t_{0} + 1\) in case 2. The results given by the proposed analytical method and the Monte Carlo method are contrasted in Table 2 where old1 and old2 represent the two different cases for an old component, respectively.
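A Monte Carlo estimate of this kind can be sketched as follows for \(n = 3\) and the parameters of (19) and (20). Several points are assumptions of the sketch and are not taken from the text: the function names are hypothetical, the coefficient of variation is interpreted as that of the running mean, and for an old component with \(Z(\Delta_{m} ) = 1\) the first sojourn time is drawn from the Weibull distribution truncated to \(X_{1} > \Delta_{m}\).

```python
import math
import random

ALPHA, BETA = 10.0, 2.0   # baseline parameters from (19)
A, B = 11.2838, 2.0       # sojourn-time parameters from (20)


def H(u, v, z):
    """Cumulative hazard over [u, v] in state z for (19): baseline times exp(z - 1)."""
    return math.exp(z - 1) * ((v / ALPHA) ** BETA - (u / ALPHA) ** BETA)


def path_survival(t0, t, T1, T2):
    """R(t | t0) from (6) for n = 3 with known transition times T1 < T2."""
    total, left, state = 0.0, t0, 1
    for T_i in (T1, T2):
        if T_i <= t0:          # transition already happened before t0
            state += 1
        elif T_i < t:          # transition inside (t0, t)
            total += H(left, T_i, state)
            left, state = T_i, state + 1
    return math.exp(-(total + H(left, t, state)))


def sample_x1_given_exceeds(delta_m, rng):
    """Draw X_1 conditioned on X_1 > delta_m (inverse transform, truncated Weibull)."""
    u = 1.0 - rng.random()     # u lies in (0, 1]
    return A * ((delta_m / A) ** B - math.log(u)) ** (1.0 / B)


def mc_survival(t0, t, delta_m, cv_target=0.05, batch=1000, seed=0,
                max_samples=10**6):
    """Average path_survival over sampled paths until the coefficient of
    variation of the running mean falls below cv_target (assumed criterion)."""
    rng = random.Random(seed)
    n, s1, s2 = 0, 0.0, 0.0
    mean = float("nan")
    while n < max_samples:
        for _ in range(batch):
            T1 = sample_x1_given_exceeds(delta_m, rng)
            T2 = T1 + rng.weibullvariate(A, B)
            r = path_survival(t0, t, T1, T2)
            n, s1, s2 = n + 1, s1 + r, s2 + r * r
        mean = s1 / n
        var = max(s2 / n - mean * mean, 0.0)
        if math.sqrt(var / n) / mean < cv_target:
            break
    return mean, n


for label, t0, t, dm in (("new", 0.0, 5.0, 0.0),
                         ("old1", 4.0, 9.0, 4.0),
                         ("old2", 5.0, 6.0, 4.0)):
    print(label, mc_survival(t0, t, dm))
```

Changing `seed` changes the estimate slightly even though the stopping rule is still met, which is the sample-to-sample variability that a closed-form evaluation avoids.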

Table 2 SF and MRL given by RIBAM and Monte Carlo method

From Table 2 we can see that the results obtained by the two methods are very close. Moreover, the proposed analytical formulas offer further advantages. Theoretically, the analytical formulas always provide a single deterministic value, whereas the value given by the Monte Carlo method varies with the designated simulation parameters. The Monte Carlo method is based on a process of "sampling-evaluation-convergence", and the accuracy of its results depends on the convergence condition and the number of samples. Although the uncertainty can be controlled within a given range, the evaluation results vary from one sample set to another. In other words, two sets of samples may give different results even if they both satisfy the convergence condition. On the other hand, the error in the proposed analytical method depends on rounding as long as the equations are stated correctly.

In the proposed model, the analytical method has a shorter calculation time and higher accuracy than the Monte Carlo method when \(n \le 4\). When \(n \ge 5\), the computation of the repeated integral can become a heavy burden, and the proposed analytical method no longer has a computation time advantage. However, the analytical formulas provide a higher accuracy for both n ≤ 4 and n ≥ 5. In this numerical example, the calculation time of SF is 0.18 s using the analytical method on a 2.6 GHz computer, while the time used by the Monte Carlo method varies between 0.5 s and 5 s.

In most practical cases, human operators would like to observe precise power transformer conditions. A large number of health condition stages will make the problem more complex and reduce the decision making efficiency. Moreover, the evaluation accuracy depends on the established model, monitoring data and parameters estimation. A large number of stages can reduce the evaluation efficiency and accuracy. Therefore, four stages of aging condition are deemed enough for the reliability evaluation of power transformers.

The DMDT models always assume that the equipment condition stays the same between two successive inspection instants to make the SF calculation tractable. The CMCT model, in turn, assumes that the equipment condition is always available. Hence, both models can produce errors. To illustrate the advantages of the proposed model, we concentrate on the equipment reliability in a single inspection interval. The survival function \(P\left( {T > t_{0} + \Delta |T > t_{0} ,\;Z(t_{0} ) = 1} \right)\) under \(\Delta = 0.4\) and \(\Delta = 0.8\) is shown in Table 3 and Figs. 3-5.

Table 3 SF and MRL given by analytical formulas and Monte Carlo method

Fig. 3 Survival function when \(\Delta = 0.4\)

From Table 3, Figs. 3 and 4, we can see that the SF obtained from DMDT is always larger than that from DMCT, which means that we will overestimate the reliability if we ignore the state transition between inspection points. At the inspection points the results obtained by the CMCT and DMCT models are the same which are shown in bold in Table 3. The difference between DMCT and CMCT models is that DMCT model is under discrete monitoring while CMCT model is under continuous monitoring. However, at inspection points the state is known and the uncertainty of \(Z\left( {t_{0} } \right)\) is eliminated. Thus the difference between DMCT and CMCT disappears and the DMCT model degrades into the CMCT model.
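The ordering between the DMDT and DMCT results can be seen from a short argument. This is a sketch under the assumption \(\gamma > 0\), which holds for the link function in (19); the superscripts DMCT and DMDT are labels introduced here for the two model outputs. In a pure birth process \(Z(s) \ge Z(\Delta_{m} )\) for every \(s \ge \Delta_{m}\), so the hazard along any path is at least the hazard with the state frozen at its last inspected value:

$$R^{\text{DMCT}} (t|t_{0} ) = E\left[ {\exp \left( { - \int_{{t_{0} }}^{t} {h_{0} (s)e^{\gamma Z(s)} {\text{d}}s} } \right)} \right] \le \exp \left( { - \int_{{t_{0} }}^{t} {h_{0} (s)e^{{\gamma Z(\Delta_{m} )}} {\text{d}}s} } \right) = R^{\text{DMDT}} (t|t_{0} )$$

where the expectation is taken over the sojourn times given \(Z(\Delta_{m} )\). Equality would require that no transition can occur in \((\Delta_{m} ,t)\).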

Fig. 4 Survival function when \(\Delta = 0.8\)

For non-inspection points, the SF calculated in the DMCT model is smaller than that calculated in the CMCT model, which is due to the possible state transition between the last inspection instant and the current time. The component reliability decreases with time in the long run. In addition, the DMCT curve shows that the reliability decreases more steeply when \(t_{0} - \Delta_{m}\) is larger. It is reasonable to expect that the longer the time between the last inspection instant \(\Delta_{m}\) and the current time \(t_{0}\), the bigger the error if we regard \(Z(t_{0} )\) as \(Z(\Delta_{m} )\). Moreover, an older component tends to show a larger error. This confirms the intuitive notion that the longer a component stays in a state (except the absorbing state), the higher the likelihood that it transits to another (worse) state, and the larger the error if we ignore the state transition within the next inspection interval.

Figure 5 shows how the health condition strongly affects the shape of SFs. The component condition transits from state 1 to state 2 at time 10. The worse health condition offers a sharper decrease in SF. The diversity of component health condition is not considered in the traditional exponential or the Weibull distribution which will lead to serious errors.

Fig. 5 Survival function under state transition at time 10

To illustrate the influence of the Weibull parameters of the sojourn time distribution (STD), we vary the shape parameter b from 0.5 to 5 and adjust the scale parameter a so that the expected sojourn time remains 10. We include the coefficient of variation (CV) of the sojourn time distribution in Table 4 to gain more insight. CV is commonly used to describe the dispersion of a distribution. In this case, we would like to observe how the distribution of the condition sojourn time changes the reliability indices even when the distributions have the same expected condition sojourn time.

Table 4 SF and MRL in different Weibull distribution parameters

The five groups of parameters \((a,b,{\text{CV}})\) are (5, 0.5, 2.2361), (8.8261, 0.8, 1.2605), (10, 1, 1), (11.2838, 2, 0.5227) and (10.8912, 5, 0.2290). In Table 4, even though the five groups of transformers have the same expected condition sojourn time, they have different reliability indices. A bigger CV always leads to a lower survival function and a shorter mean residual life.
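The scale parameter and CV for each shape parameter follow from the Weibull moment formulas. A minimal sketch, with hypothetical helper names, that recomputes them for a target mean of 10:

```python
import math


def weibull_scale_for_mean(mean, shape):
    """Scale a such that a * Gamma(1 + 1/shape) equals the target mean."""
    return mean / math.gamma(1.0 + 1.0 / shape)


def weibull_cv(shape):
    """Coefficient of variation of a Weibull distribution (independent of scale)."""
    g1 = math.gamma(1.0 + 1.0 / shape)
    g2 = math.gamma(1.0 + 2.0 / shape)
    return math.sqrt(g2 / g1 ** 2 - 1.0)


for b in (0.5, 0.8, 1.0, 2.0, 5.0):
    print(b, round(weibull_scale_for_mean(10.0, b), 4), round(weibull_cv(b), 4))
```

Since the CV depends only on the shape parameter, fixing the mean and varying b is enough to sweep the dispersion of the sojourn time.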

The survival functions are calculated for a new component. The five SFs are shown in Fig. 6. The semi-Markov process degenerates to a Markov process when \(b = 1\). This observation implies that errors are inevitable if we always assume that the state transition conforms to a Markov process. Another notable observation is that variation in the sojourn time distribution parameters leads to different SF curves, even though they all follow the Weibull distribution and have the same mean value. The equipment reliability declines more sharply as CV increases. It is reasonable to conclude that a larger variability yields a lower reliability and increases the maintenance cost. The MRLs shown in the last column of Table 4 indicate that a larger CV also means a shorter mean equipment life. The presented results are for n = 4. There are similar conclusions for \(n > 4\).
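The Markov special case can be made explicit with the sojourn time PDF in (20). For a sojourn time \(X\) that has already lasted \(s\), the probability of lasting a further \(u\) is:

$$P(X > s + u|X > s) = \exp \left[ {\left( {\frac{s}{a}} \right)^{b} - \left( {\frac{s + u}{a}} \right)^{b} } \right]$$

For \(b = 1\) this equals \(e^{ - u/a} = P(X > u)\), which does not depend on \(s\), so the sojourn time is exponential and memoryless. For \(b \ne 1\) the expression depends on the elapsed time \(s\), which is the history dependence that the semi-Markov process captures.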

Fig. 6 SF under different Weibull sojourn time distribution

5 Conclusion

We develop analytical formulas based on a more realistic DMCT model for evaluating the equipment reliability of deteriorating systems. The DMCT model assumption ensures that the results agree with the practice. The minimal errors between analytical formulas and Monte Carlo results imply the accuracy of the proposed method. Furthermore, the proposed method offers more realistic results in a shorter calculation time.

By comparing SF and MRL in the three models, we learn that the reliability will be overestimated if we apply a DMDT model between inspection points. That is, assuming \(Z(t_{0} ) = Z(\Delta_{m} )\) or \(Z(t) = Z(t_{0} )\) will bring inevitable errors. A longer inspection interval leaves more time for a transition between the last inspected state and the current state, and hence results in a larger error. For non-monitoring points, the DMCT results are different from those of CMCT. However, for the monitoring point, DMCT and CMCT models have the same results, i.e., the DMCT model degrades into the CMCT model. This also indicates that the CMCT model cannot conform to practical cases since it is unrealistic to obtain the health condition at all points. Another observation is that a larger CV always corresponds to a lower reliability despite the same state sojourn time expectation. We conclude that two sets of products can show different reliability results although they have the same s-expected state sojourn time. Steady quality (that is, a smaller CV) is essential to achieve a higher reliability. A greater variation of quality always tends to shorten the MRL and increase the cost.