1 Introduction

Availability represents the proportion of a system’s uptime out of the total time in service and is one of the most critical aspects of performance evaluation. Availability is commonly computed from the Mean Time to Failure (MTTF) and the Mean Time to Repair (MTTR). However, those “mean” values are normally “averaged”; thus, some useful information (e.g., trends, system complexity) may be neglected, and some problems may even be hidden.

Assessment of system availability has been studied from the design stage to the operational stage in various system configurations (e.g., in series, parallel, k-out-of-n, stand-by, multi-state, or mixed architectures). Approaches to assessing system availability mainly use either analytic or simulation techniques.

In general, analytic techniques represent the system using direct mathematical solutions from applied probability theory to make statements on various performance measures, such as the steady-state availability or the interval availability (Dekker and Groenendijk 1995; Ocnasu 2007). Researchers tend to use Markov models to assess dynamic availability or semi-Markov models using Laplace transforms to determine average performance measures (Dekker and Groenendijk 1995; Faghih-Roohi et al. 2014). However, such approaches have been criticised as too restrictive to tackle practical problems; they assume constant failure and repair rates which is not likely to be the case in the real world (Raje et al. 2000; Marquez et al. 2005). Furthermore, the time dependent availability obtained by a Markovian assumption is actually not valid for non-Markovian processes (Raje et al. 2000).

Simulation techniques estimate availability by simulating the actual process and random behaviour of the system. The advantage is that non-Markov failures and repair processes can be modelled easily (Raje et al. 2000). Recent research is working on developing Monte Carlo techniques to model the behaviour of complex systems under realistic time-dependent operational conditions (Marquez et al. 2005; Marquez and Iung 2007; Yasseri and Bahai 2018) or to model multi-state systems with operational dependencies (Zio et al. 2007). Although simulation is more flexible, it is computationally expensive.

Traditionally, Bayesian approaches have been used to assess system availability as they can solve the problem of complicated system state changes and computationally expensive simulation data; however, their development and application were stalled by the strict assumptions on prior forms and by computational difficulties. Research is more concerned with the prior’s selection or the posterior’s computation than the reality (Brender 1968a, b; Kuo 1985; Sharma and Bhutani 1993; Khan and Islam 2012).

The recent proliferation of Markov Chain Monte Carlo (MCMC) simulation techniques has led to the use of Bayesian inference in a wide variety of fields. Because MCMC makes high-dimensional numerical integration tractable (Lin 2014), the selection of prior information and descriptions of reliability/maintainability can be more flexible and more realistic.

This study proposes a new approach to system availability assessment: a parametric Bayesian approach with MCMC, with a focus on the operational stage, using both analytical and simulation methods. MTTF and MTTR are treated as distributions instead of being “averaged” by point estimation, which is closer to reality; in addition, the limitations of simulation data sample size are addressed by using MCMC techniques.

The rest of this paper is organized as follows. Section 2 describes the problem statement, the balling drum system, the data preparation, and the preliminary analysis of failure and repair data. Section 3 proposes a Bayesian Weibull model for MTTF and a Bayesian lognormal model for MTTR and explains how to use an MCMC computational scheme to obtain the parameters’ posterior distributions. Section 4 presents a case study, results, and discussion. Section 5 offers conclusions and suggestions for further study.

2 Problem statement

This section presents the study problem statement, the balling drum system and its configuration, the system availability framework, and data preparation; it performs a preliminary analysis of failure and repair data based on which parametric Bayesian models are constructed subsequently.

2.1 Balling drum systems in the mining industry

Our study is motivated by a balling drum system in the mining industry. The case study mine consists of five balling drums, labelled 1–5 (see Fig. 1). All five balling drums receive their feed for production in the same manner. Each balling drum is expected to produce the same amount of pellets at its maximum. According to the working mechanism and an i.i.d test, they are regarded as independent; if one of the balling drums breaks down, it does not affect the rest of the balling drums, except that total production will be reduced. One assumption is made here that the system will fail only if all subsystems fail; therefore, it is treated as a parallel system.

Fig. 1 Description of a balling drum and the system sketch

The availability of a single balling drum, denoted as A, can be computed by

$$A = \frac{MTTF}{MTTF + MTTR}$$
(1)

According to Fig. 1, the five balling drums are in parallel. The total system availability, \({\text{A}}_{\text{system}}\), can be calculated as

$$A_{system} = 1 - \mathop \prod \limits_{i = 1}^{5} (1 - A_{i} )$$
(2)
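Equations (1) and (2) can be sketched in a few lines of Python. This is only an illustration: the function names and the MTTF/MTTR values are hypothetical and are not taken from the case study data, and the units are assumed to fail independently, as stated above.

```python
from math import prod


def unit_availability(mttf: float, mttr: float) -> float:
    """Eq. (1): availability of a single unit, A = MTTF / (MTTF + MTTR)."""
    if mttf < 0 or mttr < 0 or mttf + mttr == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf / (mttf + mttr)


def parallel_availability(unit_availabilities) -> float:
    """Eq. (2): a parallel system is down only if every unit is down."""
    return 1.0 - prod(1.0 - a for a in unit_availabilities)


# Hypothetical numbers for illustration only: five identical drums,
# each with MTTF = 90 and MTTR = 10 (same time unit).
example_units = [unit_availability(90.0, 10.0) for _ in range(5)]
example_system = parallel_availability(example_units)
```

With these illustrative inputs each unit has availability 0.9, and the parallel system is unavailable only with probability 0.1 to the fifth power.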

2.2 Data preparation and preliminary analysis

The study uses the failure and repair data of the five balling drums from January 2013 to December 2018. There are 1782 records. In the first step, the null values are removed, and the data are reduced to 1774 records.

The next step reveals that the TTF and TTR of individual balling drums have different causes. For the TTR data, if a value of 150 is taken as the threshold for a normal shutdown (see Fig. 2), then shutdowns exceeding 150 should be treated as abnormal and investigated using Root Cause Analysis (RCA).

Fig. 2 Example of TTR data for balling drum 1

After checking the work order types of these abnormal data, it is found that most of them are caused by “preventive maintenance”, which may be due to a lack of maintenance resources. To simplify the study, we assume all maintenance resources are sufficient for “preventive maintenance”; thus, the abnormal data, which might be caused by a shortage of spare parts or skilled personnel, will not be treated specially in this paper.

To determine the baseline distribution of Time to Failure (TTF) and Time to Repair (TTR), we conduct a preliminary study of failure data and repair data using traditional analysis. In this preliminary study, several distributions are considered: exponential distribution, Weibull distribution, normal distribution, log-logistic distribution, lognormal distribution, and extreme value distribution. Table 1 lists the results.

Table 1 Preliminary study of failure data and repair data

Based on the results, the Weibull distribution and lognormal distribution are selected for the TTF and TTR for balling drums 1–5; these are applied to the parametric Bayesian models in the next section.

3 Parametric Bayesian Models

This section proposes a Bayesian Weibull model for TTF and a Bayesian lognormal model for TTR, and explains the MCMC computational scheme used to obtain the posterior distributions.

3.1 Markov Chain Monte Carlo with Gibbs sampling

The recent proliferation of Markov Chain Monte Carlo (MCMC) approaches has led to the use of Bayesian inference in a wide variety of fields. MCMC is essentially Monte Carlo integration using Markov chains. Monte Carlo integration draws samples from the required distribution and then forms sample averages to approximate expectations. MCMC draws these samples by running a cleverly constructed Markov chain for a long time. There are many ways of constructing these chains. The Gibbs sampler is one of the best known MCMC sampling algorithms in the Bayesian computational literature. It adopts the thinking of “divide and conquer”: i.e., when a set of parameters must be evaluated, the other parameters are assumed to be fixed and known. Let \(\uptheta = \left( {\uptheta_{1} , \ldots ,\uptheta_{\text{k}} } \right)\) be a k-dimensional vector of parameters, and let \({\text{f}}\left( {\uptheta_{\text{j}} } \right)\) denote the marginal distribution for the jth parameter. The basic scheme of the Gibbs sampler for sampling from \({\text{p}}\left(\uptheta \right)\) is given as follows:

  • Step 1. Choose an arbitrary starting point \(\theta^{\left( 0 \right)} = \left( {\theta_{1}^{\left( 0 \right)} , \ldots ,\theta_{k}^{\left( 0 \right)} } \right)\);

  • Step 2. Generate \(\theta_{1}^{\left( 1 \right)}\) from the conditional distribution \(f\left( {\theta_{1} |\theta_{2}^{\left( 0 \right)} , \ldots ,\theta_{k}^{\left( 0 \right)} } \right)\), and generate \(\theta_{2}^{\left( 1 \right)}\) from the conditional distribution \(f\left( {\theta_{2} |\theta_{1}^{\left( 1 \right)} ,\theta_{3}^{\left( 0 \right)} , \ldots ,\theta_{k}^{\left( 0 \right)} } \right)\);

  • Step 3. Generate \(\theta_{j}^{\left( 1 \right)}\) from \(f\left( {\theta_{j} |\theta_{1}^{\left( 1 \right)} , \ldots ,\theta_{j - 1}^{\left( 1 \right)} ,\theta_{j + 1}^{\left( 0 \right)} , \ldots ,\theta_{k}^{\left( 0 \right)} } \right)\);

  • Step 4. Generate \(\theta_{k}^{\left( 1 \right)}\) from \(f\left( {\theta_{k} |\theta_{1}^{\left( 1 \right)} ,\theta_{2}^{\left( 1 \right)} , \ldots ,\theta_{k - 1}^{\left( 1 \right)} } \right)\); the one-step transition from \(\theta^{\left( 0 \right)}\) to \(\theta^{\left( 1 \right)} = \left( {\theta_{1}^{\left( 1 \right)} , \ldots ,\theta_{k}^{\left( 1 \right)} } \right)\) has been completed, where \(\theta^{\left( 1 \right)}\) is one realization of the Markov chain;

  • Step 5. Go to Step 2, with \(\theta^{\left( 1 \right)}\) taking the place of \(\theta^{\left( 0 \right)}\).

After \({\text{t}}\) iterations, \(\uptheta^{{\left( {\text{t}} \right)}} = \left( {\uptheta_{1}^{{\left( {\text{t}} \right)}} , \ldots ,\uptheta_{\text{k}}^{{\left( {\text{t}} \right)}} } \right)\) can be obtained. Each component of \(\uptheta\) can also be obtained. Starting from different \(\uptheta^{\left( 0 \right)}\), as \({\text{t}} \to \infty\), the marginal distribution of \(\uptheta^{{\left( {\text{t}} \right)}}\) can be viewed as a stationary distribution based on the theory of the ergodic average. Then, the chain is seen as converging, and the sampling points are seen as observations of the sample.
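The scheme can be illustrated with a minimal Python sketch. The target used here, a bivariate standard normal with correlation `rho`, is an assumption chosen only because both full conditionals are known in closed form; it is not one of the models in this paper, and the function name and arguments are hypothetical.

```python
import random
from math import sqrt


def gibbs_bivariate_normal(rho, n_iter, burn_in, seed=0, start=(0.0, 0.0)):
    """Gibbs sampler for (theta1, theta2) ~ bivariate standard normal.

    Full conditionals: theta1 | theta2 ~ N(rho * theta2, 1 - rho^2),
    and symmetrically for theta2 | theta1.
    """
    rng = random.Random(seed)
    cond_sd = sqrt(1.0 - rho * rho)
    theta1, theta2 = start  # Step 1: arbitrary starting point
    draws = []
    for t in range(burn_in + n_iter):
        # Update each component given the latest value of the other one.
        theta1 = rng.gauss(rho * theta2, cond_sd)
        theta2 = rng.gauss(rho * theta1, cond_sd)
        # Keep the completed sweep only after the burn-in period.
        if t >= burn_in:
            draws.append((theta1, theta2))
    return draws
```

Averages over the retained draws then approximate expectations under the target, for example the mean of the products `theta1 * theta2` approximates `rho`.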

3.2 Bayesian Weibull model for TTF

Suppose the time to failure (TTF) data \({\text{t}} = \left( {{\text{t}}_{1} ,{\text{t}}_{2} , \ldots ,{\text{t}}_{\text{n}} } \right)^{\prime}\) for \({\text{n}}\) individuals are i.i.d, and each corresponds to a 2-parameter Weibull distribution \({\text{W}}\left( {\upalpha,\upgamma} \right)\), where \(\upalpha > 0\) and \(\upgamma > 0\). Then, the p.d.f. is \({\text{f}}\left( {{\text{t}}_{\text{i}} |\upalpha,\upgamma} \right) =\upalpha \upgamma {\text{t}}_{\text{i}}^{{{\upalpha} - 1}} { \exp }\left( { - {\upgamma \text{t}}_{\text{i}}^{{\upalpha}} } \right)\), while the c.d.f. is \({\text{F}}\left( {{\text{t}}_{\text{i}} |{\upalpha},{\upgamma}} \right) = 1 - { \exp }\left( { - {\upgamma \text{t}}_{\text{i}}^{{\upalpha}} } \right)\). The reliability function is \({\text{R}}\left( {{\text{t}}_{\text{i}} |{\upalpha},{\upgamma}} \right) = { \exp }\left( { - {\upgamma \text{t}}_{\text{i}}^{{\upalpha}} } \right)\).

Denote the observed data set as \({\text{D}}_{0} = \left( {{\text{n}},{\text{t}}} \right).\) Therefore, the likelihood function for \({\upalpha}\) and \({\upgamma}\) is

$$L\left( {\alpha ,\gamma |D_{0} } \right) = \mathop \prod \limits_{i = 1}^{n} f\left( {t_{i} |\alpha ,\gamma } \right) = \mathop \prod \limits_{i = 1}^{n} \alpha \gamma t_{i}^{\alpha - 1} exp\left( { - \gamma t_{i}^{\alpha } } \right)$$
(3)

In this study, we assume that \(\upalpha\) follows a gamma distribution (Kuo 1985), denoted by \({\text{G}}\left( {{\text{a}}_{0} ,{\text{b}}_{0} } \right)\) as its prior distribution, written as \({\uppi}\left( {{\upalpha}|{\text{a}}_{0} ,{\text{b}}_{0} } \right)\); we assume that \({\upgamma}\) follows a gamma distribution denoted by \({\text{G}}\left( {{\text{c}}_{0} ,{\text{d}}_{0} } \right)\) as its prior distribution, written as \({\uppi}\left( {{\upgamma}|{\text{c}}_{0} ,{\text{d}}_{0} } \right).\) This means

$$\pi \left( {\alpha |a_{0} ,b_{0} } \right) \propto \alpha^{{a_{0} - 1}} exp\left( { - b_{0} \alpha } \right)$$
(4)
$$\pi \left( {\gamma |c_{0} ,d_{0} } \right) \propto \gamma^{{c_{0} - 1}} exp\left( { - d_{0} \gamma } \right)$$
(5)

Therefore, the joint posterior distribution can be obtained according to Eqs. (3)–(5) as

$$\pi \left( {\alpha ,\gamma |D_{0} } \right) \propto L\left( {\alpha ,\gamma |D_{0} } \right) \times \pi \left( {\alpha |a_{0} ,b_{0} } \right) \times \pi \left( {\gamma |c_{0} ,d_{0} } \right),$$
(6)

and the parameters’ full conditional distribution with Gibbs sampling can be written as

$$\pi \left( {\alpha_{j} |\alpha^{{\left( { - j} \right)}} ,\gamma ,D_{0} } \right) \propto L\left( {\alpha ,\gamma |D_{0} } \right) \times \alpha^{{a_{0} - 1}} exp\left( { - b_{0} \alpha } \right)$$
(7)
$$\pi \left( {\gamma_{j} |\alpha ,\gamma^{{\left( { - j} \right)}} ,D_{0} } \right) \propto L\left( {\alpha ,\gamma |D_{0} } \right) \times \gamma^{{c_{0} - 1}} exp\left( { - d_{0} \gamma } \right)$$
(8)
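Equations (7) and (8) can be sampled in several ways; the following Python sketch shows one possibility and is not the WinBUGS implementation used in the case study. It relies on two observations that follow from Eqs. (3)–(5): the conditional of \(\gamma\) in Eq. (8) is proportional to \(\gamma^{c_{0} + n - 1} \exp [ - \gamma (d_{0} + \sum t_{i}^{\alpha } )]\), i.e. a gamma distribution, while the conditional of \(\alpha\) in Eq. (7) has no standard form. The random-walk Metropolis step used for \(\alpha\), the step size, and all function names are assumptions made for illustration.

```python
import math
import random


def log_cond_alpha(alpha, gamma, sum_log_t, t, a0, b0):
    """Log of Eq. (7) up to an additive constant."""
    n = len(t)
    return ((n + a0 - 1.0) * math.log(alpha)
            + (alpha - 1.0) * sum_log_t
            - gamma * sum(x ** alpha for x in t)
            - b0 * alpha)


def sample_weibull_posterior(t, n_iter=5000, burn_in=1000,
                             a0=1e-4, b0=1e-4, c0=1e-4, d0=1e-4,
                             step=0.1, seed=0):
    """Metropolis-within-Gibbs draws of (alpha, gamma); t must be positive."""
    rng = random.Random(seed)
    n = len(t)
    sum_log_t = sum(math.log(x) for x in t)
    alpha = 1.0  # arbitrary starting value
    draws = []
    for it in range(burn_in + n_iter):
        # Gibbs step, Eq. (8): gamma | alpha ~ G(c0 + n, d0 + sum t_i^alpha).
        rate = d0 + sum(x ** alpha for x in t)
        gamma = rng.gammavariate(c0 + n, 1.0 / rate)  # second argument is a scale
        # Metropolis step, Eq. (7): random walk on log(alpha).
        proposal = alpha * math.exp(step * rng.gauss(0.0, 1.0))
        log_ratio = (log_cond_alpha(proposal, gamma, sum_log_t, t, a0, b0)
                     - log_cond_alpha(alpha, gamma, sum_log_t, t, a0, b0)
                     + math.log(proposal) - math.log(alpha))  # Jacobian term
        if rng.random() < math.exp(min(0.0, log_ratio)):
            alpha = proposal
        if it >= burn_in:
            draws.append((alpha, gamma))
    return draws
```

Called as `sample_weibull_posterior(ttf_values)`, where `ttf_values` is a hypothetical list of positive TTF observations, the function returns a list of `(alpha, gamma)` pairs whose averages approximate the posterior means.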

3.3 Bayesian Lognormal model for TTR

Suppose the time to repair (TTR) data \({\text{t}} = \left( {{\text{t}}_{1} ,{\text{t}}_{2} , \ldots ,{\text{t}}_{\text{n}} } \right)^{\prime}\) for \({\text{n}}\) individuals are i.i.d., and each \({ \ln }\left( {\text{t}} \right)\) corresponds to a normal distribution, \({\text{N}}\left( {{\upmu},{\upsigma}^{2} } \right)\). Then \({\text{t}}_{\text{i}}\) follows a lognormal distribution with parameters \({\upmu}\) and \({\upsigma}^{2}\), and the p.d.f. and c.d.f. are given by Eqs. (9) and (10):

$$f\left( {t_{i} |\mu ,\sigma^{2} } \right) = \frac{1}{{\sqrt {2\pi } \sigma t_{i} }}exp\left\{ { - \frac{1}{{2\sigma^{2} }}\left[ {ln\left( {t_{i} } \right) - \mu } \right]^{2} } \right\}$$
(9)
$$F\left( {t_{i} |\mu ,\sigma^{2} } \right) = {\Phi }\left[ {\frac{{ln\left( {t_{i} } \right) - \mu }}{\sigma }} \right]$$
(10)

Denote the observed data set as \({\text{D}}_{0} = \left( {{\text{n}},{\text{t}}} \right)\). Therefore, according to Eq. (9), the likelihood function for \({\upmu}\) and \({\upsigma}\) becomes

$$L\left( {\mu ,\sigma |D_{0} } \right) = \mathop \prod \limits_{i = 1}^{n} f\left( {t_{i} |\mu ,\sigma^{2} } \right)$$
(11)

In this study, we assume that \({\upmu}\) follows a normal distribution denoted by \({\text{N}}\left( {{\text{e}}_{0} ,{\text{f}}_{0} } \right)\) as its prior distribution, written as \({\uppi}\left( {{\upmu}|{\text{e}}_{0} ,{\text{f}}_{0} } \right)\); we assume that \({\upsigma}\) follows a gamma distribution denoted by \({\text{G}}\left( {{\text{g}}_{0} ,{\text{h}}_{0} } \right)\) as its prior distribution, written as \({\uppi}\left( {{\upsigma}|{\text{g}}_{0} ,{\text{h}}_{0} } \right).\) This means

$$\pi \left( {\mu |e_{0} ,f_{0} } \right) \propto f_{0}^{{\frac{1}{2}}} exp\left[ { - \frac{{f_{0} }}{2}\left( {\mu - e_{0} } \right)^{2} } \right]$$
(12)
$$\pi \left( {\sigma |g_{0} ,h_{0} } \right) \propto \sigma^{{g_{0} - 1}} exp\left( { - h_{0} \sigma } \right)$$
(13)

Therefore, the joint posterior distribution can be obtained according to Eqs. (11)–(13) as

$$\pi \left( {\mu ,\sigma |D_{0} } \right) \propto L\left( {\mu ,\sigma |D_{0} } \right) \times \pi \left( {\mu |e_{0} ,f_{0} } \right) \times \pi \left( {\sigma |g_{0} ,h_{0} } \right)$$
(14)

Then, the parameters’ full conditional distribution with Gibbs sampling can be written as

$${\uppi}\left( {\mu_{j} |\mu^{{\left( { - j} \right)}} ,\sigma ,D_{0} } \right) \propto L\left( {\mu ,\sigma |D_{0} } \right) \times f_{0}^{{\frac{1}{2}}} exp\left[ { - \frac{{f_{0} }}{2}\left( {\mu - e_{0} } \right)^{2} } \right]$$
(15)
$${\uppi}\left( {\sigma_{j} |\mu ,\sigma^{{\left( { - j} \right)}} ,D_{0} } \right) \propto L\left( {\mu ,\sigma |D_{0} } \right) \times \sigma^{{g_{0} - 1}} exp\left( { - h_{0} \sigma } \right)$$
(16)
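As a worked step that follows from Eqs. (9), (11), and (15) but is not spelled out above, the full conditional of \(\mu\) can be brought into a standard form. The symbols \(p_{n}\) and \(m_{n}\) are introduced here for illustration only. Keeping only the factors of the likelihood that involve \(\mu\) or \(\sigma\) gives

$$L\left( {\mu ,\sigma |D_{0} } \right) \propto \sigma^{ - n} exp\left\{ { - \frac{1}{{2\sigma^{2} }}\mathop \sum \limits_{i = 1}^{n} \left[ {ln\left( {t_{i} } \right) - \mu } \right]^{2} } \right\}$$

Multiplying by the prior in Eq. (12) and completing the square in \(\mu\) yields

$$\pi \left( {\mu |\sigma ,D_{0} } \right) \propto exp\left[ { - \frac{{p_{n} }}{2}\left( {\mu - m_{n} } \right)^{2} } \right],\quad p_{n} = f_{0} + \frac{n}{{\sigma^{2} }},\quad m_{n} = \frac{1}{{p_{n} }}\left[ {f_{0} e_{0} + \frac{1}{{\sigma^{2} }}\mathop \sum \limits_{i = 1}^{n} ln\left( {t_{i} } \right)} \right]$$

That is, given \(\sigma\), \(\mu\) is normal with mean \(m_{n}\) and precision \(p_{n}\) (variance \(1/p_{n}\)), so it can be drawn directly in a Gibbs step. Because the gamma prior in Eq. (13) is placed on \(\sigma\) itself, the conditional in Eq. (16) does not reduce to a standard distribution in the same way and is known only up to a normalizing constant.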

4 Case study

This section presents a case study; it explains the procedure, gives the results, and offers a discussion.

4.1 The procedure

The procedure applied in this case study to assess the system availability of the mine’s five balling drums has a total of seven steps, as described in Table 2.

Table 2 Steps in the system availability assessment

4.2 Results

In this case study, the calculations are implemented with WinBUGS. Three Markov chains are constructed for each MCMC simulation. A burn-in of 1000 samples is used, with an additional 10,000 Gibbs samples for each Markov chain.

Vague prior distributions are adopted as follows:

  • For Bayesian Weibull model using TTF data:

    $$\alpha \sim G\left( {0.0001,0.0001} \right),\quad \gamma \sim G\left( {0.0001,0.0001} \right)$$
  • For Bayesian lognormal model using TTR data:

    $$\mu \sim N\left( {0,0.0001} \right),\quad \sigma \sim G\left( {0.0001,0.0001} \right).$$

Using the convergence diagnostics [i.e. checking dynamic traces in Markov chains, determining time series and Gelman–Rubin–Brooks (GRB) statistics, and comparing MC error with standard deviation (SD)] (Lin 2014), we consider the following posterior distribution summaries for our models (see Tables 3, 4), including the parameters’ posterior distribution mean, SD, Monte Carlo error (MC error), and 95% highest posterior density (HPD) interval.
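To make the multi-chain convergence check concrete, the following Python sketch computes the classic variance-based potential scale reduction factor of Gelman and Rubin for one parameter. This is an assumption about how such a check could be reproduced outside WinBUGS; it is not necessarily identical to the statistic WinBUGS reports, and the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def gelman_rubin(chains):
    """Potential scale reduction factor for equal-length chains of one parameter."""
    m = len(chains)
    n = len(chains[0])
    if m < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least two chains of equal length >= 2")
    chain_means = [mean(c) for c in chains]
    # W: average within-chain variance; B: between-chain variance scaled by n.
    w = mean([variance(c) for c in chains])
    b = n * variance(chain_means)
    # Pooled estimate of the posterior variance.
    var_plus = (n - 1) / n * w + b / n
    return sqrt(var_plus / w)
```

Values close to 1 suggest that the chains, started from different points, are sampling the same distribution; values well above 1 suggest that a longer burn-in or more iterations are needed.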

Table 3 Posterior statistics in Bayesian Weibull model for TTF
Table 4 Posterior statistics in Bayesian lognormal model for TTR

Using the results from Tables 3 and 4, we calculate the availability of individual balling drums in Table 5, where MTTF = \({\text{E}}\left[ {{\text{f}}\left( {{\text{t}}_{\text{i}} |{\upalpha},{\upgamma}} \right)} \right]\), and MTTR = \({\text{E}}\left[ {{\text{f}}\left( {{\text{t}}_{\text{i}} |{\upmu},{\upsigma}^{2} } \right)} \right]\).
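Under the parameterizations of Sects. 3.2 and 3.3, both expectations have closed forms: the Weibull mean is \(\gamma^{ - 1/\alpha } \Gamma (1 + 1/\alpha )\) and the lognormal mean is \(exp(\mu + \sigma^{2} /2)\). The Python sketch below applies them to posterior draws so that availability is obtained as a set of values rather than a single number. The function names and the toy draws are hypothetical and are not results from Tables 3–5; pairing the draws one-to-one is also an assumption.

```python
from math import exp, gamma as gamma_fn


def weibull_mttf(alpha: float, gamma: float) -> float:
    """Mean of the Weibull with p.d.f. alpha*gamma*t^(alpha-1)*exp(-gamma*t^alpha)."""
    return gamma ** (-1.0 / alpha) * gamma_fn(1.0 + 1.0 / alpha)


def lognormal_mttr(mu: float, sigma: float) -> float:
    """Mean of the lognormal whose logarithm is N(mu, sigma^2)."""
    return exp(mu + 0.5 * sigma ** 2)


def availability_draws(weibull_draws, lognormal_draws):
    """Apply Eq. (1) to paired posterior draws; one availability per pair."""
    values = []
    for (alpha, gamma), (mu, sigma) in zip(weibull_draws, lognormal_draws):
        mttf = weibull_mttf(alpha, gamma)
        mttr = lognormal_mttr(mu, sigma)
        values.append(mttf / (mttf + mttr))
    return values


# Toy draws for illustration only (not results from the case study).
toy_weibull = [(1.0, 0.5), (1.0, 0.25)]    # (alpha, gamma)
toy_lognormal = [(0.0, 1.0), (0.0, 1.0)]   # (mu, sigma)
toy_availability = availability_draws(toy_weibull, toy_lognormal)
```

The mean and an interval of the resulting values then summarize the availability of one balling drum.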

Table 5 Statistics of individual availability

According to Eq. (2), the system availability of the five balling drums is

$$A_{system} = 1 - \mathop \prod \limits_{i = 1}^{5} (1 - A_{i} ) \approx 0.99.$$

4.3 Discussion

Compared to the traditional method of assessing availability in Eq. (1), the proposed approach extends the method to Eq. (17), where

$$A = \frac{{E\left[ {f\left( {TTF} \right)} \right]}}{{E\left[ {f\left( {TTF} \right)} \right] + E\left[ {f\left( {TTR} \right)} \right]}} = \frac{{E\left[ {f\left( {t_{i} |\alpha ,\gamma } \right)} \right]}}{{E\left[ {f\left( {t_{i} |\alpha ,\gamma } \right)} \right] + E\left[ {f\left( {t_{i} |\mu ,\sigma^{2} } \right)} \right]}}.$$
(17)

Equation (17) shows the flexibility of assessing availability according to reality. For one thing, the parametric Bayesian models using MCMC make the calculation of posteriors more feasible. More importantly, however, parametric Bayesian models can be applied to predict TTF, TTR, and system availability in the future.

In this study, since the five balling drums are relatively new, the gamma distributions and normal distributions are selected as vague priors due to lack of prior information. This could be improved with more historical data/experience.

The system configurations could be extended to other more complex architectures (series, k-out-of-n, stand-by, multi-state, or mixed) by modifying Eq. (2).
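As an illustration of such a modification, the Python sketch below replaces Eq. (2) with series and k-out-of-n formulas. It assumes, as in Sect. 2.1, that the units fail independently and that each unit is either up or down; the function names and the availabilities in the usage note are hypothetical.

```python
from math import prod


def series_availability(unit_availabilities) -> float:
    """Series system: up only if every unit is up."""
    return prod(unit_availabilities)


def k_out_of_n_availability(unit_availabilities, k: int) -> float:
    """System is up if at least k of the n independent units are up."""
    # dist[j] holds the probability that exactly j of the units seen so far are up.
    dist = [1.0]
    for a in unit_availabilities:
        new_dist = [0.0] * (len(dist) + 1)
        for j, prob in enumerate(dist):
            new_dist[j] += prob * (1.0 - a)   # this unit is down
            new_dist[j + 1] += prob * a       # this unit is up
        dist = new_dist
    return sum(dist[k:])
```

With `k = 1` the function reduces to the parallel case of Eq. (2), and with `k` equal to the number of units it reduces to the series case; for example, hypothetical availabilities `[0.9, 0.8, 0.7]` with `k = 2` give 0.902.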

The data analysis reveals that for TTF data, the shape parameter for the Weibull distribution is less than 1. This implies a decreasing failure rate (as in the early stage of the bathtub curve), which is not consistent with typical experience of mechanical equipment. The TTF data include not only corrective maintenance but also preventive maintenance. In this case study, a high percentage of TTF work orders are for preventive maintenance. The decreasing trend also indicates that a possible way to improve TTF is to improve the preventive maintenance plan.

The seven steps in Table 2 can be grouped into stages: Step 1 to Step 4 can be treated as the Plan stage, Step 5 and Step 6 as the Do and Check stages, and Step 7 as the Act stage. The outputs from Step 7 could become inputs to Step 2 for the next calculation period. This means the seven steps follow the “PDCA” cycle, and the results could be continuously improved.

5 Conclusions

This study proposes a parametric Bayesian approach for system availability assessment at the operational stage. MCMC is adopted to take advantage of both analytical and simulation methods.

In this approach, MTTF and MTTR are treated as distributions instead of being “averaged” by a point estimate. This better reflects reality; in addition, the limitations of simulation data sample size are compensated for by MCMC techniques.

In the case study, TTF and TTR are determined using a Bayesian Weibull model and a Bayesian lognormal model. The results show that the proposed approach can integrate the analytical and simulation methods for system availability assessment and could be applied to other technical problems in asset management (e.g., other industries, other systems).