Introduction

With advances in technology, automated machines have become common in many areas of practical utility, and our dependence on them has grown accordingly. An interruption due to machine failure not only degrades the quality of the service the machines provide but also increases the operating cost of the machining system. Machine interference is one of the key problems in many industries, such as manufacturing systems, communication systems, computer systems and transportation. Interference in normal functioning occurs when a machine stops and cannot resume operation until it is attended by a repairman. When a repairman finds more machines awaiting repair than he can handle at a time, the problem of machine interference arises. Owing to cost and technical constraints, the trade-off between the repairman staffing level and the magnitude of machine interference has become an important issue and has drawn the attention of many queueing theorists, who have treated the machine interference problem as a finite source queueing model. In the recent past, the contributions of Jain (1997), Yang et al. (2005), Ke and Lin (2008), Jain et al. (2008) and many others are noteworthy in this regard. Surveys of machine interference problems were carried out by Haque and Armstrong (2007) and Jain et al. (2010). Wang et al. (2013) performed a comparative analysis of the machine repair problem with imperfect coverage and service pressure condition.

The loss of production while broken-down machines are being attended by the repairman can be reduced to some extent by providing spare part support. Based on their failure characteristics, spare units can be categorized into three types: (1) cold, (2) warm and (3) hot. While not in use, cold spare units do not fail, whereas the failure rate of warm (hot) spare units is less than (equal to) the failure rate of operating units. Several papers in the queueing and reliability literature have explored various aspects of machine repair problems with standbys in different contexts. Sivazlian and Wang (1989), Wang (1995), Wang and Kuo (2000) and Wang and Chang (2002) carried out cost and probabilistic analyses of machining systems with standby components. Yuan and Meng (2011) performed a reliability analysis of a warm standby repairable system with priority in use. Jain (2013) suggested a numerical approach based on the Runge–Kutta method to compute the transient performance indices of machining systems with mixed standbys by incorporating the features of service interruption and priority.

In modeling queueing problems, many researchers have developed more realistic models by incorporating the concept of server vacation. However, only a limited number of papers have appeared on machine repair models with spares provisioning that include vacations of the repairmen. Gupta (1997) considered the machine interference problem with warm spares, server vacations and exhaustive service. Ke and Wang (2007) suggested vacation policies for the machine repair problem with two types of spares. Wang et al. (2006, 2009) suggested optimal management of the machine repair problem with working vacation and used Newton’s method to obtain the solution. Ke et al. (2011) made an algorithmic analysis of an unreliable server machine repair system with spares by developing a multi-server synchronous vacation model with service interruptions due to server failure. Ke and Wu (2012) and Ke et al. (2013) developed a multi-server machine repair model with standbys and synchronous multiple vacations.

To deal with more realistic scenarios of machining systems, the behavior of the customers and the caretaker should be taken into account in the performance analysis. Machine repair problems incorporating the concepts of balking and reneging have been investigated by many researchers (Shawky 1997, 2000; Wang and Ke 2003; Jain et al. 2003; Sharma et al. 2004). Wang et al. (2011) performed a cost-benefit analysis of a machining system with warm standby components and variable servers by incorporating the concept of balking. The concept of common-cause failure, which arises in many real-time machining systems, has also been studied extensively (Platz 1984; Mosleh 1991; Pan and Nonaka 1995). The redundancy provision in K-r-out of-N: G configuration machining systems under the assumption of common-cause failure has been investigated by Reddy (1993), Jain and Ghimire (1997), Jain et al. (2002), and many others. Jain and Mishra (2006) analyzed the system characteristics of a multistage degraded machining system with common-cause shock failure and state dependent rates. El-Damcese (2009) investigated the performance indices of warm standby systems subject to common-cause failures with time varying failure and repair rates. The effect of common-cause failures as a major issue in the safety of machining systems was examined by Ilavsky et al. (2013). Mishra and Jain (2013) studied the effect of common-cause failure on the maintainability of a deteriorating system having an inspection provision.

Due to wear and tear or other technical faults, the servers may be prone to partial or complete failure. In case of partial failure, a server continues to operate but its failure rate increases, which can eventually lead to complete failure. Overload caused by operating with fewer components than are required for normal operation also adversely affects the performance of the machining system. In a machine repair system, the server repairing the failed machines may itself break down due to overload or prolonged operation. Ke and Lin (2008) discussed the sensitivity analysis of machine repair problems in manufacturing systems with service interruptions due to server failure. Yue et al. (2009) studied a heterogeneous two-server queueing system with balking and server breakdowns. In many multi-component machining systems, achieving high availability through redundancy is rather difficult and sometimes impossible. The availability and efficiency of a machining system can also be enhanced by improving its maintainability. To reduce the backlog of failed units in such systems and to achieve a pre-specified availability, a better maintenance facility can be provided by adding a repairman. In real-time systems, the working capacities of two repairmen may not be the same. Jain et al. (2004) studied an (N, L) switch-over policy for a two heterogeneous repairmen machine repair model with warm standbys and vacation. The provision of two heterogeneous servers and vacation was considered by Kumar and Jain (2013) for the (m, M) machine repair problem with spares and switching failure.

In this investigation, we develop an (m, M) Markov model for a multi-component machining system with mixed warm standby provisioning under the care of two heterogeneous servers. To make the model more realistic, we assume that both the primary and the secondary server are unreliable and subject to breakdown, individually or simultaneously due to a common cause. The primary server can also work at a slower rate in a partially failed condition. To illustrate the practical applicability of our model, consider a power plant having M operating nuclear turbine generators (i.e. base units) and \( S_{1} \) and \( S_{2} \) standby gas turbine generators of type 1 and type 2, respectively (having different failure characteristics). A type 1 standby gas turbine generator is used first when any operating generator fails. When all \( S_{1} \) type 1 gas turbine generators have been used and another operating generator fails, it is replaced by a type 2 gas turbine generator, if available. When all standby generators of both types have been used and fewer than M but at least m generators are functioning in the power plant, the failure rate of the operating generators increases due to overload. Two dissimilar servers repair the failed generators at different rates. The lifetimes of the nuclear/gas turbine generators and of the servers are exponentially distributed. A server responsible for maintenance of the system may also be unavailable due to illness or a prior commitment to some other job. Both the primary and the secondary server may become unavailable individually, or simultaneously due to some common cause. The primary service engineer may be partially available and then repairs the failed generators at a reduced rate. From the partially available state, he either becomes fully available after some treatment or goes for complete rest and becomes completely unavailable to repair the failed generators. The secondary service engineer either provides repair if available or may become completely inoperative due to breakdown. The unavailable service engineers recover their repair capability after random intervals of time that are exponentially distributed.

The rest of the paper is organized as follows. The model is described, together with the requisite assumptions and notations, in “Model description and assumptions”. In “Steady state equations”, the governing equations are constructed with the help of state dependent failure and repair rates. Various performance measures in terms of steady-state probabilities are obtained in “Some performance indices”. A sensitivity analysis examining the effect of various parameters on the system performance is presented in “Numerical results”. Finally, conclusions are drawn in “Conclusion”.

Model description and assumptions

In this section, we describe the machining system and its components. For the mathematical modeling, the basic factors associated with the machine repair problem under consideration are stated in terms of the requisite assumptions and notations.

Consider an (m, M) machining system having two servers. The primary server can fail partially as well as fully, whereas the secondary server can only fail completely. To support the system, two types of warm standbys are provided. The lifetimes of operating units, standby units and servers are exponentially distributed. The failure rate of type 1 (2) standby units is \( \alpha_{1} (\alpha_{2} ) \), which is less than the failure rate λ of operating units. The system can also fail due to a common cause. We use the following additional assumptions to formulate the model mathematically:

  • When primary server is functioning, the secondary server works as standby.

  • Both servers provide repair according to exponential distribution.

  • The primary server can work in both normal and degraded (i.e. partially failed) modes, whereas the secondary server can work only in normal mode.

  • If primary server fails partially, the secondary server turns on if there is any machine to be repaired and turns off when the queue becomes empty.

  • There is a need of repair to the broken-down server to restore its operating state.

  • The repair time of broken-down server is assumed to be exponentially distributed.

  • A type 1 (warm) standby unit, if available, replaces a failed operating unit and then has the same failure characteristics as an operating unit. When the type 1 standby units are exhausted, a type 2 standby unit is used to replace the failed operating unit.

  • When both types of standby units have been used and operating units fail, the system continues to function in degraded mode as long as at least m (< M) operating units are present in the system. As soon as there are fewer than m operating units in the machining system, it fails.

  • The switch-over time from standby to operating state of the units is assumed to be negligible.
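
The standby replacement policy and the (m, M) failure condition stated in the assumptions above can be sketched as a small bookkeeping function. This is only an illustrative sketch: the function name `unit_configuration` and the sample values (in particular m = 4) are assumptions made here for illustration and do not come from the model itself.

```python
def unit_configuration(n, M, S1, S2, m):
    """Return (operating, type1_left, type2_left, system_up) for n failed units.

    Assumes type 1 standbys are consumed first, then type 2, and that the
    system stays up while at least m units are operating.
    """
    S = S1 + S2
    # Type 1 standbys are used first.
    type1_left = max(S1 - n, 0)
    # Type 2 standbys are used only after type 1 units are exhausted.
    type2_left = max(S2 - max(n - S1, 0), 0)
    # Once all standbys are used, every further failure reduces the operating units.
    operating = M if n <= S else M + S - n
    return operating, type1_left, type2_left, operating >= m


# Hypothetical illustration: M = 6, S1 = 2, S2 = 3 and an assumed m = 4.
for n in (0, 3, 7, 8):
    print(n, unit_configuration(n, M=6, S1=2, S2=3, m=4))
```

With these sample values the system remains in (degraded) operation up to n = 7 failed units and is counted as failed at n = 8, which corresponds to the state K = M + S − m + 1.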

To describe the model, the following notations are used:

M :

Number of operating units in the system

\( S_{1} (S_{2}) \) :

Number of warm spare units of type 1(2) in the system

m :

Minimum number of operating units required for the system to function

\( \lambda (\lambda_{c}) \) :

Failure rate (common-cause failure rate) of operating units in the system

\( \lambda_{d} \) :

Degraded failure rate of operating units when all warm spares are utilized

\( b_{c} (b_{cp}) \) :

Common-cause failure rate of the servers when the first server is in working (partially failed) state

\( b_{1} (b_{2}) \) :

Failure rate of first (second) server

\( b^{\prime}_{1p} \) :

Partial failure rate of first server when second server is in breakdown state

\( b_{1p} \) :

Partial failure rate of first server when second server is in working state

\( r_{c} \) :

Common cause repair rate of both servers

\( r_{cp} \) :

Common cause repair rate of partially failed first server when second server is in working state

\( r_{1} (r_{2}) \) :

Repair rate of first (second) server

\( \mu_{1} \) :

Repair rate of the servers when first server is in working state and second server is in breakdown state

\( \mu_{2} \) :

Repair rate of servers when first server is in breakdown state and second server is in working state

\( P_{i,0,1} \) :

Probability that there are i failed units in the system and first server is in breakdown state while the second one is in working state

\( P_{i,0,0} \) :

Probability that both servers are in breakdown state and there are i failed units in the system

\( P_{i,1,1} \) :

Probability that both servers are in working state and there are i failed units in the system

\( P_{i,1,0} \) :

Probability that there are i failed units in the system and first server is in working state while the second one is in breakdown state

\( P_{i,p,1} \) :

Probability that there are i failed units in the system and first server is in partially failed state while the second one is in working state

\( P_{i,p,0} \) :

Probability that there are i failed units in the system and first server is in partially failed state while the second one is in breakdown state

Steady state equations

In this section, the machine repair problem under consideration is formulated mathematically by constructing the Chapman–Kolmogorov equations for the system state probabilities. To construct the difference equations governing the model, we define the failure rates as follows:

$$ \lambda_n = \left\{ \begin{gathered} M\lambda + (S_{1} - n)\alpha_{1} + S_{2} \alpha_{2} + \lambda_c ,\quad 0 \le n < S_{1} \hfill \\ M\lambda + (S_{1} + S_{2} - n)\alpha_{2} + \lambda_c ,\quad S_{1} \le n < S_{1} + S_{2} \equiv S \hfill \\ (M + S - n)\lambda_d + \lambda_c ,\quad \quad \quad S \le n \le M + S - m = K - 1 \hfill \\ 0,\quad \quad \quad \quad \quad \quad \quad \quad \quad \quad {\text{Otherwise}} \hfill \\ \end{gathered} \right. $$
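
The piecewise definition of the state-dependent failure rate above translates directly into code. The following is a minimal sketch; the function name `failure_rate` and all sample parameter values in the usage lines are hypothetical and chosen only to exercise each branch.

```python
def failure_rate(n, M, S1, S2, m, lam, alpha1, alpha2, lam_d, lam_c):
    """State-dependent failure rate lambda_n for n failed units."""
    S = S1 + S2
    if 0 <= n < S1:
        # Type 1 standbys still available.
        return M * lam + (S1 - n) * alpha1 + S2 * alpha2 + lam_c
    if S1 <= n < S:
        # Only type 2 standbys remain.
        return M * lam + (S - n) * alpha2 + lam_c
    if S <= n <= M + S - m:
        # All standbys used: remaining operating units fail at the degraded rate.
        return (M + S - n) * lam_d + lam_c
    return 0.0  # no further failures once the system has failed


# Hypothetical parameter values, for illustration only.
params = dict(M=6, S1=2, S2=3, m=4, lam=0.3, alpha1=0.1, alpha2=0.2,
              lam_d=0.4, lam_c=0.0)
rates = [failure_rate(n, **params) for n in range(9)]
print(rates)
```

For these sample values the last entry (n = 8 = K) is zero, matching the "Otherwise" branch of the definition.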

The Chapman–Kolmogorov equations governing the model (see Fig. 1) are given by

Fig. 1 Steady state transition diagram

$$ \left[ {\lambda_{0} + b_c + b_{1} + b_{2} + b_{{1p}} } \right]P_{0,1,1} = \mu_{1} P_{1,1,1} + r_{2} P_{0,1,0} + r_{1} P_{{0,p,1}} + r_c P_{0,0,0} + r_{1} P_{0,0,1} $$
(1)
$$ \left[ {\lambda_{n} + b_{c} + b_{1} + b_{2} + b_{1p} + \mu_{1} } \right]P_{n,1,1} = \lambda_{n - 1} P_{n - 1,1,1} + \mu_{1} P_{n + 1,1,1} + r_{2} P_{n,1,0} + r_{1} P_{n,0,1} + r_{c} P_{n,0,0} ,\quad 1 \le n \le M + S - m $$
(2)
$$ \left[ {b_{c} + b_{1} + b_{2} + b_{1p} + \mu_{1} } \right]P_{K,1,1} = \lambda_{K - 1} P_{K - 1,1,1} + r_{2} P_{K,1,0} + r_{1} P_{K,0,1} + r_{c} P_{K,0,0} $$
(3)
$$ \left[ {\lambda_{0} + b_{1} + r_{2} + b^{\prime}_{1p} } \right]P_{0,1,0} = \mu_{1} P_{1,1,0} + b_{2} P_{0,1,1} + r_{1} P_{0,p,0} + r_{1} P_{0,0,0} $$
(4)
$$ \left[ {\lambda_{n} + b_{1} + r_{2} + b^{\prime}_{1p} + \mu_{1} } \right]P_{n,1,0} = \lambda_{n - 1} P_{n - 1,1,0} + \mu_{1} P_{n + 1,1,0} + b_{2} P_{n,1,1} + r_{1} P_{n,p,0} + r_{1} P_{n,0,0} ,\quad 1 \le n \le M + S - m $$
(5)
$$ \left[ {b_{1} + b^{\prime}_{1p} + \mu_{1} + r_{2} } \right]P_{K,1,0} = \lambda_{K - 1} P_{K - 1,1,0} + b_{2} P_{K,1,1} + r_{1} P_{K,p,0} + r_{1} P_{K,0,0} $$
(6)
$$ \left[ {\lambda_{0} + b_{cp} + b_{1} + b_{2} + r_{1} } \right]P_{0,p,1} = \mu_{2} P_{1,p,1} + r_{2} P_{0,p,0} + b_{1p} P_{0,1,1} + r_{1} P_{0,0,1} + r_{cp} P_{0,0,0} $$
(7)
$$ \begin{aligned} \left[ {\lambda_{n} + b_{cp} + b_{1} + b_{2} + \mu_{2} + r_{1} } \right]P_{n,p,1} = & \lambda_{n - 1} P_{n - 1,p,1} + \mu_{2} P_{n + 1,p,1} + r_{2} P_{n,p,0} + b_{1p} P_{n,1,1} \\ & + r_{1} P_{n,0,1} + r_{cp} P_{n,0,0} ,\quad 1 \le n \le M + S - m \\ \end{aligned} $$
(8)
$$ \left[ {b_{cp} + b_{1} + b_{2} + \mu_{2} } \right]P_{K,p,1} = \lambda_{K - 1} P_{K - 1,p,1} + r_{2} P_{K,p,0} + b_{1p} P_{K,1,1} + r_{1} P_{K,0,1} + r_{cp} P_{K,0,0} $$
(9)
$$ \left[ {\lambda_{0} + r_{2} + r_{1} + b^{\prime}_{{1p}} } \right]P_{{0,p,0}} = \mu_{1} P_{{1,p,0}} + b_{2} P_{{0,p,1}} + b^{\prime}_{{1p}} P_{0,1,0} $$
(10)
$$ \left[ {\lambda_{n} + r_{2} + r_{1} + \mu_{1} + b^{\prime}_{{1p}} } \right]P_{{n,p,0}} = \lambda_{n - 1} P_{{n - 1,p,0}} + \mu_{1} P_{{n + 1,p,0}} + b_{2} P_{{n,p,1}} + b^{\prime}_{{1p}} P_{n,1,0} ,\quad 1 \le n \le M + S - m $$
(11)
$$ \left[ {r_{2} + b^{\prime}_{{1p}} + \mu_{1} + r_{1} } \right]P_{{K,p,0}} = \lambda_{K - 1} P_{{K - 1,p,0}} + b_{2} P_{{K,p,1}} + b^{\prime}_{{1p}} P_{K,1,0} $$
(12)
$$ \left[ {\lambda_{0} + r_{1} + b_{2} } \right]P_{0,0,1} = \mu_{2} P_{1,0,1} + b_{1} P_{{0,p,1}} + r_{2} P_{0,0,0} $$
(13)
$$ \left[ {\lambda_{n} + r_{1} + b_{2} + \mu_{2} } \right]P_{n,0,1} = \lambda_{n - 1} P_{n - 1,0,1} + \mu_{2} P_{n + 1,0,1} + b_{1} P_{{n,p,1}} + r_{2} P_{n,0,0} ,\quad 1 \le n \le M + S - m $$
(14)
$$ \left[ {r_{1} + b_{2} + \mu_{2} } \right]P_{K,0,1} = \lambda_{K - 1} P_{K - 1,0,1} + b_{1} P_{K,p,1} + r_{2} P_{K,0,0} $$
(15)
$$ \left( {\lambda_{0} + r_{1} + r_{2} + r_{c} + r_{cp} } \right)P_{0,0,0} = b^{\prime}_{1p} P_{0,p,0} + b_{1} P_{0,1,0} + b_{2} P_{0,0,1} + b_{c} P_{0,1,1} + b_{cp} P_{0,p,1} $$
(16)
$$ \left( {\lambda_{n} + r_{1} + r_{2} + r_{c} + r_{cp} } \right)P_{n,0,0} = \lambda_{n - 1} P_{n - 1,0,0} + b^{\prime}_{1p} P_{n,p,0} + b_{1} P_{n,1,0} + b_{2} P_{n,0,1} + b_{c} P_{n,1,1} + b_{cp} P_{n,p,1} ,\quad 1 \le n \le M + S - m $$
(17)
$$ \left( {r_{1} + r_{2} + r_{c} + r_{cp} } \right)P_{K,0,0} = \lambda_{K - 1} P_{K - 1,0,0} + b^{\prime}_{1p} P_{K,p,0} + b_{1} P_{K,1,0} + b_{2} P_{K,0,1} + b_{c} P_{K,1,1} + b_{cp} P_{K,p,1} $$
(18)

The steady-state difference equations (1)–(18) constructed above can be written in the matrix form AX = B, i.e. as a system of linear equations. This system has been solved numerically using the successive over-relaxation (SOR) method. The technique is an extrapolation of the Gauss–Seidel method that accelerates convergence by taking the relaxation parameter \( w > 1 \) (specifically \( w = 1.25 \)); the Gauss–Seidel method corresponds to \( w = 1 \).
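
To illustrate how SOR sweeps act on balance equations of this kind, the following sketch applies the method with \( w = 1.25 \) to a deliberately small single-server birth-death chain rather than to the full two-server model. The chain, its rates and the function names `birth_death_generator` and `sor_steady_state` are assumptions made for illustration; the sketch also renormalizes the probability vector after every sweep instead of forming AX = B explicitly, which is one common variant and not necessarily the scheme used for the results reported here.

```python
import numpy as np


def birth_death_generator(birth, death):
    """Generator matrix Q of a birth-death chain with states 0..len(birth)."""
    n = len(birth) + 1
    Q = np.zeros((n, n))
    for i, (lam, mu) in enumerate(zip(birth, death)):
        Q[i, i + 1] = lam  # failure: i -> i + 1
        Q[i + 1, i] = mu   # repair:  i + 1 -> i
    Q -= np.diag(Q.sum(axis=1))  # diagonal entries make each row sum to zero
    return Q


def sor_steady_state(Q, omega=1.25, tol=1e-12, max_iter=100_000):
    """Approximate pi with pi Q = 0 and sum(pi) = 1 using SOR sweeps."""
    n = Q.shape[0]
    pi = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        old = pi.copy()
        for i in range(n):
            # Balance equation of state i: -Q[i, i] * pi[i] = sum_{j != i} pi[j] * Q[j, i]
            inflow = pi @ Q[:, i] - pi[i] * Q[i, i]
            gauss_seidel = inflow / (-Q[i, i])
            # omega = 1 gives Gauss-Seidel; omega > 1 over-relaxes the update.
            pi[i] = (1.0 - omega) * pi[i] + omega * gauss_seidel
        pi /= pi.sum()  # enforce the normalizing condition after each sweep
        if np.max(np.abs(pi - old)) < tol:
            break
    return pi


# Hypothetical rates: failure rates 3, 2, 1 and a constant repair rate 2.
Q = birth_death_generator(birth=[3.0, 2.0, 1.0], death=[2.0, 2.0, 2.0])
pi = sor_steady_state(Q)
print(pi)
```

For a birth-death chain the result can be checked against the product-form solution, in which each probability is the previous one multiplied by the ratio of the failure rate to the repair rate.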

Some performance indices

To obtain an efficient machining system, designers and developers plan maintainability and redundancy on the basis of a performance analysis. For the performance prediction of a machining system, it is important to provide expressions for key indices, including the queue length. The queue length in the machine repair problem refers to the total number of failed machines waiting for repair in the queue, including those being repaired by the servers. We now provide explicit results, in terms of the steady-state probabilities, for some performance measures:

  • The expected number of failed machines in the queue is

    $$ E(n) = \sum\limits_{i = 0}^{M + S - m + 1} {i\left( {P_{i,1,1} + P_{i,1,0} + P_{i,p,1} + P_{i,p,0} + P_{i,0,1} } \right)} $$
    (19)
  • The probability that both servers are in working state is given by

    $$ P(w) = \sum\limits_{i = 0}^{M + S - m + 1} {P_{i,1,1} } $$
    (20)
  • The probability that both servers are in breakdown state is given by

    $$ P(b) = \sum\limits_{i = 0}^{M + S - m + 1} {P_{i,0,0} } $$
    (21)
  • The probability that the first server is in working state but secondary server is in breakdown state, is

    $$ P(s_{1} ) = \sum\limits_{i = 0}^{M + S - m + 1} {P_{i,1,0} } $$
    (22)
  • The probability that the secondary server is in working state but primary server is in breakdown state is

    $$ P(s_{2} ) = \sum\limits_{i = 0}^{M + S - m + 1} {P_{i,0,1} } $$
    (23)
  • Throughput is obtained as

    $$ T\left( p \right) = \mu_{2} \sum\limits_{i = 1}^{M + S - m + 1} {\left[ {P_{i,0,1} + P_{i,p,1} } \right]} + \mu_{1} \sum\limits_{i = 1}^{M + S - m + 1} {\left[ {P_{i,1,0} + P_{i,1,1} + P_{i,p,0} } \right]} $$
    (24)
  • The expected waiting time of failed units in the system is determined using Little’s formula:

    $$ E\left( W \right) \, = \, E\left( n \right)/\lambda_{\text{eff}} , $$
    (25)
    $$ {\text{where}}\;\lambda_{\text{eff}} = \sum\limits_{i = 0}^{M + S - m} {\lambda_{i} \left( {P_{i,1,1} + P_{i,1,0} + P_{{i,p,1}} + P_{{i,p,0}} + P_{i,0,1} } \right)}. $$
    (26)
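
Once the steady-state probabilities are available, the indices in Eqs. (19)–(26) reduce to weighted sums. The sketch below evaluates them for a probability array; the array layout (one row per number of failed units, one column per server state), the function name `performance_indices` and the uniform sample distribution are assumptions made purely for illustration and are not computed from the model.

```python
import numpy as np

# Assumed column order: (first server, second server) states.
STATES = ("11", "10", "p1", "p0", "01", "00")


def performance_indices(P, rates, mu1, mu2):
    """P has shape (K + 1, 6); rates[i] is lambda_i for i = 0..K-1."""
    col = {s: P[:, k] for k, s in enumerate(STATES)}
    i = np.arange(P.shape[0])
    # States entering Eqs. (19) and (26), as written in the text.
    active = col["11"] + col["10"] + col["p1"] + col["p0"] + col["01"]
    E_n = float(np.sum(i * active))                                   # Eq. (19)
    T = float(mu2 * np.sum(col["01"][1:] + col["p1"][1:])
              + mu1 * np.sum(col["10"][1:] + col["11"][1:] + col["p0"][1:]))  # Eq. (24)
    lam_eff = float(np.sum(np.asarray(rates) * active[:-1]))          # Eq. (26)
    return {
        "E_n": E_n,
        "P_w": float(col["11"].sum()),   # Eq. (20)
        "P_b": float(col["00"].sum()),   # Eq. (21)
        "P_s1": float(col["10"].sum()),  # Eq. (22)
        "P_s2": float(col["01"].sum()),  # Eq. (23)
        "T": T,
        "lam_eff": lam_eff,
        "E_W": E_n / lam_eff,            # Eq. (25), Little's formula
    }


# Hypothetical input: K = 2 and a uniform distribution over the 18 states.
P = np.full((3, 6), 1.0 / 18.0)
print(performance_indices(P, rates=[1.0, 0.5], mu1=2.0, mu2=1.0))
```

Keeping the server states as named columns makes it easy to see which probabilities enter each index and to compare the code term by term with the formulas.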

Numerical results

Numerical results provide a quantitative assessment of the performance indices and make it possible to explore the effect of different parameters on them. In this section, a sensitivity analysis is carried out to examine the trends of the system descriptors, as detailed below.

To compute the numerical results, we consider the power plant illustration described in the introduction. The power plant consists of M = 6 operating nuclear turbine generators and \( S_{1} = 2 \) and \( S_{2} = 3 \) standby gas turbine generators of type 1 and 2, respectively, having the same failure characteristics. The failure rate of the operating nuclear turbine generators is \( \lambda = 0.3 \), and the failure rates of the standby gas turbine generators of type 1 and 2 are \( \alpha_{1} = 0.9 \) and \( \alpha_{2} = 0.9 \), respectively. For computational purposes, the program has been coded in MATLAB with the other default parameters chosen as \( b_{1} = b_{2} = 0.5 \), \( \alpha_{c} = 0.0 \) and \( r_{1} = r_{2} = r = 1.3 \). The expected queue length E(n) against the failure rate (λ) of operating units, for varying number of operating units (M), minimum number of operating units (m), number of warm standbys (S), repair rate (r), failure rate of standbys (α) and server’s breakdown rate (b), is displayed in Fig. 2a–f, respectively.
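
As a quick arithmetic check, the default values above can be substituted into the first branch of the failure rate defined in “Steady state equations”. The common-cause rate \( \lambda_{c} \) is left symbolic here, since its default value is not listed explicitly among the parameters above:

$$ \lambda_{0} = M\lambda + S_{1} \alpha_{1} + S_{2} \alpha_{2} + \lambda_{c} = 6(0.3) + 2(0.9) + 3(0.9) + \lambda_{c} = 6.3 + \lambda_{c} $$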

Fig. 2 Queue length E(n) vs failure rate of operating units for different values of (a) M, (b) m, (c) S, (d) r, (e) \( \alpha \), (f) b

  1. Effect of failure rates (λ, α)

Figure 2a–f reveal the effect of λ on the queue length as the other parameters are varied. As λ increases, the queue length of failed units in the system increases. Figure 2e demonstrates the effect of the failure rate (α) of the standby units on the queue length; the increasing pattern of the queue length with respect to α matches our expectation.

  2. Effect of repair rate (r)

In Fig. 2d, the expected number of failed units in the system seems to decrease as we increase the repair rate (r). By improving the repair facility in terms of faster repair, one may improve the system availability as there will be reduction in the number of failed units in the system.

  3. Effect of number of operating units (M) and minimum required operating units (m)

In Fig. 2a and b, the effects of the number of operating units (M) and the minimum required number of operating units (m) on the queue length are shown. As expected, in both figures the queue length increases with M and m. This is because a system with more operating units contains more units that can fail, so the number of failed units increases. It is seen in Fig. 2a that the increment in the queue length with respect to M is more pronounced for higher values of λ, due to the increase in traffic load.

  4. Effect of number of warm standbys (S)

Figure 2c displays the effect of increasing the number of spares (S) on the queue length. The queue length increases slowly with S for lower values of λ; for higher values of λ, however, a more significant increment in the queue length is found. The adverse effect of S on the queue length is attributed to the increase in the total number of units in the system.

  5. Effect of server’s breakdown rate (b)

The adverse effect of server breakdown rate (b) on the queue length is clear from Fig. 2f where we notice the increasing trend of queue length with the increase in b. This shows that due to the server breakdown, the repair of failed units is adversely affected.

Overall, we conclude that

  • By increasing the number of units required for normal operation or least number of units required for operation, we see the increment in the queue length.

  • Frequent breakdown of the server also results in higher queue length; however, the queue size comes down by increasing the repair rate. These patterns tally with the realistic situations.

Conclusion

In many industries, operation can be interrupted by machine failures and by breakdowns in the repair facility. In this investigation, we have developed a finite population queueing model of a multi-component machine repair system in which individual component failures and common-cause failures may occur. For the smooth running of any machining system, it is essential to control system failure by employing a suitable repair and spare part support strategy. To cope with failures and to achieve high performance of the machining system, the provision of a repair crew with two dissimilar unreliable servers and of spare part support has been taken into consideration. The provision of a single type of spare unit is common to ensure smooth running of the system, but we have considered mixed warm standbys; the reason is that physical constraints such as volume, weight, cost and wait-space limit the provision of a single type of spare unit. The performance characteristics established for the system with two types of spares give insight into more versatile situations of real-time systems operating in a multi-component environment and subject to component failures, common-cause failures and server failures. The numerical results and the sensitivity analysis may help to visualize the effect of different parameters on the performance measures. The model can be further extended by including the concepts of group failure and switching failure.