Fault tolerant system with imperfect coverage, reboot and server vacation

This study is concerned with the performance modeling of a fault tolerant system consisting of operating units supported by a combination of warm and cold spares. The on-line as well as warm standby units are subject to failures and are send for the repair to a repair facility having single repairman which is prone to failure. If the failed unit is not detected, the system enters into an unsafe state from which it is cleared by the reboot and recovery action. The server is allowed to go for vacation if there is no failed unit present in the system. Markov model is developed to obtain the transient probabilities associated with the system states. Runge–Kutta method is used to evaluate the system state probabilities and queueing measures. To explore the sensitivity and cost associated with the system, numerical simulation is conducted.


Introduction
The performance of any fault tolerant system such as computer or communication system, power plant or transmission lines, manufacturing or production and many more, is highly affected by the failure of its operating units. The breakdowns of the units may cause not only loss of production but also increase in cost. To avoid this loss and extra cost, the organization or industry can make the provision of standbys and repair facility. In some cases, if the operating unit fails, the system remains operative and continues to perform its assigned job due to switching over of failed unit by the spare part. For the smooth functioning and to achieve desired availability of the concerned system, the concepts of redundancy and maintainability have drawn the attention of practitioners as well as researchers. To reduce the maintainability and operating cost, the provision of server vacation is a key feature which has been included in many queueing models to analyze the congestion problems in different contexts. The machine repair problems with server vacations have been investigated and applied extensively in many areas including the computer and communication networks, manufacturing and inventory control processes, transportation and service sectors, etc. In many queueing scenarios of machining systems, the server is allowed to go for vacation if the system is empty. A noticeable amount of recent past works on the server vacation concept has been done by many eminent researchers (cf. Doshi 1986). Gupta (1997) proposed several server vacation schemes such as multiple vacations, single vacations, hybrid multiple/single vacation for the machine interference problems. Machine repair models with different type of policies for the server vacation have been investigated and studied by a few researchers (cf. Ke and Wang 2007;Ke et al. 2011;Ke and Wu 2012). Jain and Upadhyaya (2009) investigated the performance of multi-component machining system with multiple vacations of the servers, multiple types of redundant components and operating under N-policy. They obtained the probabilities for the system states and various key metrics by implementing the matrix recursive method. The queue size distribution has been established by Kumar and Jain (2013) for both F-policy and N-policy by employing the recursive method. Recently, Jain et al. (2015) developed multi-component machine repair model with two heterogeneous servers and having the facility of warm standby units along with on-line units. They derived the steady state queue size distributions and other measures of performance with the implementation of successive over relaxation technique.
The timings of reboot operation may vary from a few seconds to long hours, depending upon the complexity of the machining system. In many industries, an extensive loss of production as well as cost occurs due to the failure of some malicious components if not tackled properly with the help of suitable mechanism. But in some practical situations, the fault handling device may prove inadequate to recover a fault perfectly. These types of situations are called as imperfect coverage. In literature, some research works can be found on the reliability analysis of the machine repair problems with imperfect fault coverage. In this context, we cite some recent works which are relevant to present investigation. The concept of the reboot was discussed by Trivedi (2008) for the analysis of some reliability models in his book on 'Probability and Statistics with Reliability, Queueing and Computer Science'. Ke et al. (2008) has done the performance analysis of a repairable system by including the features of detection, imperfect coverage and reboot. A statistical model for a standby system involving reboot, switch failure and unreliable repair was presented by Hsu et al. (2011). The queueing and reliability indices of the machine repair systems with imperfect coverage and reboot have been studied by Jain et al. (2012), Jain (2013), Jain and Gupta (2013), Wang et al. (2013), and many more. Further, Ke and Liu (2014) investigated the machine repair system with imperfect coverage incorporating the reboot delay concept by taking the illustration of gamma and exponential time distributions.
In the present scenario, we notice that most of the studies devoted to the queueing analysis of server vacation are restricted to the reliable server queueing model, but in real life, no server can be reliable as such the incorporation of an unreliable server concept will be helpful to portray the more versatile queueing scenarios. In this direction, a few researchers have contributed by developing Markovian models with unreliable and vacation servers. Ke et al. (2009) developed a Markovian model having a finite buffer of the multi-server queueing system in which the servers are unreliable and allowed to go for vacation on the basis of (d, c) vacation policy. Further, Ke et al. (2013) presented the queueing analysis of unreliable multi-repairmen machining system comprising of operating machines and warm spares by incorporating the concept of imperfect coverage and reboot action. They determined the queue size distributions by employing the matrix recursive approach.
Further, Jain et al. (2014) analyzed N-policy model for the multi-component machining system under the realistic features of imperfect coverage, reboot and unreliable server. Yang and Wu (2015) investigated the N-policy for M/M/1 working vacation queueing model by considering the server breakdowns. They employed the particle swarm optimization algorithm to optimize the cost function and determined the optimal parameters.
Among various available soft computing techniques, neuro-fuzzy technique which is a hybrid of fuzzy logic and neural network, has widely used for the performance of complex systems for which analytical formulae can not be framed. By using the feature of a neural network and fuzzy inference system, some researchers have developed the adaptive neuro-fuzzy inference system (ANFIS) controller for the performance analysis of various embedded systems in different frameworks (cf. Lin and Liu 2003;Yang and Zhao 2012). For the performance prediction of degraded multi-component system with standby switching failure, N-policy and multiple vacations, Jain and Kumar (2013) used ANFIS to match the soft computing based results with the results obtained numerically using successive over relaxation (SOR).
In the present investigation, we provide the performance indices of the machining system supported by a repair facility and mixed standby units operating under vacation policy. A few research papers on the machine repair problem with vacation policy in different frameworks have appeared in the past few years as mentioned earlier. But to the best of authors' knowledge, no research article explores the transient study of the machine repair problem combined with vacation, mixed standbys, imperfect coverage and server breakdown. Further, the implementation of ANFIS technique to match soft computing results with the analytic results makes our study to deal with complex dynamic behavior of the machining system in efficient computational manner by incorporating many realistic features. The remaining contents of the paper are structured in different sections. In section. ''Model description'', we provide the notations and assumptions to formulate Markov model. In section ''Governing equations'', Chapman-Kolmogorov equations are constructed for the transient state to develop Markov model which are further solved numerically based on the Runge-Kutta method. In section ''Performance indices'', some performance indices have been established explicitly by using the transient probabilities of the system states. In section ''Cost function'', the cost function is constructed. In section ''Numerical illustration'', numerical illustration and sensitivity analysis are provided. The outcomes of the numerical results are displayed in graphs and tables to explore the effect of system descriptors on the performance indices. Finally, the conclusion is drawn in section ''Conclusion''.

Model description
In this section, we develop a Markov queueing model by defining the appropriate transition rates of the concerned birth-death process for the performance analysis of fault tolerant system. The assumptions and notations used for developing the model are as follows: • The system consists of M operating and S n (1 B n B l) of nth type standby units. • The operating units are prone to failure and have the life time exponentially distributed with mean 1/k. When an operating unit fails, it is immediately backed up by an available standby unit. When a standby unit moves into an operating state, it has the same failure characteristic as that of an operating on-line unit. • The nth type standby unit may also fail; the life time of nth type unit follows the exponential distribution with mean 1/a n , (0 B a n B k). • When all the spares are used, the system operates in short (i.e., degraded) mode with at least m (\N) operating units and its failure rate becomes k d ([k). • The failed units are repaired by the repairman in the same order in which failure occur, i.e., repair is performed by following the FIFO service discipline. The repair of failed units is rendered by the server according to exponentially distribution with rate l. • The operating unit can be successfully recovered with probability c; the recovery time of operating unit is exponentially distributed with parameter r. • The server is allowed to go for vacation if there is no work load of repair job of failed units in the system and returns back from the vacation as soon as any unit fails.
To develop the model, Markov process for the three mutually exclusive system states, i.e., (1) operating state, (2) recovery state and (3) reboot state is taken into account. The transient probabilities of the failed units at time t for different states are defined for the scenarios when the server is in (1) vacation state, (2) busy state, and (3) broken down state. Let P i, j, k (t) denote the probability that there are i, (1 B i B L) failed units when the server is in jth (j = 0, 1, 2) state and the system is operating in kth (k = 0, 1, 2) mode. Here indices j = 0, 1, 2 when the server is on vacation, busy and broken down states, respectively. Also indices k = 0, 1, 2 represent the system status in operating state, recovery state and reboot state, respectively.
The failure rate of the operating units which depends upon the number of already failed units is given by: where S (l) = P n=1 l S (n) and k d is the degraded failure rate.
The failure rate of the standby units is given by: S y m y þ ðS ðxÞ À iÞm x ; S ðxÀ1Þ i\S ðxÞ ðS ðlÞ À iÞm l ; S ðlÀ1Þ i\S ðlÞ 0; S ðlÞ i\L:

Governing equations
For evaluating the probabilities associated with different server states, Kolmogorov Chapman equations have been constructed using the transition flow rates of birth death process specifying the Markov model. The state transition for in-flow and out-flow rates of specific model when M = 5, S 1 = 2, S 2 = 1, l = 2, m = 1 is shown in Fig. 1.
1. Server vacation state when j ¼ 0; k ¼ 0: As soon as the server becomes free when there is no job of repairing the failed machines in the system, it reaches to vacation state (0, 0, 0). In this case, the server is in vacation state and reaches to other state using appropriate transition rates. Now, we frame the Chapman-Kolmogorov equation for state (0, 0, 0) as follows: dP 0;0;0 ðtÞ dt ¼ Àðk 0 þ a 0 ÞP 0;0;0 ðtÞ þ lP 1;1;0 ðtÞ: The server returns back to busy state with rate hwhen a failed machine joins the system; in between, some more mahines may fail so that during vacation period, the system states may be (i, 0, 0), i = 1, 2,…, L.
(3)-(19) have been solved numerically using Runge-Kutta 4th order method, which is a powerful tool to solve the ordinary differential equations of first order. It is a good choice to employ this technique to solve the set of differential equations governing the system state probabilities. It is worth noting the R-K method is quite accurate, stable and easy to implement in comparison to other methods available to solve the differential equations. For the coding purpose, we have chosen this particular method here and MATLAB's 'ode45' function is exploited for the programming purpose.

Performance indices
To analyze the transient system behavior, we derive various performance indices using the probabilities which can be evaluated as described in previous section. The expressions for the expected number of failed units in the system, failure frequency of the system, availability of the server and the system state probabilities for the server being in different states and other performance metrics are established as follows: 1. The average number of failed units in the system at time t, is: i P i;j;1 ðtÞ þ P i;j;2 ðtÞ È É : 2. Failure frequency of the server at time t, is: 3. System availability of the at time t, is: 4. The transient probability that the system is in recovery state:

Cost function
The system designer may be interested to determine the proper combination of the spares and repairmen so as the total cost incurred on the system should be minimum. While making such decisions, it should be kept in mind that the organization should not bear the burden of excessive costs of keeping the spares and repairmen. To determine the optimal number of repairmen and spare machines, we construct the cost function by considering various cost elements involved in different activities. We denote the various cost elements incurred on different activities as follows: C V Cost per unit time of the server when he is on vacation; C V Cost per unit time of the server when he is in busy state; C H Holding cost of one failed unit per unit time in the system; C BD Cost of repairing of a broken down server per unit time Now we evaluate the total cost function by considering the above cost elements and respective performance measures as follows:

Numerical illustration
To reveal the practical applicability of the underlying model in real time machining system, we consider an illustration from flexible manufacturing systems where robots are used for the packing purpose. In normal functioning mode, the system requires five robots, however, the system can work in degraded mode, i.e., in short mode if there are less than five but at least one operating robot is in active state. The system has two types of standby robots which acts as back up unit in case of failure of any operating unit and immediately put in place of broken down robot. The failure rate of the operating robot is k = 0.1 robots per day. To maintain the desired level of availability, three standby robots having failure rate a 0 -= 0.7, a 1 = 0.4, and a 2 = 0.2 robots per day and one robot which cannot fail in active mode, are taken as warm and cold standby units, respectively. Whenever a robot fails, its failure is detected, diagnosed and recovered with probability c = 0.5 and recovery rate is assumed to be r = 0.8. If fault is not detected due to imperfect coverage, it is cleared by the reboot or reset operation with a rate b = 10 per day. To evaluate the performance results, coding of the program is done in MATLAB software. The subroutine ode45 is employed for solving the set of equations associated with the probabilities of different system states. For the computation of results, we fixed the various parameters as: The sensitivity analysis has been done to explore the effect of varying parameters on different performance indices. The numerical results displayed in the form of tables and graphs are quite easy to understand the behavior of system. The numerical results obtained have been displayed in graphs and tables.
The numerical results for various performance indices for varying values of different parameters are presented in Tables 1, 2, 3. The effects of different parameters are examined by displaying the numerical results for the availability and average number of failed units in the system in Figs. 2,3,4,5,6,7. From Figs. 2,3,4, 5, we observe that as time increases, the system availability initially decreases rapidly and then after becomes almost constant. In the case of curves of E{N(t)}plotted in Figs. 6, 7, sharp increment is noticed up to t = 20, then after it becomes asymptotically stable as time passes.
The trends of various performance indices for varying different parameters are as follows: 1. Effect of failure rate of operating unit and repairman (k, a): from Figs. 2 and 3, it is noted that the availability of the system decreases as k and a increase. From Table 1, it is clear that the average number of failed units increases as k increases. 2. Effect of reboot rate and service rate ðb; lÞ: Table 2 displays that the average number of failed units E{N(t)} exhibits the increasing trend as reboot rate b increases; on the contrary, the availability A(t) of the system decreases as reboot rate b increases. It is seen in Fig. 5 that as l increases, the availability A(t) of the system increases while E{N(t)} decreases. 3. Effect of repair rate of repairman and coverage factor ðb; cÞ: From Figs. 4 and 7, it is noticed that the availability A(t) of the system increases as repair rate (b) increases. The average number of failed machines     Neuro-fuzzy technique is applied to compute the performance indices of the fault tolerant MRP. The membership function of input parameters k, a and b are taken as trapezoidal function by taking the very low, low, average, high and very high values as depicted in Fig. 8. The numerical results by ANFIS approach have been computed by using the neuro-fuzzy tool in Matlab software. To facilitate the comparison of results obtained by Runge-Kutta method with neuro-fuzzy results, we plot the availability by both approaches in Figs. 9, 10, 11.
The sensitivity of availability obtained by R-K method is depicted by continuous line for varying parameters k, a and b. The numerical results obtained using neurofuzzy technique are shown by tick marks. From these figures, we notice that both R-K and ANFIS results are quite close to each other.

Conclusion
In this paper, we have investigated the performance of fault tolerant system supported by mixed standbys and repair facility. Some realistic concepts such as reboot, recovery, unreliable server and vacation are incorporated in order to develop generalized queueing model so as to analyze the queueing and reliability characteristics of many real time fault tolerant systems. The transient probabilities of the system states and other performance measures evaluated    using Runge-Kutta method can be easily implemented for the evaluation of cost function and to determine the optimal control parameters for many redundant systems such as embedded computer and communication systems, power plants, automated manufacturing systems, etc. The sensitivity analysis carried out may be helpful to the decision makers and system engineers for the further improvement of the concerned systems operating under machining environments. There is scope of further extension of present investigation by incorporating the correlated rates and/ or bulk failure and work is under process in that direction.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://crea tivecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.