1 Introduction

Among software reliability growth models (SRGMs), a large family of stochastic models based on the non-homogeneous Poisson process (NHPP), known as NHPP reliability models, has been widely used to track reliability improvement during software testing. Many existing NHPP models [126] are formulated through the fault intensity rate function and the mean value function \(m(t)\) within a controlled testing environment, and are used to estimate reliability metrics such as the number of residual faults, the failure rate, and the reliability of the software. Generally, these models are fitted to software testing data and then used to predict software failures and reliability in the field. In other words, the common underlying assumption of such models is that the operating environment and the development environment are about the same. In reality, the operating environments in the field are quite different, and their randomness affects software failures and software reliability in unpredictable ways.

Estimating software reliability in the field is important, yet a difficult task. Usually, software reliability models are applied to system test data with the hope of estimating the failure rate of the software in user environments. Teng and Pham [3] have discussed a generalized model that captures the uncertainty of the environments and its effects upon the software failure rate. Other researchers [8, 19–21, 24, 27] have also developed reliability and cost models incorporating both the testing phase and the operating phase of the software development cycle for estimating the reliability of software systems in the field. Software development is a very complex process, and there are still issues that have not yet been addressed. Testing coverage is one of these issues. Testing coverage [27] is a measure that enables software developers to evaluate the quality of the tested software and determine how much additional effort is needed to improve its reliability. Testing coverage can also provide customers with a quantitative confidence criterion when they plan to buy or use software products.

In this paper, we present two new software reliability models. The first, called the loglog fault-detection rate model, is an NHPP model whose fault-detection rate is based on a loglog distribution function. The second, called the testing coverage model with uncertainty of operating environments, is also an NHPP model; it incorporates the uncertainty of operating environments, with a testing coverage function that follows the loglog distribution. The explicit solutions of the mean value functions for these new models are derived in Sect. 2. Criteria for model comparison and a new method for selecting the best model, called normalized criteria distance (NCD), are discussed in Sect. 3. Model analysis and results are discussed in Sect. 4, where we illustrate the goodness-of-fit of the proposed models and compare them with several existing NHPP models on a set of software failure data using three common criteria: the mean square error, the predictive-ratio risk, and the predictive power. Section 5 concludes the paper with remarks.

2 Software reliability modeling

2.1 An NHPP loglog fault-detection rate model

Many existing NHPP models assume that failure intensity is proportional to the residual fault content. A general NHPP mean value function \(m(t)\) with time-dependent fault detection rate is given by [2]:

$$\begin{aligned} m(t)=N\left[ {1-\mathrm{e}^{-\int _0^t {h(x)\mathrm{d}x} }} \right] \end{aligned}$$
(1)

In this paper, we consider that the software fault-detection rate per unit of time, \(h(t)\), has a Vtub shape based on a loglog distribution function and is given by [2]:

$$\begin{aligned} h(t)=b\ln (a) t^{b-1} a^{t^{b}}\qquad \text{ for } a>1, \,b>0 \end{aligned}$$
(2)

It should be noted that the loglog distribution has a unique Vtub-shaped curve, while the Weibull distribution has a bathtub-shaped curve; the two are not the same. For the Vtub shape of the loglog distribution, after the infant mortality period the failure rate increases at a relatively low, but not constant, rate, and then rises as failures due to aging set in. For the bathtub shape, after the infant mortality period the useful life of the system begins, during which the system fails at a constant rate; this period is followed by a wear-out period, during which the failure rate starts to rise slowly and then increases with the onset of wear-out. Figure 1 describes the Vtub-shaped function \(h(t)\) for various values of parameter \(a\) where \(b = 0.489\).

From Eq. (2), we can obtain the expected number of software failures detected by time t using Eq. (1):

$$\begin{aligned} m(t)=N\left( {1-\mathrm{e}^{-\left( {a^{t^{b}}-1}\right) }}\right) \end{aligned}$$
(3)
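As a sketch of how Eqs. (1)–(3) fit together, the closed form in Eq. (3) can be checked numerically against the integral form of Eq. (1). The parameter values below are illustrative only, and \(b > 1\) is chosen so that \(h(t)\) is finite at \(t = 0\) (for \(b < 1\) the rate has an integrable singularity at the origin, which the trapezoidal rule used here does not handle):

```python
import math

def h(t, a, b):
    """Vtub-shaped fault-detection rate of Eq. (2): h(t) = b ln(a) t^(b-1) a^(t^b)."""
    return b * math.log(a) * t ** (b - 1) * a ** (t ** b)

def m_closed(t, N, a, b):
    """Closed-form mean value function of Eq. (3)."""
    return N * (1.0 - math.exp(-(a ** (t ** b) - 1.0)))

def m_integral(t, N, a, b, steps=100_000):
    """Eq. (1) with the cumulative hazard computed by the trapezoidal rule."""
    dt = t / steps
    integral = 0.5 * (h(0.0, a, b) + h(t, a, b))
    for i in range(1, steps):
        integral += h(i * dt, a, b)
    integral *= dt
    return N * (1.0 - math.exp(-integral))
```

With, e.g., \(N = 100\), \(a = 1.1\), \(b = 1.5\), the two evaluations of \(m(3)\) agree to several decimal places.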
Fig. 1 Fault-detection rate function \(h(t)\) for various values of \(a\) and \(b = 0.489\)

2.2 An NHPP testing coverage model with random environments

Testing coverage is important information for both software developers and customers of software products. Such information can be used by managers to determine how much additional effort is needed to improve the quality of software products.

A generalized mean value function \(m(t)\) based on the testing coverage function, subject to the uncertainty of operating environments, can be obtained by solving the following differential equation:

$$\begin{aligned} \frac{\mathrm{d}m(t)}{\mathrm{d}t}=\eta \frac{\frac{\partial c(t)}{\partial t}}{( {1-c(t)})}[N-m(t)] \end{aligned}$$
(4)

where \(c(t)\) represents the testing coverage and \(\eta \) is a random variable, with probability density function \(g\), that represents the uncertainty of the system detection rate in the operating environments. The closed-form solution for the function \(m(t)\), in terms of the random variable \(\eta \), with the initial condition \(m(0) = 0\), is given by:

$$\begin{aligned} m(t)=N\left[ {1-\mathrm{e}^{-\eta \int _0^t {\frac{c'(x)}{( {1-c(x)})}\mathrm{d}x} }} \right] \end{aligned}$$
(5)
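One intermediate step, implicit in Eq. (5), is worth noting: since \(c(0) = 0\), the integral has the closed form

$$\begin{aligned} \int _0^t {\frac{c'(x)}{1-c(x)}\mathrm{d}x} =\left[ {-\ln ( {1-c(x)})} \right] _0^t =-\ln ( {1-c(t)}) \end{aligned}$$

so Eq. (5) can equivalently be written as \(m(t)=N\left[ {1-( {1-c(t)})^{\eta }} \right] \).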

If we assume that the random variable \(\eta \) has a gamma distribution with parameters \(\alpha \) and \(\beta \) where the pdf of \(\eta \) is given by

$$\begin{aligned} g(x)=\frac{\beta ^\alpha x^{\alpha -1}\mathrm{e}^{-\beta x}}{\Gamma (\alpha )}\qquad \text{ for }\,~\alpha ,\, \beta >0;\, x\ge 0 \end{aligned}$$
(6)

then from Eq. (5), we can obtain [21]:

$$\begin{aligned} m(t)=N\left[ {1-\left( {\frac{\beta }{\beta +\int \nolimits _0^t {\frac{c'(s)}{1-c(s)}} \mathrm{d}s}}\right) ^\alpha } \right] \end{aligned}$$
(7)
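The step from Eq. (5) to Eq. (7) is an application of the Laplace transform of the gamma density in Eq. (6): writing \(I(t)=\int _0^t {\frac{c'(s)}{1-c(s)}\mathrm{d}s} \), the expectation of Eq. (5) over \(\eta \) involves

$$\begin{aligned} E_\eta \left[ {\mathrm{e}^{-\eta I(t)}} \right] =\int _0^\infty {\mathrm{e}^{-x I(t)}\frac{\beta ^\alpha x^{\alpha -1}\mathrm{e}^{-\beta x}}{\Gamma (\alpha )}\mathrm{d}x} =\left( {\frac{\beta }{\beta +I(t)}}\right) ^\alpha \end{aligned}$$

which yields Eq. (7) directly.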

In this paper, we assume that the testing coverage function has a loglog distribution [2] as follows:

$$\begin{aligned} c(t)=1-\mathrm{e}^{1-a^{t^b}} \quad \text{ for } a>1,\, b>0 \end{aligned}$$
(8)

Figures 2 and 3 describe the testing coverage function \(c(t)\) and the testing coverage rate \(c'(t)\) for various values of parameter \(a\) where \(b = 0.196\). We observe that, for a given value of \(b\), as parameter \(a\) increases the testing coverage function increases but the testing coverage rate decreases.

Fig. 2 Testing coverage function \(c(t)\) for various values of \(a\) and \(b = 0.196\)

Fig. 3 Testing coverage rate function \(c'(t)\) for various values of \(a\) and \(b = 0.196\)

Substituting the function \(c(t)\) of Eq. (8) into Eq. (7), we obtain the expected number of software failures detected by time \(t\) in random environments:

$$\begin{aligned} m(t)=N\left( {1-\left( {\frac{\beta }{\beta +a^{t^b}-1}}\right) ^\alpha }\right) \end{aligned}$$
(9)
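Eq. (9) can be sanity-checked by simulating the gamma-distributed environment factor \(\eta \) directly in Eq. (5) and averaging; the parameter values below are illustrative only:

```python
import math
import random

def m_closed(t, N, a, b, alpha, beta):
    """Eq. (9): gamma-averaged mean value function with loglog coverage."""
    integral = a ** (t ** b) - 1.0      # closed form of int_0^t c'(s)/(1-c(s)) ds
    return N * (1.0 - (beta / (beta + integral)) ** alpha)

def m_monte_carlo(t, N, a, b, alpha, beta, n=200_000, seed=1):
    """Eq. (5) averaged over eta ~ Gamma(shape=alpha, rate=beta) by simulation."""
    rng = random.Random(seed)
    integral = a ** (t ** b) - 1.0
    total = 0.0
    for _ in range(n):
        eta = rng.gammavariate(alpha, 1.0 / beta)   # stdlib uses scale = 1/rate
        total += 1.0 - math.exp(-eta * integral)
    return N * total / n
```

With, e.g., \(N = 100\), \(a = 1.2\), \(b = 0.5\), \(\alpha = 2\), \(\beta = 3\), the simulated value of \(m(10)\) agrees with the closed form to within Monte Carlo error.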

Table 1 summarizes the two proposed models and several existing well-known NHPP models with different mean value functions.

Table 1 Software reliability models

3 Normalized criteria distance method

Once the analytical expression for the mean value function \(m(t)\) is derived, the model parameters in the mean value function can be estimated with the help of Matlab programs based on the least squares estimation (LSE) method.

There are more than a dozen existing goodness-of-fit criteria. Different criteria measure different aspects of a software reliability model's performance, and no model is optimal with respect to all of them. This makes it difficult for developers and practitioners to select an appropriate model, if not the best one, from among existing SRGMs for a given application based on a set of criteria.

In this section, we discuss a new method, called NCD, for ranking and selecting the best model from among SRGMs based on a set of criteria taken together, with criteria weights \(w_{1}\), \(w_{2}\), ..., \(w_{d}\). Let \(s\) denote the number of software reliability models and \(d\) the number of criteria, and let \(C_{ij}\) represent the value of the \(j\)th criterion for the \(i\)th model, where \(i = 1, 2, \ldots , s\) and \(j = 1, 2, \ldots , d\).

The NCD value, \(D_{k}\), measures the distance of the normalized criteria from the origin for the \(k\)th model and is defined as follows [21]:

$$\begin{aligned} D_k =\sqrt{\sum \limits _{j=1}^d {\left( {\left( {\frac{C_{kj} }{\sum \nolimits _{i=1}^s {C_{ij} } }}\right) ^2w_j }\right) } }\qquad k=1, 2, \ldots , s \end{aligned}$$
(10)

where \(s\) and \(d\) are the total number of models and total number of criteria, respectively, and \(w_{j}\) denotes the weight of the criterion \(j\) for \(j = 1, 2, \ldots , d\).

Thus, a smaller NCD value \(D_{k}\) corresponds to a better rank. In Sect. 4, we use three common criteria, the mean square error, the predictive-ratio risk, and the predictive power, to illustrate the proposed NCD method.
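A minimal sketch of Eq. (10) in Python; the two-model, two-criterion input in the test is hypothetical, purely to illustrate the calculation:

```python
import math

def ncd(criteria, weights):
    """Normalized criteria distance of Eq. (10).

    criteria: s x d matrix, criteria[i][j] = value of criterion j for model i.
    weights:  length-d list of criterion weights w_j.
    Returns the list of D_k values; the smallest D_k marks the best-ranked model.
    """
    d = len(weights)
    # Column sums used to normalize each criterion across the s models.
    col_sums = [sum(row[j] for row in criteria) for j in range(d)]
    return [
        math.sqrt(sum((row[j] / col_sums[j]) ** 2 * weights[j] for j in range(d)))
        for row in criteria
    ]
```

For example, with equal weights \(w_{1} = w_{2} = 1\), a model that scores lower on both criteria receives the smaller \(D_{k}\) and therefore the better rank.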

4 Model analysis and results

4.1 Some existing criteria

As mentioned in Sect. 3, there are more than a dozen existing goodness-of-fit criteria. In this study, we briefly discuss three common criteria and use them to compare the models listed in Table 1: the mean square error, the predictive-ratio risk, and the predictive power.

The mean square error (MSE) measures the deviation between the predicted values and the actual observations and is defined as:

$$\begin{aligned} \mathrm{MSE}=\frac{\sum \nolimits _{i=1}^n {( {\hat{m}(t_i )-y_i })} ^2}{n-k} \end{aligned}$$
(11)

where \(n\) and \(k\) are the number of observations and number of parameters in the model, respectively.

The predictive-ratio risk (PRR) measures the distance of model estimates from the actual data against the model estimate, and is defined as [17]:

$$\begin{aligned} \mathrm{PRR}=\sum \limits _{i=1}^n {\left( {\frac{\hat{m}(t_i )-y_i }{\hat{m}(t_i )}}\right) ^2} \end{aligned}$$
(12)

where \(y_{i}\) is the total number of failures observed at time \(t_{i}\) according to the actual data and \(\hat{m}(t_i )\) is the estimated cumulative number of failures at time \(t_{i}\) for \(i =1, 2, {\ldots }, n\).

The predictive power (PP) measures the distance of the model estimates from the actual data against the actual data, and is defined as:

$$\begin{aligned} \mathrm{PP}=\sum \limits _{i=1}^n {\left( {\frac{\hat{m}(t_i )-y_i }{y_i }}\right) ^2} \end{aligned}$$
(13)

For all these three criteria—MSE, PRR, and PP—the smaller the value, the better the model fits, relative to other models run on the same data set.
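The three criteria of Eqs. (11)–(13) are straightforward to compute; below is a sketch, where the fitted values \(\hat{m}(t_i)\) and observations \(y_i\) passed in the test are hypothetical:

```python
def mse(m_hat, y, k):
    """Eq. (11): mean square error with n - k degrees of freedom,
    where k is the number of model parameters."""
    n = len(y)
    return sum((mh - yi) ** 2 for mh, yi in zip(m_hat, y)) / (n - k)

def prr(m_hat, y):
    """Eq. (12): squared deviations scaled by the model estimates."""
    return sum(((mh - yi) / mh) ** 2 for mh, yi in zip(m_hat, y))

def pp(m_hat, y):
    """Eq. (13): squared deviations scaled by the observed data."""
    return sum(((mh - yi) / yi) ** 2 for mh, yi in zip(m_hat, y))
```

Note that PRR penalizes overestimation (\(\hat{m}(t_i) > y_i\)) less severely than underestimation, since the deviation is divided by the larger estimate, whereas PP does the reverse.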

4.2 Software failure data

A set of system test data, referred to as the Phase 2 data set, was provided in [2, p. 149] and is given in Table 2. In this data set, the number of faults detected in each week of testing and the cumulative number of faults since the start of testing are recorded for each of 21 weeks. We perform the calculations of the LSE estimates and the other measures using Matlab programs.

Table 2 Phase 2 system test data [2]

4.3 Model results and comparison

Table 3 summarizes the estimated parameters for all ten models shown in Table 1, obtained using the least squares estimation (LSE) method, together with their criteria (MSE, PRR, and PP) values. The coordinates \(X\), \(Y\), and \(Z\) in Fig. 4 illustrate the MSE, PRR, and PP criteria values, respectively, of the models. From Table 3, we observe that model 10 has the smallest MSE value, model 9 the smallest PRR value, and model 8 the smallest PP value.

Fig. 4 A three-dimensional plot (\(X, Y, Z\)) representing the (MSE, PRR, PP) values when \(w_{1} = 0.3\), \(w_{2} = 100\), \(w_{3} = 0.1\)

Table 3 Model parameter estimation and comparison criteria

It is worth noting that although both the PRR and PP values of the proposed testing coverage model with uncertainty (model 10) are slightly larger than those of the dependent-parameter model (model 8), the MSE value of model 10 is significantly smaller. Similarly, comparing all the models on the PRR criterion, we find that the proposed loglog fault-detection rate model (model 9) provides the best fit, with the smallest PRR value.

As we can see from Table 3, the selection of the best model depends upon the modeling criteria. We now apply the proposed NCD method (Sect. 3) to rank all ten models from Table 3 based on the three goodness-of-fit criteria taken together: MSE, PRR, and PP.

The modeling comparison results for the case when all criteria weights are equal (i.e., \(w_{1} = w_{2} = w_{3} = 1\)) and when they are not (\(w_{1} = 0.3\), \(w_{2} = 100\), \(w_{3} = 0.1\)) are presented in Tables 4 and 5, respectively. In other words, using Eq. (10) and the criteria values given in Table 3, we obtain the NCD values and their corresponding rankings as shown in Table 4 for \(w_{j} = 1\), \(j = 1, 2, 3\); Table 5 shows the NCD values and their corresponding rankings when \(w_{1} = 0.3\), \(w_{2} = 100\), and \(w_{3} = 0.1\). In Fig. 4, the coordinates \(X\), \(Y\), and \(Z\) represent the MSE, PRR, and PP values of each model for the criteria weights \(w_{1} = 0.3\), \(w_{2} = 100\), and \(w_{3} = 0.1\). For the delayed S-shaped model (model 2), for example, \(X = 3.27\), \(Y = 44.27\), and \(Z = 1.43\) indicate its MSE, PRR, and PP values. Figure 5 illustrates the model ranking based on the NCD values given in Table 5 for the same criteria weights. For example, the coordinates (\(X = 10\), \(Y = 1\), \(Z = 0.03698\)) indicate, as shown in Table 5, that model 10 is ranked best (1st), with an NCD value of 0.03698.

Fig. 5 A three-dimensional plot of the model rankings and NCD values for \(w_{1} = 0.3\), \(w_{2} = 100\), \(w_{3} = 0.1\)

Table 4 Parameter estimation and model comparison when \(w_{j}=1\) for \(j =1,2,3\)
Table 5 Parameter estimation and model comparison when \(w_{1}=0.3\), \( w_{2} = 100\), \( w_{3} = 0.1\)

Based on this study, we conclude that the proposed testing coverage model (model 10) and the loglog fault-detection rate model (model 9) provide the best fit based on the MSE and PRR criteria, respectively. The NCD method in general is a simple and useful tool for model selection. Obviously, broader validation of this conclusion is needed, using other data sets as well as other comparison criteria.

5 Conclusion

We present two new software reliability models: one based on a loglog fault-detection rate function, and one based on a testing coverage function subject to the uncertainty of operating environments. The explicit mean value function solutions for the proposed models are presented. The estimated parameters of the proposed models and other NHPP models, together with their MSE, PRR, and PP values, are also discussed. We further discuss the NCD method for ranking models and selecting the best model from among SRGMs based on a set of criteria taken together. Example results show that the new models can provide the best fit based on the NCD method as well as the individual criteria studied. Obviously, broader validation of this conclusion is needed, using other data sets and considering other comparison criteria.