1 Introduction

Software testing plays an essential role in guaranteeing the behavior of software systems. However, modern software systems are becoming increasingly large-scale and complicated. Testing them is difficult because system behavior is affected by many factors, e.g., various input parameters and settings. Even worse, defects are sometimes caused by interactions between multiple factors rather than by a single factor, and testing all interactions of factors is impractical, especially for modern industrial software.

Combinatorial testing (Nie & Leung, 2011; Kuhn et al., 2013) is known as an efficient black-box testing technique for detecting test failures caused by interactions of multiple factors. Combinatorial testing uses only the minimum number of test cases needed to cover all combinations of input values that satisfy a particular criterion. The cost of software testing is reduced by focusing on defects caused by combinations of only a certain number of factors rather than on all input patterns.

However, combinatorial testing raises the problem of Faulty Interaction Localization (FIL). FIL is the process of identifying which combination of input parameter values induced the detected test failures. Identifying the minimum conditions that reproduce a test failure makes it easier to locate and repair the defect in the source code. However, it is not easy to identify such failure-inducing combinations from combinatorial testing results: test cases in combinatorial testing are designed to include many combinations, not to test specific behaviors. Therefore, practical FIL approaches are essential for effective combinatorial testing, and many have been studied (Nie & Leung, 2011; Zhang et al., 2012; Ghandehari et al., 2012; Zeller & Hildebrandt, 2002; Niu et al., 2020).

BEN (Ghandehari et al., 2012) is a powerful existing FIL method that analyzes combinatorial testing results to estimate which combination is most likely to induce a failure. For this purpose, BEN calculates a suspiciousness score for each possible combination with its own algorithm. However, BEN has two concerns: (1) the accuracy of its suspiciousness estimation is insufficient. BEN compensates by creating and running additional tests, but higher estimation accuracy would reduce the cost of those additional tests. (2) Analyzing large-size combinations may require an unrealistic amount of processing time due to combinatorial explosion.

This paper proposes two approaches, based on logistic regression analysis, that address these two concerns of the existing FIL approach. Our first approach, FROGa, addresses concern (1), and our second approach, FROGb, addresses concern (2). The name FROG stands for "FIL based on Regression coefficient of lOGistic regression."

FROGa aims to calculate the suspiciousness of failure-inducing combinations more accurately by using logistic regression analysis. FROGa first extracts all possible failure-inducing combinations of parameter values from the test results, then encodes the inclusion relationship between these combinations and each test case. Finally, FROGa feeds the encoded data into logistic regression to obtain the regression coefficients as the suspiciousness of each combination.

To evaluate the performance of FROGa, we implemented FROGa and BEN and applied them to the same experimental subjects. As experimental subjects, we artificially generated many combinatorial testing results to ensure the validity of the evaluation. One hundred twenty-six thousand artificial testing results were generated using the input parameter models of six real software systems and Pict, a combinatorial test generation tool.

According to our experiment, FROGa can identify the failure-inducing input values more accurately than BEN. The experiment used artificially generated combinatorial testing results on six real systems. From the experiments, we obtained answers to the following two research questions (RQs).

RQ1:

Does FROGa improve the accuracy of suspiciousness calculation compared to BEN? Yes. In particular, FROGa can significantly improve the accuracy of ranking the suspiciousness of combinations when targeting testing results with high coverage.

RQ2:

How much is the difference in the time cost between FROGa and BEN? There is little difference in time cost between FROGa and BEN.

Next, FROGb aims to estimate failure-inducing combinations directly by first estimating subsets of those combinations. FROGb uses logistic regression analysis to estimate suspicious input parameter values as subsets of a failure-inducing combination and then explores supersets of those values step by step. Using FROGb, only the most suspicious combinations are extracted without enumerating all possible combinations, thus avoiding combinatorial explosion.

According to our experiment with the same subjects as for FROGa, FROGb can significantly reduce the time cost of finding large-size failure-inducing combinations with high accuracy. More specifically, we obtained answers to the following two research questions.

RQ3:

How much does FROGb reduce the time of extracting highly suspicious combinations compared to BEN and FROGa? FROGb can significantly reduce the time compared to BEN and FROGa when targeting testing results with high coverage.

RQ4:

How accurately can FROGb extract failure-inducing combinations? FROGb can extract all failure-inducing combinations in 63.3% of all cases and at least one of them in 95.6% of all cases.

The idea behind FROGb was introduced in our previous work (Nishiura et al., 2017), but this paper provides more detailed modeling of the approach. Unlike the earlier study, which examined only a small number of subjects, this paper explores new experimental results, including trends due to subject characteristics and limitations of defect interaction extraction. These results are novel contributions. Both FROGa and FROGb follow the same philosophy of using logistic regression analysis for fault localization, and we present them in parallel to aid mutual understanding.

The rest of this paper is organized as follows: Section 2 gives background and definitions. Section 3 introduces the first proposal, FROGa, and Section 4 evaluates it. Section 5 introduces the second proposal, FROGb, and Section 6 evaluates it. Section 7 discusses threats to validity. Section 8 introduces related work. Finally, Section 9 concludes this paper.

2 Background

2.1 Combinatorial testing

Combinatorial testing is a black-box testing technique that focuses on combinations of multiple input parameter values (Nie & Leung, 2011; Kuhn et al., 2013). Its primary purpose is to efficiently detect failures caused by the interaction of such multiple input parameter values. For this purpose, the combinatorial testing methodology designs a test suite with as few test cases as possible while covering all combinations of input parameter values up to a certain size. A test suite designed in this way is also referred to as a covering array. A survey (Kuhn & Reilly, 2003) reported that the maximum number of input values whose combination can induce a failure is between four and six. Therefore, combinatorial testing, which focuses on only small combinations, is practical.

A test that attempts to detect failures caused by the interaction of t or fewer input values, using a covering array that covers all combinations of t or fewer input parameter values, is referred to as t-way testing. In t-way testing, all combinations of t or fewer parameter values are tested at least once with a minimum number of test cases. The value of t in t-way testing is referred to as the combinatorial coverage in this paper.

We now formally define some notions related to combinatorial testing. First, the input model of the System Under Test (SUT) for combinatorial testing is described in terms of parameters, their possible values, and the constraints between the parameter values.

Definition 1

(SUT) An SUT model for combinatorial testing is \(\langle P, V, \phi \rangle\), where

  • P is a set of parameters \(p_1,\ldots ,p_{n}\) with \(n=|P|\),

  • V is a collection of \(V_i\) (\(1\le i\le n\)) that is a set of values assigned to a parameter \(p_i\), and

  • \(\phi\) is a set of constraints on combinations of parameter values.

A test case is a tuple that assigns to each parameter a value without violating the SUT constraints.

Definition 2

(Test case) Given an SUT \(\langle P, V, \phi \rangle\), a test case is an n-tuple (\(v_1,\ldots ,v_n\)) with \(n=|P|\) and \(v_i\in V_i\) (\(1\le i\le n\)) that does not violate \(\phi\). A test suite is a set of test cases.

A schema is a formal representation of a combination of input parameter values. This definition was originally given in a previous study (Nie & Leung, 2011) and is also used in a recent study (Niu et al., 2020).

Definition 3

(Schema) For the SUT, the n-tuple \((-,v_{n_1},...,\) \(v_{n_k},...)\) is referred to as a k-degree schema \((0 < k \le n)\) when some k parameters have fixed values and the other, irrelevant parameters are represented as “−”.

Definition 4

(Sub-schema and Super-schema) Let \(c_l\) be an l-degree schema and \(c_m\) be an m-degree schema in the SUT with \(l < m\). If all the fixed parameter values in \(c_l\) are also fixed in \(c_m\), then \(c_m\) subsumes \(c_l\). In this case, we also say that \(c_l\) is a sub-schema of \(c_m\) and \(c_m\) is a super-schema of \(c_l\), denoted as \(c_l \prec c_m\). For example, the 2-degree schema (-, 4, 4, -) is a sub-schema of the 3-degree schema (-, 4, 4, 5), that is, (-, 4, 4, -) \(\prec\) (-, 4, 4, 5).
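As an illustration, the sub-schema relation can be sketched in a few lines of Python. The representation (a tuple with `None` standing for “−”) and the helper names are our own, not from the paper's implementation.

```python
def degree(schema):
    """Number of fixed (non-'-') parameter values in a schema."""
    return sum(v is not None for v in schema)

def is_sub_schema(c_l, c_m):
    """True if c_l has strictly smaller degree and every fixed value of
    c_l is also fixed (with the same value) in c_m, i.e. c_l < c_m."""
    return degree(c_l) < degree(c_m) and all(
        a is None or a == b for a, b in zip(c_l, c_m))

# Example from Definition 4: (-, 4, 4, -) is a sub-schema of (-, 4, 4, 5).
print(is_sub_schema((None, 4, 4, None), (None, 4, 4, 5)))  # True
```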

Table 1 Example SUT: An example SUT model
Table 2 Example Test Result: An example of 3-way test suite and its testing result of Example SUT

Example 1

Table 1 shows an example SUT model. This system has five configurable parameters: CPU, Network, DBMS, OS, and Browser. The first three parameters have two possible values, and the remaining two parameters have three possible values. There is a constraint that we must use CPU \(\ne\) AMD when OS = Mac, and OS = Win must be used when Browser = IE. In other words, the combinations of the parameter values (Mac, AMD), (Linux, IE), and (Mac, IE) are not allowed.

Table 2 shows the covering array that realizes the 3-way test of the SUT shown in Table 1 (excluding the "Result" column). This test suite consists of 20 test cases and includes each of the 115 possible 3-degree schemas, (Intel, Wifi, MySQL, -, -), \(\ldots\), (-, -, Sybase, Mac, Chrome), at least once while satisfying the constraints. On the other hand, there are 54 possible full assignments of these parameter values under the constraints, so 54 test cases would be needed to cover all of them. Therefore, combinatorial testing can significantly reduce the number of test cases as a tradeoff against the coverage criterion.

2.2 Faulty interaction localization

Faulty Interaction Localization (FIL) is the process of identifying combinations of input parameter values that induce faulty behavior based on the results of combinatorial tests. Developers can quickly identify and fix faulty components by identifying the testing input values that are the minimum requirements to reproduce the faulty behavior.

The purpose of FIL is to identify, from given combinatorial testing results, the minimal schemas that induce failures. We refer to such a schema as a failure-inducing schema.

Definition 5

(Failure-inducing schema) A schema s is referred to as a minimal failure-inducing schema if every test case including s causes a particular failure and no sub-schema of s causes that failure. In this paper, we refer to it simply as a failure-inducing schema.

Furthermore, some FIL methods first extract all possibly failure-inducing schemas from the whole combinatorial testing result and then narrow down the failure-inducing schemas among them. Such a schema that may be failure-inducing is defined as a candidate schema.

Definition 6

(Candidate schema) Given a test suite T and a test oracle \(R: T\rightarrow \{\textsf {pass},\textsf {fail} \}\), a candidate schema is a schema s such that

  • there is a test case \(tc\) in T that includes s, and

  • every test case \(tc\) in T that includes s satisfies \(R( tc )=\textsf {fail}\).
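Under these two conditions, candidate schemas can be extracted by brute force, as in the following sketch. This is our own illustration (a test case is a plain tuple, a schema uses `None` for “−”, and the function names are hypothetical), not the paper's implementation.

```python
from itertools import combinations

def schemas_of(test_case, k):
    """Yield all schemas of degree <= k contained in a test case."""
    n = len(test_case)
    for d in range(1, k + 1):
        for idx in combinations(range(n), d):
            yield tuple(test_case[i] if i in idx else None for i in range(n))

def candidate_schemas(tests, results, k):
    """Schemas of degree <= k that appear in some failed test case
    and in no passed test case (Definition 6)."""
    failed = [t for t, r in zip(tests, results) if r == 'fail']
    passed = [t for t, r in zip(tests, results) if r == 'pass']
    cands = set()
    for t in failed:
        cands.update(schemas_of(t, k))
    for t in passed:           # drop schemas also seen in a passed test
        cands.difference_update(schemas_of(t, k))
    return cands
```

For a toy two-parameter suite where every test with the first value 1 fails, this returns exactly the schemas that occur only in failed tests, such as (1, −).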

Table 3 Example Candidate Schemas: The candidate schemas extracted from the Example Test Result

Example 2

In the example system shown in Table 1, a failure occurs when CPU = Intel and OS = Linux. Also, another failure occurs when CPU = AMD, Network = Wifi and DBMS = Sybase. Therefore, two schemas, the 2-degree schema (1, -, -, 2, -) and the 3-degree schema (-, 2, 1, -, 1), are failure-inducing schemas.

Executing the series of test cases shown in Table 2, the four test cases #8, #11, #12, and #20, each of which includes a failure-inducing schema, fail. Therefore, the test results shown in the Result column of Table 2 are obtained. According to the definition of a candidate schema, extracting all candidate schemas of degree three or less yields the 11 candidate schemas shown in Table 3. The candidate schemas are numbered \(s_1\) to \(s_{11}\) for later explanation, and the underlined ones are failure-inducing schemas.

2.3 Logistic regression analysis

Logistic regression is a well-known statistical model for the regression of a binary dependent variable and is mainly used as a supervised machine learning method Cramer (2002). In the field of software engineering, this model is often used for classifying fault-prone modules with software metrics (Basili et al., 1996; Briand et al., 2002; Kamei et al., 2016).

The logistic model is represented by the following equation:

$$\begin{aligned} Pr(Y=1|x_1, \cdots , x_n) = \frac{1}{1+e^{-({b_0}+{b_1x_1}+ \cdots +{b_nx_n})}} \end{aligned}$$

where \(x_1, \dots , x_n\) are the independent variables, \(b_1, \dots , b_n\) are their coefficients, \(b_0\) is the intercept, Y is the binomial dependent variable that takes 0 or 1, and Pr is the conditional probability that \(Y=1\) given the values of \(x_1, \dots , x_n\). In logistic regression, the regression equation is obtained by calculating the values of \(b_1, \dots , b_n\), i.e., the regression coefficients, from the training data so that the error between the conditional probability and the correct value is minimized. When this model is used as a classifier, the test data are substituted into the regression equation, the conditional probability that the dependent variable is positive is calculated, and the classification result is determined according to a threshold.
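As a concrete illustration of fitting this model and reading off its intercept and coefficients, here is a minimal sketch using scikit-learn (the library the paper also uses); the toy data are ours.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: x1 perfectly predicts Y, x2 is essentially noise.
X = np.array([[1, 0], [1, 1], [1, 0], [0, 1], [0, 0], [0, 1]])
y = np.array([1, 1, 1, 0, 0, 0])

model = LogisticRegression().fit(X, y)  # default (L2-regularized) settings

b0 = model.intercept_[0]   # intercept b_0
b1, b2 = model.coef_[0]    # regression coefficients b_1, b_2
# b1 comes out clearly larger than b2, reflecting that x1 drives Y.
```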

Logistic regression is also used as a data analysis method, and that is how we use it in this study. Logistic regression analysis is based on the fact that a regression coefficient can be viewed as the degree to which the probability Pr is affected by a unit increase in the corresponding independent variable (Peng et al., 2002). A large regression coefficient indicates that a change in the value of that independent variable has a large impact on the probability Pr. Therefore, a regression coefficient can be regarded as the importance of the independent variable for the dependent variable being positive.

3 Improved FIL method using logistic regression

3.1 Concept of FROGa

In this section, we propose FROGa as an improved FIL method. FROGa is based on the existing FIL method BEN, which calculates and ranks the suspiciousness of candidate schemas using its own algorithm with several software metrics. In contrast, FROGa attempts to rank candidate schemas more accurately by using logistic regression analysis.

We now explain why logistic regression analysis can calculate the suspiciousness of a candidate schema. Recall that a regression coefficient can be regarded as the importance of the independent variable for the dependent variable being positive. Let each independent variable be binary (\(1=\) included, \(0=\) not included), representing whether a candidate schema is included in a test case, and let the dependent variable be binary (\(1=\) failed, \(0=\) passed), representing the test result. In this case, the regression coefficient represents how strongly changing the corresponding independent variable from 0 to 1, i.e., from not including the schema in a test case to including it, affects the probability that the test case fails. Therefore, a schema with a large regression coefficient is likely to be failure-inducing, and the regression coefficient is expected to quantify the magnitude of suspiciousness.

3.2 Model

FROGa needs three inputs: a test suite, the test result of each test case, and k, the assumed maximum degree of a failure-inducing schema. The test results must be classified into pass or fail by some test oracle. Moreover, k is the degree of schema that the given combinatorial test suite covers.

First, FROGa extracts all candidate schemas whose degree is less than or equal to k from the test suite and the results. Let \(S_k\) be the set of these candidate schemas.

Next, FROGa creates a data table \(\Phi\) representing the inclusion relationship between every candidate schema in \(S_k\) and every test case. Define a function inc(sc, st), which indicates whether a candidate schema sc is included in a test case st, as follows.

$$\begin{aligned} inc(sc, st) = \left\{ \begin{array}{ll} 1 &{} (sc \prec st) \\ 0 &{} (sc \not \prec st). \end{array} \right. \end{aligned}$$

With the function inc(sc, st), the data table \(\Phi\) is expressed as follows,

$$\begin{aligned} \Phi = \left[ \begin{array}{ccc} inc(sc_1, st_1) &{} \ldots &{} inc(sc_m, st_1) \\ \vdots &{} \ddots &{} \vdots \\ inc(sc_1, st_n) &{} \ldots &{} inc(sc_m, st_n) \end{array} \right] \end{aligned}$$

where m is the number of extracted candidate schemas, i.e., \(|S_k|\), and n is the number of test cases in the test suite. The columns of \(\Phi\) correspond to the candidate schemas, the rows correspond to the test cases, and \(\Phi\) can be seen as the encoded inclusion relations between them.

Moreover, FROGa creates a pseudo-Boolean vector R that represents the test results for each test case. When the test suite consists of n test cases \(st_1,\dots ,st_n\), with the function \(result(st)\rightarrow \{0,1\}\) which encodes the test result of st as pass = 0 and fail = 1, R is expressed as follows.

$$\begin{aligned} R = \left[ result(st_1),\dots ,result(st_n) \right] ^T \end{aligned}$$

Then, FROGa runs a logistic regression with every column vector of \(\Phi\) as the independent variable and R as the dependent variable. As a result, the regression coefficients for each corresponding candidate schema are obtained.

Finally, FROGa ranks the candidate schemas by the obtained regression coefficient as the magnitude of suspiciousness. The larger the regression coefficient, the higher the possibility that a candidate schema is failure-inducing.
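The steps above can be condensed into the following sketch (Python with scikit-learn). This is our own illustration, not the authors' implementation: it takes the candidate schemas as input rather than extracting them, and the helper names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def inc(sc, st):
    """1 if candidate schema sc (None = '-') is included in test case st, else 0."""
    return int(all(v is None or v == st[i] for i, v in enumerate(sc)))

def rank_candidates(candidates, tests, results):
    """Rank candidate schemas by their logistic regression coefficients."""
    # Data table Phi: one column per candidate schema, one row per test case.
    Phi = np.array([[inc(sc, st) for sc in candidates] for st in tests])
    # Dependent variable R: 1 = fail, 0 = pass.
    R = np.array([1 if r == 'fail' else 0 for r in results])
    coef = LogisticRegression().fit(Phi, R).coef_[0]
    order = np.argsort(-coef)  # larger coefficient = more suspicious
    return [(candidates[i], coef[i]) for i in order]
```

For a toy suite {(1,1), (1,2), (2,1), (2,2)} in which every test containing the first value 1 fails, the schema (1, −) is expected to receive the largest coefficient and rank first.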

3.3 Simplification

Next, we describe a simplification of the encoding in FROGa. In summary, the row vectors in \(\Phi\) and R corresponding to the passed test cases can be omitted.

According to the definition of a candidate schema, candidate schemas are not included in any passed test case. Therefore, the row vectors in \(\Phi\) and the entries of R that correspond to the passed test cases are always all-zero, i.e., identical to each other. In logistic regression, such duplicated data does not affect the update of the regression equation, so all rows corresponding to passed test cases except one can be deleted. Therefore, FROGa can use the simplified \(\Phi '\) and \(R'\) instead of \(\Phi\) and R as follows.

$$\begin{aligned} \Phi ' = \left[ \begin{array}{ccc} inc(sc_1, st_1) &{} \ldots &{} inc(sc_m, st_1) \\ \vdots &{} \ddots &{} \vdots \\ inc(sc_1, st_f) &{} \ldots &{} inc(sc_m, st_f) \\ 0 &{} \dots &{} 0 \end{array} \right] \end{aligned}$$
$$\begin{aligned} R' = \left[ 1,\dots ,1, 0 \right] ^T \end{aligned}$$

where F is the set of failed test cases, and \(f=|F|\). \(\Phi '\) is a data table with m columns and \(f+1\) rows, and \(R'\) is a column vector with \(f+1\) dimensions.
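This simplification can be sketched as follows (our own helper, operating on \(\Phi\) and R as plain Python lists):

```python
def simplify(Phi, R):
    """Keep the rows of failed test cases and a single representative
    all-zero row standing in for every passed test case."""
    failed_rows = [row for row, r in zip(Phi, R) if r == 1]
    Phi_s = failed_rows + [[0] * len(Phi[0])]
    R_s = [1] * len(failed_rows) + [0]
    return Phi_s, R_s
```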

3.4 Application example of FROGa

For a better understanding of FROGa, we show an application example. Table 4 represents the \(\Phi\) and R created from the Example Test Result (Table 2) and the Example Candidate Schemas (Table 3). In addition, Table 5 represents their simplified versions \(\Phi '\) and \(R'\). Table 6 shows the regression coefficients and ranks of the candidate schemas obtained by running logistic regression on \(\Phi '\) and \(R'\). In this example, we used the logistic regression implemented in scikit-learn (scikit-learn) with all optional parameters at their defaults. Both underlined failure-inducing schemas have the highest regression coefficients and are at the top of the ranking, which shows that FROGa works well.

Table 4 Example \(\Phi\) and R: The \(\Phi\) which represents the inclusion relationship of each Example Candidate Schema (\(s_1 \sim s_{11}\)) in each test case (1 = included , 0 = not included), and The R which represents the test results of each test case (1 = fail, 0 = pass), created from Example Test Result
Table 5 Example \(\Phi '\) and \(R'\): The simplified Example \(\Phi\) and R
Table 6 The logistic regression coefficients of each Example Candidate Schema and their ranks calculated from Example \(\Phi '\) and \(R'\)

4 Evaluation of FROGa

4.1 Research questions

We set the following research questions to evaluate the efficacy of FROGa.

RQ1.:

Does FROGa improve the accuracy of suspiciousness calculation in FIL compared to BEN?

RQ2.:

How much is the difference in the time cost between FROGa and BEN?

In order to answer these research questions, we applied FROGa and BEN to several combinatorial testing results and compared their accuracy and processing time.

4.2 Experimental subjects

We used artificially created combinatorial testing results for the experiments rather than actually reported ones, because we could not find any bug reports containing both defects detected by combinatorial testing and the test suites used. Furthermore, we believe that using only a few experimental subjects would not yield valid results even if the subjects were real. Therefore, we generated many artificial test results by assuming several defects induced by several failure-inducing schemas in real software systems. We believe that the use of artificially generated test results has little impact on validity because real-world failure-inducing schemas are included among the theoretical failure-inducing schemas we used.

Table 7 benchmark SUT models
Table 8 The number of test cases in each t-way test suite generated by Pict

Six real SUTs were used in the experiment as combinatorial testing targets. The parameter and constraint sizes of each SUT are given in Table 7. The parameter size of an SUT is expressed as \(|P|; g_1^{k_1} g_2^{k_2} \dots g_n^{k_n}\), meaning \(k_i\) parameters with \(g_i\) values each, where |P| is the number of parameters. The size of the constraints is expressed as \(l; h_1^{l_1} h_2^{l_2} \dots h_m^{l_m}\), meaning \(h_j\) clauses with \(l_j\) literals each, over l variables. The four SUTs SystemMgmt, Storage3, ProcesserComm2, and Healthcare4 are specific versions of IBM product programs. Moreover, SPINS and SPINV are the simulator and the verifier of SPIN, an open-source model checking tool. These SUT models were randomly selected from the models published in Cohen et al. (2008), covering a broad range of input sizes.

In addition, we used Pict (Pict) to design test suites. Pict is a well-known covering array generation tool provided by Microsoft. The inputs of Pict are the SUT model and the value of combinatorial coverage.

The artificial combinatorial testing results were generated through the following three steps.

  1. Test suite generation: We created a t-way test suite \((t = 2, 3, 4)\) for each SUT with Pict. As a result, a total of 18 test suites were generated from the combinations of six SUTs and three combinatorial coverages. Table 8 shows the number of test cases in each test suite.

  2. Failure-inducing schema generation: We randomly determined the failure-inducing schema patterns assumed for each test suite. First, the number of failure-inducing schemas is randomly chosen between 1 and 3. Then, the degree of each failure-inducing schema is randomly chosen between 2 and t (for t-way tests); this upper limit ensures that every failure-inducing schema is covered by the test suite. Next, the parameter values assigned to the failure-inducing schemas are chosen randomly. We prepared 10,000 patterns of failure-inducing schemas for each test suite by iterating these operations without duplication.

  3. Testing result generation: The combinatorial testing results are derived from these failure-inducing schemas: a test case fails if it includes at least one of the determined failure-inducing schemas, and passes otherwise. In the end, 10,000 combinatorial testing results were obtained for each test suite for 2-way and 3-way testing, but only 1,000 for 4-way testing due to the time constraints of the experiment.
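The result-generation step above can be sketched in a few lines (our own illustration in Python; a schema uses `None` for “−”):

```python
def includes(st, sc):
    """True if test case st contains schema sc (None = '-')."""
    return all(v is None or v == st[i] for i, v in enumerate(sc))

def generate_results(tests, failure_schemas):
    """A test case fails iff it includes at least one failure-inducing schema."""
    return ['fail' if any(includes(st, sc) for sc in failure_schemas) else 'pass'
            for st in tests]
```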

A previous study (Kuhn & Reilly, 2003) suggests that almost all bugs can be found by combinatorial testing with strengths from 4 to 6. Thus, it is common to set the maximum strength between 4 and 6; in our experiment, we chose 4. Setting the strength to 5 or higher would create too many test cases, so we omitted it due to time constraints.

4.3 Experimental environment

The algorithms for BEN and FROGa were implemented in Python 3.5.1 and run on a MacBook Pro 2017. To implement logistic regression in FROGa, we used scikit-learn (scikit-learn), an open-source Python library for machine learning, with all optional parameters of logistic regression at their defaults. The timeout criterion of the FIL process for a single combinatorial testing result is 3,600 seconds. FROGa encoded the data using \(\Phi '\) and \(R'\) instead of \(\Phi\) and R.

4.4 Evaluation metrics

In order to answer RQ1, we used two rank-based accuracy evaluation metrics, MAP and top-k% accuracy. In order to answer RQ2, we measured processing time.

The Mean Average Precision (MAP) is a metric that evaluates the accuracy of a system’s ranking ability (Manning et al., 2008). MAP takes a value from 0 to 1; the higher the value, the more accurate the ranking. This metric is often used to evaluate information retrieval systems, which are expected to return search results arranged in order of relevance to a given query, such as search keywords. In this experiment, an input combinatorial testing result is treated as a query, and the ranked candidate schemas are treated as search results. To compute the MAP, \(AveP_q\) (Average Precision) is first calculated for each query q in the query set Q by the following formula:

$$\begin{aligned} AveP_{q} = \frac{1}{|R_q|} \sum ^{|R_q|}_{i=1} prec@n_{q_i}, \end{aligned}$$

where \(|R_q|\) is the number of search results correctly related to the query q, i.e., the number of failure-inducing schemas. In addition, \(prec@n_{q_i}\) is the percentage of search results correctly related to q among the top \(n_{q_i}\) results, where \(n_{q_i}\) is the rank of the i-th relevant result. After that, MAP is calculated as the mean of AveP over all queries \(q \in Q\) as follows:

$$\begin{aligned} MAP = \frac{1}{|Q|} \sum ^{|Q|}_{i=1} AveP_{q_i} \end{aligned}$$

Note that MAP evaluates the ranking ability of a FIL method over many inputs, while AveP evaluates the accuracy of a single ranking result.
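The two formulas above amount to the following short sketch (our own code; a ranking is a list of schemas and the relevant set contains the true failure-inducing schemas):

```python
def average_precision(ranked, relevant):
    """AveP: mean of precision@rank taken at the ranks of the relevant items."""
    hits, total = 0, 0.0
    for n, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / n   # prec@n at the rank of the i-th relevant item
    return total / len(relevant)

def mean_average_precision(runs):
    """MAP: mean of AveP over all (ranking, relevant set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

For example, `average_precision(['a', 'b', 'c'], {'a', 'c'})` is \((1/1 + 2/3)/2 = 5/6\).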

Next, top-k% accuracy is a rank-based metric we define based on top-k accuracy. Top-k accuracy is the success rate over multiple trials, where a trial succeeds if all failure-inducing schemas are included in the top k of all ranked candidate schemas. However, we cannot use this metric directly because the number of candidate schemas differs between trials in this experiment. Therefore, we define top-k% accuracy, which uses the top k% instead of the top k. It is formulated as follows:

$$\begin{aligned} \mathrm{top-}k\%\mathrm{\ accuracy} = \frac{1}{|Q|} \sum ^{|Q|}_{i=1} InTopKp(k, q_i), \end{aligned}$$

where InTopKp(k, q) is a function that returns 1 if all relevant results are included in the top k% of the search results for query q, and 0 otherwise.
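A corresponding sketch is shown below (our own code; rounding the k% cutoff up with a ceiling is one plausible choice, which the text does not specify):

```python
import math

def in_top_k_percent(k, ranked, relevant):
    """1 if all relevant items fall within the top k% of the ranking, else 0."""
    cutoff = math.ceil(len(ranked) * k / 100)
    return int(set(relevant) <= set(ranked[:cutoff]))

def top_k_percent_accuracy(k, runs):
    """Fraction of (ranking, relevant set) pairs that succeed at cutoff k%."""
    return sum(in_top_k_percent(k, r, rel) for r, rel in runs) / len(runs)
```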

4.5 Results

In this section, we show the experimental results for evaluating FROGa. For the 4-way tests of Healthcare4 and SPINV, the processes did not complete within the timeout criterion for either BEN or FROGa; therefore, these results are reported as missing values (NA). Although we could not check all inputs, we confirmed that five randomly selected inputs for each of Healthcare4 and SPINV failed to complete within the timeout. Furthermore, by checking the execution logs, we confirmed that none of them had even finished extracting candidate schemas at the timeout.

4.5.1 Results for RQ1

Table 9 shows the MAP of BEN and FROGa for each test suite. The mean MAP of FROGa over the 16 test suites, excluding missing values, is 0.76, while that of BEN is 0.31. We can thus observe that the MAP of FROGa is well above that of BEN. Moreover, Fig. 1 compares the distributions of AveP as box plots; the upper graph compares the SUTs, and the lower graph compares the combinatorial coverages. Note that these graphs do not include the missing values, i.e., the 4-way tests of Healthcare4 and SPINV. There is little difference between the SUTs, while there is a significant difference between the combinatorial coverages: BEN loses accuracy as the combinatorial coverage increases, but FROGa keeps high accuracy even as the coverage increases.

Table 9 MAP of BEN and FROGa
Fig. 1
figure 1

Comparison of AveP distribution of BEN and FROGa for each SUT (top) and combinatorial coverage (bottom)

Table 10 Percentages of cases where \(AveP=1\)

In addition, Table 10 compares the percentage of samples that resulted in \(AveP=1\). \(AveP=1\) means that all failure-inducing schemas are at the top of the ranking for an input. The result shows that FROGa ranks much more accurately than BEN and can produce a perfect suspiciousness ranking with high probability. This implies that FROGa can reliably suggest the most suspicious candidate schemas.

Table 11 MAP calculated for each number of failure-inducing schemas (#FS)

For a more detailed analysis, Table 11 shows the MAPs calculated separately for each number of assumed failure-inducing schemas (#FS). FROGa is always superior to BEN, and both methods lose accuracy when several failure-inducing schemas exist. In particular, when there is only one failure-inducing schema, FROGa, unlike BEN, places it at the top of the ranking in almost all cases. In the most challenging situation in this experiment, the 4-way testing results with three failure-inducing schemas, the MAP of BEN is 0.07 while that of FROGa is 0.57, a significant difference in accuracy. As reported in a previous study (Niu et al., 2020), the quality of FIL decreases as the number of failure-inducing schemas increases for three adaptive methods, and our results confirm this trend.

Fig. 2

Top-k% accuracy of BEN and FROGa

The graph in Fig. 2 shows the top-k% accuracy for each SUT. The horizontal axis represents k, and the vertical axis represents the top-k% accuracy. The faster a curve reaches 1, the sooner developers can find all failure-inducing schemas by checking the candidate schemas in order from the top of the ranking. Note that the graphs cannot be compared across different combinatorial coverages because the populations of candidate schemas differ. The top-k% accuracy of FROGa reaches 1 faster than that of BEN for every SUT, and the advantage of FROGa over BEN grows as the combinatorial coverage increases. This trend is similar to that of MAP.

How FROGa handles false positives depends on how it is operated. If the computation of the suspiciousness score in the BEN framework is replaced by FROGa, false positives are eliminated by the iterative process that determines what truly causes the test failure. If, instead, one begins debugging by trusting only the suspiciousness score calculated by FROGa, false positives will delay the debugging process.

From these results, the answer to RQ1 "Does FROGa improve the accuracy of ranking suspiciousness as a failure-inducing schema compared to BEN?" can be obtained as follows.

Answer to RQ1

Compared to BEN, FROGa significantly improves the accuracy of ranking the suspiciousness of candidate schemas. In particular, when there is only one failure-inducing schema, FROGa almost certainly places that schema at the top of the ranking. In addition, FROGa keeps high accuracy even for results with a large SUT and high combinatorial coverage, where BEN has low accuracy.

4.5.2 Results for RQ2

Table 12 shows the average processing times of BEN and FROGa, and Fig. 3 compares the processing-time distributions as box plots: the upper graph compares SUTs, and the lower graph compares combinatorial coverages. Note that these graphs exclude the missing values, i.e., the 4-way tests of Healthcare4 and SPINV. We can confirm that there is little difference in time cost between BEN and FROGa. Therefore, we conclude that FROGa is the better method overall: it calculates suspiciousness more accurately at virtually the same time cost.

Table 12 The average processing time (s)
Fig. 3

Comparison of processing time distribution of BEN and FROGa for each SUT (top) and combinatorial coverage (bottom)

From the results, the answer to RQ2 "How much is the difference in the time cost between FROGa and BEN?" can be obtained as follows.

Answer to RQ2

There is little difference in time cost between BEN and FROGa.

5 Further improvement for cost effectiveness

5.1 Concept of FROGb

Treating all candidate schemas, as BEN and FROGa do, is a thorough and straightforward approach. However, the step that extracts all candidate schemas from all possible schemas may cause a combinatorial explosion and require substantial computational cost when targeting extensive testing results. For example, in the experiment in Section 4, neither BEN nor FROGa finished extracting all candidate schemas within 3,600 seconds for the 4-way test results of Healthcare4 and SPINV. Such an unrealistic time cost undermines the efficiency of bug repair, which is the primary purpose of FIL.

We believe that the combinatorial explosion can be avoided by directly extracting only the most suspicious candidate schemas without dealing with all schemas, and that logistic regression analysis can realize this.

It is natural to expect that test cases including sub-schemas of a failure-inducing schema are more likely to include the failure-inducing schema itself, and therefore more likely to fail, than test cases that do not. Consequently, the regression coefficients of the sub-schemas of failure-inducing schemas should be high. Conversely, if the regression coefficient of a schema \(s_x\) is high, \(s_x\) is expected to be a failure-inducing schema or one of its sub-schemas. Since \(s_x\) must be a candidate schema to be a failure-inducing schema, if \(s_x\) is not a candidate schema but has a high regression coefficient, we can expect \(s_x\) to be a sub-schema of a failure-inducing schema.

Thus, we formed the following hypothesis:

Hypothesis

All sub-schemas of failure-inducing schemas have logistic regression coefficients higher than 0.

The logistic regression coefficients in this hypothesis are obtained in the same way as in FROGa. We set the threshold to 0 to obtain the widest sensitivity: a positive regression coefficient means that a unit increase in that independent variable positively impacts the probability of a positive value of the dependent variable. If this hypothesis is correct, only a minimal number of super-schemas can include all of the limited sub-schemas. Therefore, we expect to reach failure-inducing schemas efficiently, while significantly reducing the search space, by successively obtaining super-schemas that satisfy the inclusion relation, starting from the 1-degree schemas obtained as sub-schemas. Based on this idea, we propose FROGb. FROGb obtains a small number of candidate schemas that are highly suspicious as failure-inducing schemas (hereafter referred to as high-suspicious candidate schemas) without extracting all candidate schemas from all possible schemas.
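The hypothesis can be checked on a toy example. The sketch below is our own illustration, not the paper's implementation: it replaces scikit-learn with a hand-rolled batch gradient descent, assumes a hypothetical SUT with three two-valued parameters, and plants (p1 = 1, p2 = 2) as the failure-inducing schema.

```python
import math

# Hypothetical SUT: three parameters, each taking value 1 or 2.
# Planted failure-inducing schema: (p1 = 1, p2 = 2).
tests = [(a, b, c) for a in (1, 2) for b in (1, 2) for c in (1, 2)]
y = [1 if t[0] == 1 and t[1] == 2 else 0 for t in tests]  # 1 = test failed

# One indicator feature per 1-degree schema (parameter index, value).
features = [(i, v) for i in range(3) for v in (1, 2)]
X = [[1 if t[i] == v else 0 for (i, v) in features] for t in tests]

# Minimal batch-gradient-descent logistic regression (an illustrative
# stand-in for the scikit-learn model used in the paper).
w, b = [0.0] * len(features), 0.0
for _ in range(3000):
    gw, gb = [0.0] * len(features), 0.0
    for xi, yi in zip(X, y):
        p = 1.0 / (1.0 + math.exp(-(sum(wj * xj for wj, xj in zip(w, xi)) + b)))
        gb += p - yi
        for j, xj in enumerate(xi):
            gw[j] += (p - yi) * xj
    w = [wj - 0.1 * g for wj, g in zip(w, gw)]
    b -= 0.1 * gb

coef = dict(zip(features, w))
# The 1-degree sub-schemas of the failure-inducing schema get positive
# coefficients, while the complementary parameter values get negative ones.
assert coef[(0, 1)] > 0 and coef[(1, 2)] > 0
assert coef[(0, 2)] < 0 and coef[(1, 1)] < 0
```

On this separable toy data the hypothesis holds; Section 6.4.2 discusses the situations in which it breaks down.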

5.2 Model

FROGb needs the same three inputs as FROGa: a test suite, the test result of each test case, and k, the assumed maximum degree of failure-inducing schemas. FROGb runs the following steps for each t-degree schema of \(1 \le t \le k\), step by step.

Initial Status:

  Let \(t = 1\). There are three empty sets: S, C, and SubS. S is the set of schemas currently in focus, C is the set of high-suspicious candidate schemas, and SubS is the set of schemas considered to be sub-schemas of failure-inducing schemas.

Step 1  (\(t=1\)):

  Extract all 1-degree schemas in the failed test cases and add them to S.

Step 2:

  For each schema in S, check whether it is a candidate schema according to the definition. If it is, delete it from S and add it to C.

Step 3:

  For each schema left in S, calculate the logistic regression coefficients following the FROGa procedure, and add the schemas with positive regression coefficients to SubS.

Step 4:

  Increment t by one and initialize S to be empty. Then, go to Step 1.

Step 1  (\(t\ge 2\)):

  Find all t-degree schemas that appear in some failed test case and whose (\(t-1\))-degree sub-schemas are all included in the set SubS, and add them to S. Then, initialize SubS to empty and go to Step 2.

There are two termination conditions for this algorithm:

  1.

    Terminate when the variable t exceeds the given maximum schema degree k, i.e., \(t>k\). More precisely, there is no need to check whether a k-degree schema is a sub-schema of a (\(k+1\))-degree high-suspicious candidate schema, so the algorithm terminates immediately after Step 2 with \(t=k\).

  2.

    Terminate when S is empty at the end of Step 1, or SubS is empty at the end of Step 3, even though \(t \le k\), because there is no schema to pass to the next step.

At the end of the algorithm, FROGb outputs C as the set of high-suspicious candidate schemas of degree k or less.
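The steps above can be condensed into a minimal, self-contained sketch. This is our own illustration, not the paper's implementation: it encodes a schema as a tuple with None for unfixed parameters, and substitutes a plain gradient-descent logistic regression for the scikit-learn model used in the paper; all helper names are ours.

```python
import itertools
import math

def subsumes(schema, test):
    """True if the test case contains the schema (None = unfixed parameter)."""
    return all(v is None or test[i] == v for i, v in enumerate(schema))

def sub_schemas(schema):
    """All (t-1)-degree sub-schemas of a t-degree schema."""
    for i in [j for j, v in enumerate(schema) if v is not None]:
        s = list(schema)
        s[i] = None
        yield tuple(s)

def fit_logistic(X, y, lr=0.1, iters=3000):
    """Minimal batch-gradient-descent logistic regression (illustrative)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(iters):
        gw, gb = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-(sum(wj * xj for wj, xj in zip(w, xi)) + b)))
            gb += p - yi
            for j, xj in enumerate(xi):
                gw[j] += (p - yi) * xj
        w = [wj - lr * g for wj, g in zip(w, gw)]
        b -= lr * gb
    return w

def frogb(tests, failed_flags, k):
    failed = [t for t, f in zip(tests, failed_flags) if f]
    passed = [t for t, f in zip(tests, failed_flags) if not f]
    C, SubS = set(), set()
    for t in range(1, k + 1):
        # Step 1: t-degree schemas appearing in failed test cases; for
        # t >= 2, all their (t-1)-degree sub-schemas must be in SubS.
        S = set()
        for case in failed:
            for pos in itertools.combinations(range(len(case)), t):
                schema = tuple(case[i] if i in pos else None
                               for i in range(len(case)))
                if t == 1 or all(s in SubS for s in sub_schemas(schema)):
                    S.add(schema)
        if not S:
            break  # termination condition 2
        # Step 2: candidate schemas (they already appear in a failed test
        # case; a candidate must not appear in any passed one) move to C.
        for s in [s for s in S if not any(subsumes(s, p) for p in passed)]:
            S.discard(s)
            C.add(s)
        if t == k or not S:
            break  # termination condition 1 (or 2)
        # Step 3: schemas with positive regression coefficients become SubS.
        S_list = sorted(S, key=str)
        X = [[1 if subsumes(s, case) else 0 for s in S_list] for case in tests]
        w = fit_logistic(X, [1 if f else 0 for f in failed_flags])
        SubS = {s for s, wj in zip(S_list, w) if wj > 0}
        if not SubS:
            break  # termination condition 2
    return C

# Toy check: three two-valued parameters, an exhaustive suite, and an
# assumed failure-inducing schema (p1 = 1, p2 = 2).
tests = [(a, b, c) for a in (1, 2) for b in (1, 2) for c in (1, 2)]
flags = [t[0] == 1 and t[1] == 2 for t in tests]
assert frogb(tests, flags, 2) == {(1, 2, None)}
```

The sketch recovers the planted schema without ever enumerating the full candidate set, which is the point of FROGb.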

5.3 Application example of FROGb

For a better understanding of FROGb, we show an application example using the test results shown in Table 2. Table 13 shows the schemas handled in each iteration, their logistic regression coefficients, and several judgment results. In this table, "\(s \in S\)" lists each schema included in S, "to C" indicates whether the schema is judged a candidate schema, and "to SubS" indicates whether it is judged a sub-schema.

Table 13 The handled schemas and their various judgment results in each iteration in FROGb with Example Test Result

As input, we give the example test suite and results shown in Table 2 and the maximum degree of schemas, \(k=3\).

\(t=1\):

  First, FROGb extracts all the 1-degree schemas included in the failed test cases and adds them to S. In this example, all possible 1-degree schemas except (-, -, -, 3, -) are added to S. Next, FROGb checks that none of the 1-degree schemas in S is a candidate schema, since each is also included in some passed test case. Then, we encode the inclusion relations between each schema in S and each test case and run logistic regression to obtain the regression coefficients. Here, the five 1-degree schemas (1, -, -, -, -), (-, 2, -, -, -), (-, -, 1, -, -), (-, -, -, 2, -), and (-, -, -, -, 1) are added to SubS as 1-degree sub-schemas of the failure-inducing schemas because their regression coefficients are higher than 0. Finally, S is emptied.

\(t=2\):

  FROGb extracts all the 2-degree schemas that are included in some failed test case and whose 1-degree sub-schemas are all included in SubS, then adds them to the empty S. As a result, the nine 2-degree schemas shown in the \(t=2\) rows of Table 13 are added to S. For example, the 2-degree schema (1, 2, -, -, -) is included in (1, 2, 1, 1, 1), the failed test case #8 in Table 2, and both of its 1-degree sub-schemas (1, -, -, -, -) and (-, 2, -, -, -) are included in SubS; therefore, it is added to S. SubS is then emptied. Next, FROGb checks whether each \(s \in S\) is a candidate schema. Only (1, -, -, 2, -) is a candidate schema, so it is removed from S and added to C. We compute the regression coefficients for the schemas left in S, and six 2-degree schemas are added to SubS. Finally, S is emptied.

\(t=3\):

  FROGb extracts all 3-degree schemas that are included in some failed test case and whose 2-degree sub-schemas are all included in SubS, and adds them to the empty S. There are only two such schemas: (-, 2, 1, 2, -) and (-, 2, 1, -, 1). Only (-, 2, 1, -, 1) is judged a candidate schema, so it is removed from S and added to C. Here, FROGb finishes because the first termination condition is satisfied.

At the end of the algorithm, exactly two high-suspicious candidate schemas, (1, -, -, 2, -) and (-, 2, 1, -, 1), have been obtained. We can confirm that these are indeed the failure-inducing schemas.

6 Evaluation of FROGb

6.1 Research questions

In order to evaluate FROGb, we set the following research questions.

RQ3.:

How much does FROGb reduce the cost of extracting high-suspicious candidate schemas compared to BEN and FROGa?

RQ4.:

How accurately does FROGb extract failure-inducing schemas as high-suspicious candidate schemas?

6.2 Setting

We reused the results in Section 4.5 to compare FROGb with BEN and FROGa. Therefore, the experimental subjects are the same as in Section 4.2, and the experimental environment is the same as in Section 4.3.

We newly implemented and executed FROGb, measured several metrics, and compared them with the results already obtained for BEN and FROGa.

To answer RQ3, we measured the processing time until FROGb finishes. FROGb's purpose, extracting high-suspicious candidate schemas, can also be achieved by picking candidate schemas from the top of the suspiciousness ranking created by BEN or FROGa. Therefore, the measured processing time of FROGb can be fairly compared with the processing times of BEN and FROGa. In addition, we counted the number of high-suspicious candidate schemas extracted by FROGb to check whether FROGb extracts only a few of them.

To answer RQ4, we investigated the number of the artificially generated failure-inducing schemas included in the high-suspicious candidate schemas extracted by FROGb.

6.3 Results

6.3.1 Results for RQ3

Table 14 shows the average processing time and two comparison indexes for BEN, FROGa, and FROGb. The index \(\%red\) represents the average reduction rate of the processing time of FROGb, and \(\%short\) represents the percentage of cases where FROGb saved time. Both indexes compare FROGb with the shorter of the processing times of BEN and FROGa. In addition, Fig. 4 shows a box plot comparing the distributions of processing time.

Table 14 The average processing time, the reduction rate (%red), and the rate of shortened cases (%short)
Fig. 4
figure 4

Comparison of processing time distribution of BEN, FROGa and FROGb

As a result, the reduction in processing time depends on the combinatorial coverage. For the 2-way tests, FROGb brought only a slight time reduction; in particular, for the 2-way tests of Healthcare4 and SPINV, the average time reduction rate was below zero, and the processing time increased in more than half of the cases, presumably because FROGb's overhead exceeded the time it saved. For the 3-way tests, the time reduction rate was about 50%, and the processing time was reduced in about 96% of all cases. For the 4-way tests, the time reduction was even more pronounced; in particular, for the 4-way tests of Healthcare4 and SPINV, FROGb completed extracting high-suspicious candidate schemas in a few seconds. This is a remarkable improvement considering that BEN and FROGa could not complete the process within 3,600 seconds.

Table 15 The average number of extracted candidate schemas (#Candidate) and of checking operations for extracting candidate schemas (#Check), for the all-extraction method (All-ex, i.e., BEN and FROGa) and FROGb, and the reduction rates of FROGb against All-ex (%red)
Fig. 5

Comparison of the distributions of #Candidate (top) and #Check (bottom)

For simplicity, we refer to BEN and FROGa collectively as the all-extraction method (All-ex) because, unlike FROGb, they first extract all candidate schemas. The left side of Table 15 (#Candidate row) shows the average number of candidate schemas extracted by the all-extraction method, the average number of high-suspicious candidate schemas extracted by FROGb, and the reduction rates. The right side (#Check row) shows the average number of operations checking whether a schema is a candidate schema for the all-extraction method and FROGb, and the reduction rates. Furthermore, the upper part of Fig. 5 compares the distributions of the number of extracted candidate schemas, and the lower part compares the distributions of the number of checking operations.

As a result, FROGb extracts far fewer high-suspicious candidate schemas than the all-extraction method extracts in its first step. For each t-way test, the reduction rates are 50.6% \((t=2)\), 91.5% \((t=3)\), and 99.0% \((t=4)\): the higher the combinatorial coverage, the fewer high-suspicious candidate schemas FROGb outputs. On the other hand, there was no difference between SUTs.

The number of checking operations in FROGb decreased as the combinatorial coverage increased, the opposite trend to the all-extraction method. The reason is that the higher the maximum number of iterations k of the FROGb algorithm, the tighter the constraints on the sub-schemas that an output candidate schema must satisfy. For most SUTs with \(k=4\), the average number of high-suspicious candidate schemas was around five. This result shows that FROGb outputs only high-suspicious candidate schemas with a minimum of checking operations.

From the results, the answer to RQ3 "How much does FROGb reduce the cost of extracting high-suspicious candidate schemas compared to BEN and FROGa?" can be obtained as follows.

Answer to RQ3

FROGb extracts only a tiny number of high-suspicious candidate schemas and significantly reduces the processing time compared to BEN and FROGa, especially for testing results with high combinatorial coverage.

6.3.2 Results for RQ4

Table 16 shows whether the failure-inducing schemas were included in the high-suspicious candidate schemas extracted by FROGb. The results are aggregated into three categories: All, Partly, and No. They show that FROGb does not always extract all failure-inducing schemas: on average, FROGb extracted all failure-inducing schemas in 63.3% of the cases, only some of them in 32.2%, and none in only 4.4%. Moreover, there was little difference depending on the combinatorial coverage. On the other hand, the larger the SUT, the higher the percentage of cases in which all failure-inducing schemas were extracted as high-suspicious candidate schemas.

Table 16 The classification rates of how many failure-inducing schemas were included in the high-suspicious candidate schemas extracted by FROGb
Table 17 The percentages of cases in which all n failure-inducing schemas were included in the high-suspicious candidate schemas extracted by FROGb, for each n

Next, Table 17 shows the percentage of cases in which all failure-inducing schemas were included in the high-suspicious candidate schemas extracted by FROGb, aggregated by the number of failure-inducing schemas. The fewer the failure-inducing schemas, the higher the percentage. For example, for the 2-way testing results of SPINV with only one failure-inducing schema, FROGb extracted all failure-inducing schemas with very high accuracy, 99.9%; however, with three failure-inducing schemas, the accuracy drops sharply to 68.7%. This suggests that multiple failure-inducing schemas hinder the extraction of each other. In addition, this difference in accuracy is emphasized by the size of the SUT. For example, the accuracies of extracting all failure-inducing schemas are 50-70% for HC4 and SPINV, which have many input parameters, but only 9-27% for SM, which has the fewest input parameters.

From the results, the answer to RQ4 "How accurately does FROGb extract failure-inducing schemas as high-suspicious candidate schemas?" can be obtained as follows.

Answer to RQ4

FROGb extracts all failure-inducing schemas in 63.3% of all cases and at least one of them in 95.6%. This accuracy tends to increase as the size of the SUT increases and the number of failure-inducing schemas decreases.

6.4 Discussion

6.4.1 Overall evaluation

Contrary to our hypothesis, the answer to RQ4 reveals that not all sub-schemas of failure-inducing schemas have positive logistic regression coefficients. Therefore, FROGb does not necessarily produce accurate results. However, in our experiments, all failure-inducing schemas were extracted as high-suspicious candidate schemas in about 63.3% of the cases. This accuracy is not low; on the contrary, it is a good tradeoff for the reduced processing time on targets for which BEN and FROGa take an unrealistic processing time.

In addition, identifying even a single failure-inducing schema can be valuable. For example, if multiple failure-inducing schemas induce the same defect, identifying one of them helps to identify the defect behind all of them.

Moreover, consider the case where some defects can be repaired by identifying only some of the failure-inducing schemas. It may then be possible to identify all defects step by step, by rerunning combinatorial testing and obtaining different test results. In this sense, FROGb can, in practice, obtain effective localization results with 95.6% accuracy.

Based on these considerations, we conclude that FROGb is an efficient and fast, though approximate, FIL approach.

6.4.2 Causes of failure of FROGb

We identified two causes of failure by checking several processing logs where FROGb failed to extract all the failure-inducing schemas.

The first cause is the collision of multiple failure-inducing schemas. A collision is a situation where there are multiple failure-inducing schemas and the 1-degree schemas corresponding to all possible values of a parameter are included in different failure-inducing schemas. FROGb expects the regression coefficients of all the 1-degree sub-schemas of a failure-inducing schema to be positive. However, due to the relativity of regression coefficients, only some of these 1-degree schemas can take positive values. Therefore, FROGb cannot extract all the correct 1-degree schemas as sub-schemas and can successfully obtain only some of the failure-inducing schemas. For example, consider \(p_n\), the n-th parameter of the SUT, which can take two different values, \(p_{n1}\) and \(p_{n2}\). In this case, the regression coefficient of either \(p_{n1}\) or \(p_{n2}\) will be greater than zero, and the other will be less than zero.
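The collision can be reproduced in its most extreme form with a hypothetical toy SUT of our own construction (again using a hand-rolled logistic regression in place of scikit-learn): when two failure-inducing schemas together cover both values of two parameters, every 1-degree schema covers exactly half failing and half passing test cases, so no coefficient can become positive at all.

```python
import math

# Hypothetical extreme collision: two failure-inducing schemas,
# f1 = (p1=1, p2=1, -) and f2 = (p1=2, p2=2, -), so BOTH values of p1
# (and of p2) belong to some failure-inducing schema.
tests = [(a, b, c) for a in (1, 2) for b in (1, 2) for c in (1, 2)]
y = [1 if t[0] == t[1] else 0 for t in tests]  # fails iff f1 or f2 present

features = [(i, v) for i in range(3) for v in (1, 2)]
X = [[1 if t[i] == v else 0 for (i, v) in features] for t in tests]

# Plain batch gradient descent for logistic regression (illustrative).
w, b = [0.0] * len(features), 0.0
for _ in range(1000):
    gw, gb = [0.0] * len(features), 0.0
    for xi, yi in zip(X, y):
        p = 1.0 / (1.0 + math.exp(-(sum(wj * xj for wj, xj in zip(w, xi)) + b)))
        gb += p - yi
        for j, xj in enumerate(xi):
            gw[j] += (p - yi) * xj
    w = [wj - 0.1 * g for wj, g in zip(w, gw)]
    b -= 0.1 * gb

# Every 1-degree schema covers two failing and two passing test cases,
# so all gradients vanish and no coefficient ever becomes positive:
assert all(abs(wj) < 1e-9 for wj in w)
```

With no positive coefficient, no 1-degree schema enters SubS and the search stops at \(t=1\), so neither failure-inducing schema is reached; milder collisions leave one of the two parameter values negative, cutting off one of the schemas.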

In our post-hoc analysis, collisions occurred in 12,319 of the 44,368 cases where FROGb could not extract all failure-inducing schemas. In addition, all cases that involved multiple failure-inducing schemas with a collision failed. This cause explains why, in our experiments, the accuracy of extracting all failure-inducing schemas decreases as the SUT becomes smaller and the number of failure-inducing schemas grows: since failure-inducing schemas were assigned randomly in our experiment, the probability of a collision naturally increases with the number of failure-inducing schemas, and it decreases when the SUT has many input parameters that can take various values.

The second cause is simply accidental: it depends on which input values happened to be assigned to each test case of the designed test suite. We confirmed the following two cases of accidental identification failures.

  1.

    There are multiple failure-inducing schemas. Consider \(p_n\), the n-th parameter of the SUT, which can take two different values, \(p_{n1}\) and \(p_{n2}\). One failure-inducing schema, \(f_1\), contains \(p_{n1}\) as a sub-schema. However, all the test cases that fail due to failure-inducing schemas other than \(f_1\) happen to contain \(p_{n2}\). In this case, the value of parameter \(p_n\) appears irrelevant to the test failures; therefore, \(p_{n1}\) is not estimated as a sub-schema of \(f_1\), and a wrong result is output.

  2.

    There are multiple failure-inducing schemas. Consider \(p_n\), the n-th parameter of the SUT, which can take two different values, \(p_{n1}\) and \(p_{n2}\). One failure-inducing schema, \(f_1\), contains \(p_{n1}\) as a sub-schema. However, the regression coefficient of \(p_{n1}\) will be lower than zero if the tests fail, by chance, more often due to other failure-inducing schemas when \(p_{n2}\) is included in the test case than when \(p_{n1}\) is. Therefore, \(p_{n1}\) is not estimated as a sub-schema of \(f_1\), and a wrong result is output.

Since we could not find any other causes of failure, the remaining 32,049 failures that the first cause cannot explain must be due to the second cause.

6.4.3 Limitation of FROGb

FROGb should not be applied to combinatorial testing results in which only one test case failed. In this case, FROGb's algorithm always predicts all the 1-degree schemas in the failed test case to be sub-schemas of the failure-inducing schema, so all the candidate schemas are extracted as high-suspicious ones, and the result no longer provides helpful information. In such cases, we recommend an alternative FIL method based on modifying and rerunning the single failed test case, such as OFOT (Nie & Leung, 2011).

As we saw, FROGb does not obtain all candidate schemas. Therefore, we cannot establish that an obtained high-suspicious candidate schema is indeed a failure-inducing schema by running a test case that contains it but no other candidate schema and checking that the test case fails. However, if necessary, confidence can be increased by running several random test cases that include the high-suspicious candidate schemas, like the checking mechanism used in ICT (Niu et al., 2020).

7 Threats to validity

We used Python to implement BEN, FROGa, and FROGb. Python is slower than compiled languages, and our implementation and environment may not be optimized. Therefore, the processing times measured in our experiments may be longer than ideal. However, since all methods are affected equally, the comparison results remain valid.

We calculated the suspiciousness score in BEN based on the formulas provided in the original paper (Ghandehari et al., 2012) and confirmed its accuracy against the examples from the same source. However, we did not conduct a detailed operational check, which poses a potential threat to the validity of our results. Future studies should include thorough validation processes to ensure the reliability of the implemented algorithm.

We used only one type of machine learning package, scikit-learn, and its default parameter values to implement the logistic regression because we had no significant problems with the behavior and results of FROGa and FROGb in this case. On the other hand, we have not yet tested the behavior with other packages and parameters. A more detailed investigation is required to see how changing and optimizing these packages and parameters will change the experimental results and improve the accuracy.

In evaluating FROGa, we did not conduct experiments involving iterative identification, as shown in the original BEN paper. Instead, we kept the comparison simple by evaluating the accuracy of suspiciousness scores. This decision was due to the absence of programs with actual bugs as experimental subjects, which made adjustments difficult, and to time constraints during the experiments. Still, because the BEN framework improves suspiciousness-score accuracy through iteration, we believe that FROGa's high initial accuracy effectively enhances fault localization efficiency. Nevertheless, an evaluation aligned with the BEN framework may yield unknown discoveries, making it an important task for future research.

8 Related work

Several FIL approaches have been proposed. They are categorized into adaptive and non-adaptive approaches according to whether they involve the adaptive generation and execution of additional test cases. A recent survey of FIL is given by Jayaram and Krishnan (2015).

8.1 Adaptive FIL approaches

The adaptive approaches mainly identify failure-inducing schemas by designing and executing new test cases based on the execution results of previous test cases (Nie & Leung, 2011; Zeller & Hildebrandt, 2002; Wang et al., 2010; Li et al., 2012; Zhang & Zhang, 2011). Some methods take a single failed test case as input, generate and execute additional test cases by changing the value of one of its parameters, and identify the failure-inducing schema in the original test case from the changes in the execution results. Many of these methods are known to work only when there is exactly one failure-inducing schema; for example, OFOT (Nie & Leung, 2011) and the methods using Delta Debugging (Zeller & Hildebrandt, 2002; Wang et al., 2010; Li et al., 2012) do not give correct results in many cases otherwise. To address this problem, Zhang and Zhang (2011) proposed FIC, which partially supports the localization of multiple failure-inducing schemas. In addition, Niu et al. (2020) proposed ICT, which performs localization more effectively by dynamically interleaving the design and execution of single test cases with the localization of input values. ICT further checks whether the identified failure-inducing schemas actually lead to test failures by using a checking mechanism that extends OFOT, preventing the misidentifications that occur when there are multiple failure-inducing input-value pairs. These methods assume a cycle that (1) interrupts the sequential execution of tests when a failure occurs, (2) identifies the failure-inducing schemas, (3) corrects the defective components using the identification result, and (4) tests again.

On the other hand, some methods execute all the test cases of a normal covering array and then perform localization based on all those results. Owing to the recent development of automated testing techniques such as continuous integration, obtaining the test results of an entire covering array has become realistic enough for this to be a promising direction. As the most primitive method, Yilmaz et al. (2006) proposed estimating the failure-inducing schemas using classification trees. While this estimation requires no additional test cases, it is not guaranteed to be accurate. The effectiveness of this approach is also known to depend highly on the characteristics of the covering array; it does not work well if, for example, the majority of a covering array consisting of a small number of test cases fails. Fouché et al. (2009) and Shakya et al. (2012) proposed extensions of Yilmaz et al.'s work with some improvements, and Yilmaz et al. (2014) devised a framework to feed estimation results back into test case generation.

Other approaches first extract all candidate schemas that may be failure-inducing and then generate additional test cases to verify that they are indeed failure-inducing (Shi et al., 2005; Wang et al., 2010; Zhang et al., 2012; Zheng et al., 2016; Niu et al., 2013; Ghandehari et al., 2012, 2015, 2020). AIFL by Shi et al. (2005) attempts to find a single failure-inducing schema, and InterAIFL, its extension by Wang et al. (2010), can find multiple failure-inducing schemas. ComFIL, proposed by Zheng et al. (2016), finds multiple failure-inducing schemas by elimination and generates test cases from all candidate schemas so that a single test reduces the most candidates. In addition, Niu et al. (2013) attempt to optimize the design of additional test cases by constructing a tuple relationship tree that describes the relationships between candidate schemas. BEN, proposed by Ghandehari et al. (2012, 2015, 2020), performs efficient localization by ranking the candidate schemas based on calculated suspiciousness values. Several studies regard BEN as a strong candidate among FIL methods (Gargantini et al., 2017; Bonn et al., 2019).

The proposed FROGa is an extension of the existing adaptive method BEN and thus belongs to the adaptive methods. On the other hand, FROGb is a predictive method, so these classifications do not necessarily apply to it. The idea of FROGb is based on our earlier work (Nishiura et al., 2017); however, in this paper, we give a stricter and more detailed model definition, treat a new variety of experimental subjects in the evaluation experiments, and add new discussion.

8.2 Non-adaptive FIL approaches

In contrast to the adaptive approaches, the non-adaptive approaches require no creation and execution of additional test cases. They mainly generate a Locating Array with a particular localization capability when designing a covering array. For example, Colbourn and McClary (2008) proposed the (d, t)-Locating Array, which can uniquely identify up to d failure-inducing schemas of degree t or less by using a covering array that covers all \((d+t)\)-degree schemas. As a similar mathematical object, Martínez et al. (2009) proposed the Error Locating Array, which assumes a structure of the input parameters related to the failure-inducing schemas. Several developments of the Locating Array have also been proposed. Hagar et al. (2015) proposed a partial covering array method that uses the known safe input values of the target software in the design. Nagamoto et al. (2014) focused on pairwise testing and proposed a method to effectively generate a (1, 2)-Locating Array from a given test suite. Jin et al. (2018) proposed the Constrained Locating Array, an extension of the Locating Array that deals with constrained input models. A recent survey of Locating Array research and its applications is given by Colbourn and Syrotiuk (2018).

The advantage of the non-adaptive methods is that the design and execution of the test suite can be wholly separated. The disadvantages are as follows. The developers need to know the number and the degree of failure-inducing schemas to design a Locating Array; alternatively, they must assume these numbers, and localization succeeds only if the assumptions are correct. In addition, the number of test cases in a Locating Array is much larger than in a naive covering array, so the execution cost is much higher. These limitations make the non-adaptive methods less attractive in practice, and they have not been fundamentally resolved yet.

9 Conclusion

This paper proposed two novel faulty interaction localization approaches using logistic regression analysis: FROGa and FROGb. FROGa improves the accuracy of computing the suspiciousness of combinations of parameter values by using logistic regression. In addition, FROGb avoids a combinatorial explosion by estimating the subsets of failure-inducing combinations and exploring their supersets; this estimation also uses the logistic regression coefficients. Through evaluation experiments using a large number of artificial test results based on several real systems, we observed that FROGa has very high accuracy and that FROGb drastically reduces the computing cost for targets that the conventional method could not finish identifying.

Our research leaves several challenges. One is to improve accuracy through various logistic regression implementations. Another is to quantitatively evaluate the accuracy improvement and cost increase of using additional test designs and reruns. We would also like to evaluate the approaches on faults detected by combinatorial testing in practice, i.e., reported faults rather than artificially generated ones.