
1 Introduction

Manual debugging is a difficult task that consumes substantial resources in the software development process [21]. It is reported that up to \(80\%\) of the total software development budget may be consumed by debugging tasks [20]. To address this problem, a wide variety of Automated Fault Localization (AFL) techniques have been proposed in the literature to assist developers in locating the root causes of failures [23]. Several approaches to automated fault localization exist, such as slicing-based [12, 22, 27], machine-learning-based [14, 26, 30], and spectrum-based fault localization [1, 6, 10, 19, 28]. The Spectrum-based Fault Localization (SFL) approach has been shown to be competitive with the rest [16]. Moreover, SFL is lightweight and can be applied to large-scale programs [29].

SFL techniques execute a given program with an existing set of passing and failing test cases. Then, leveraging program spectra [23] (i.e., program execution traces of test cases) and employing a ranking metric [26], they compute the suspiciousness scores of program entities. Program entities are source code elements of any granularity, such as statements, methods, and basic blocks. Suspiciousness scores indicate how likely each program entity is to be faulty, and ranking metrics assign higher suspiciousness scores to entities covered by more failing tests and fewer passing ones. After the suspiciousness scores are computed, program entities are sorted accordingly and handed to developers or automated program repair techniques [13]. Finally, the source code is examined from the most suspicious entity to the least suspicious one in order to diagnose the root causes of failures.

Several SFL techniques such as Ochiai [1] exist in the literature, each of which performs effectively on specific programs while failing to rank the entities of other programs appropriately [28]. In other words, for most programs, current techniques assign higher suspiciousness scores to program entities that are not related to the fault at hand [12]. Our intuition is that this issue can be addressed if program characteristics are considered while suspiciousness scores are computed, as also noted by Wong et al. [23]. The semantics and structure of programs are two examples of such characteristics. We believe that program characteristics can guide us toward finding the right SFL techniques (among the existing ones) for any given program. We also hypothesize that they can help us combine various existing ranking metrics (i.e., SFL techniques) to produce more effective ranking metrics, explicitly customized for a given program.

In this paper, we present an approach that combines various ranking metrics to generate an effective one for a given program. In this approach, first, using mutation testing [9], several mutants are produced for the given program and executed against an existing test suite. Then, runtime data such as the program spectra generated for the mutants are collected and employed as a representation of program characteristics. Afterward, these runtime data are used to compute the effectiveness of 40 state-of-the-art ranking metrics. Finally, considering the effectiveness calculated for these ranking metrics and employing preferential voting systems [2, 4, 11, 18], the 40 ranking metrics are combined to generate a new ranking metric. We evaluate our approach on 154 faulty versions of the Siemens suite and the Space program and compare it with nine state-of-the-art SFL techniques. According to the experimental results, the ranking metrics produced by our approach are consistently more effective than the nine comparative ranking metrics with respect to well-known evaluation metrics such as the Exam score and TOP-N.

The remainder of this paper is structured as follows: Sect. 2 reviews preliminary materials and related work; Sect. 3 presents the proposed approach of this paper; Sect. 4 provides the experimental results and discussions; Sect. 5 concludes this work.

2 Background and Related Work

In the following, Sect. 2.1 provides a brief description of the preliminary materials related to our work, and Sect. 2.2 reviews some of the state-of-the-art automated fault localization techniques.

2.1 Preliminaries

Spectrum-Based Fault Localization. The goal of Spectrum-based Fault Localization (SFL) techniques is to locate faulty program entities such as statements, methods, and basic blocks. SFL techniques take as input a faulty program and two sets of test cases, one containing failing test cases and the other containing passing ones. They then collect program execution traces of the test cases, referred to as program spectra [23], by instrumenting and executing the given program with the failing and passing test cases. Each program spectrum reports which program entities are executed by a test case. Various tools can record program spectra; for instance, in our experiments, we use Gcov [8] to instrument programs and retrieve runtime data. Based on program spectra, several statistics are computed for each program entity \(e_{j}\), namely \(N_{CF} (e_{j})\), \(N_{CS} (e_{j})\), \(N_{UF} (e_{j})\), and \(N_{US} (e_{j})\), which are the number of failing and passing (successful) test cases covering \(e_{j}\), and the number of failing and passing test cases not covering \(e_{j}\), respectively. Using these statistics and employing a ranking metric [26] such as Ochiai [1], shown in Eq. (1), SFL techniques compute the suspiciousness score of every program entity. After the suspiciousness scores are computed, the program entities are sorted and handed to developers or automated program repair techniques [13] to assist them in their debugging task.

$$\begin{aligned} Score_{Ochiai}(e_{j}) = \frac{N_{CF} (e_{j})}{\sqrt{(N_{CF} (e_{j}) + N_{UF} (e_{j})) \times (N_{CF} (e_{j}) + N_{CS} (e_{j}))}} \end{aligned}$$
(1)
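To make this computation concrete, the following is a minimal C++ sketch (not taken from the paper's implementation) that derives the above statistics from a coverage matrix and applies Eq. (1); the encoding of coverage and test outcomes, as well as the tiny example in main, are illustrative assumptions.

```cpp
#include <cmath>
#include <iostream>
#include <vector>

// covered[t][e] == 1 means test t executes entity e; failed[t] == true means test t fails.
std::vector<double> ochiaiScores(const std::vector<std::vector<int>>& covered,
                                 const std::vector<bool>& failed) {
    const std::size_t numTests = covered.size();
    const std::size_t numEntities = numTests ? covered[0].size() : 0;
    std::vector<double> scores(numEntities, 0.0);
    for (std::size_t e = 0; e < numEntities; ++e) {
        int ncf = 0, ncs = 0, nuf = 0;            // N_US is not needed by Eq. (1)
        for (std::size_t t = 0; t < numTests; ++t) {
            if (covered[t][e] && failed[t])  ++ncf;
            else if (covered[t][e])          ++ncs;
            else if (failed[t])              ++nuf;
        }
        const double denom = std::sqrt(double(ncf + nuf) * double(ncf + ncs));
        scores[e] = denom > 0.0 ? ncf / denom : 0.0;
    }
    return scores;
}

int main() {
    // Three entities, four tests; only the last test fails and it covers entity 2.
    std::vector<std::vector<int>> covered = {{1, 1, 0}, {1, 0, 0}, {0, 1, 0}, {0, 1, 1}};
    std::vector<bool> failed = {false, false, false, true};
    for (double s : ochiaiScores(covered, failed)) std::cout << s << ' ';
    std::cout << '\n';                            // entity 2 receives the highest score
}
```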

Mutation Testing. As a testing technique, mutation testing [9] is used to measure the effectiveness of test suites with respect to their ability to detect faults in programs. This technique produces several mutants \(p_i\) (\(1 \le i \le m\)) for a program p by seeding it with m faults. Faults are seeded by applying mutation operators, which perform syntactic modifications to programs, such as replacing a relational operator with another one. Then, the mutants are executed against the whole test suite. If the result or behavior of a mutant \(p_i\) differs from that of p, \(p_i\) is said to be killed. The higher the number of killed mutants, the more effective the test suite. Besides being the most successful metric for measuring test suite effectiveness [3], mutation testing can be used for other purposes as well. For example, state-of-the-art automated program repair techniques such as ELIXIR [17] apply various mutation operators for patch generation. In this paper, we use mutation testing to measure the fault-localization capability of SFL techniques for a given program. We generate several mutants for the program at hand and then compute the effectiveness of SFL techniques at finding the faults in these mutants.

Preferential Voting System. Ranked voting refers to electoral systems in which voters can vote for more than one candidate and sort the candidates in their ballots in order of preference. This type of ballot, referred to as a ranked ballot, contains more information than one that names only a single candidate. Therefore, ranked ballots must be processed and aggregated using dedicated methods called preferential voting systems. There are various preferential voting systems in the literature, each of which is subject to criteria such as monotonicity, which states that when a candidate is the winner of the election, changing a ballot in favor of this candidate must keep it the winner. Reviewing these criteria and the advantages of different preferential voting systems is beyond the scope of this paper; we encourage interested readers to refer to [5] for further details. For this research, we choose four preferential voting systems, namely Instant Run-Off Voting [4], Kemeny-Young [11], Condorcet [2], and Schulze [18], because of their popularity among researchers. We use these systems to aggregate the ranked ballots produced by different mutants, which act as voters that prioritize various SFL techniques (ranking metrics) in their ballots.

2.2 Automated Fault Localization Techniques

There are hundreds of studies on Automated Fault Localization (AFL) techniques [23]. Program slicing-based AFL techniques obtain a slice of a given program by collecting its executable statements that might have an impact on the value of a specified variable. Xuan and Monperrus [27] proposed a method called test case purification which utilizes program slicing to reduce failing test cases with several assertions into several test cases with only one assertion each. They also showed that employing test case purification improves the fault detection capability of spectrum-based fault localization techniques. Mao et al. [12] proposed an approach which first employs program slicing to identify program entities that affect the given program output and then uses a spectrum-based fault localization technique to rank the remaining entities with respect to their suspiciousness. Wang et al. [22] presented a debugging framework called DrDebug that enables users to debug multi-threaded programs while focusing on a specific slice.

Machine learning, a field of artificial intelligence, has been used in various studies on different software engineering tasks such as automated program repair [17]. It has also been used in automated fault localization. Xuan and Monperrus [26] employed machine learning to present a fault localization technique that estimates the suspiciousness of program entities by automatically combining 25 ranking metrics. Zhang and Zhang [30] employed a Markov logic network to compute the suspiciousness of program statements. Nath and Domingos [14] presented a probabilistic fault localization technique that finds faults according to the bug patterns it learns. This technique can employ the output of spectrum-based fault localization techniques as features and can be trained on a set of faulty programs.

Spectrum-based fault localization is probably the most studied approach in the field and is thoroughly reviewed in [19]. The first ranking metric, Tarantula, was proposed by Jones et al. [10]; it is based on the idea that program entities covered by more failing and fewer passing test cases are the most suspicious of being the root causes of failures. Dallmeier et al. [6] proposed Ample as a plug-in for the Java IDE Eclipse to locate faults in object-oriented programs. Abreu et al. [1] studied three widely used ranking metrics, Tarantula, Ample, and Ochiai, and reported that Ochiai outperforms the other two. Yoo et al. [28] studied different ranking metrics and found that some of them are equivalent and do not dominate each other. They also concluded that no single ranking metric outperforms all the others for every program.

The proposed approach of this paper is not based on program slicing and does not employ machine learning. Our approach combines several existing SFL techniques using preferential voting systems and mutation testing. In this regard, it is different from the studies mentioned above.

3 Proposed Approach

This section presents the proposed approach of this paper, by which an effective ranking metric is produced for a given program. As illustrated in Fig. 1, the proposed approach receives three inputs: (1) a program p, for which a new ranking metric is produced; (2) a test suite TS; and (3) n existing ranking metrics. In two phases, the proposed approach generates a new ranking metric for p by combining the n given ranking metrics.

Fig. 1.
figure 1

Overall structure of the proposed approach

In the first phase, various mutation operators are applied to p to generate m mutants. Then, the mutants are executed against every test case in TS, and the execution results are collected and passed to the second phase (see Sect. 3.1 for details). In the second phase, the effectiveness of each of the n ranking metrics is computed for every mutant, using the output of the first phase. Then, these ranking metrics are combined based on their effectiveness to generate a new ranking metric that is more effective for p than each of the n given ranking metrics individually (see Sect. 3.2 for details).

3.1 Phase 1: Information Retrieval

The proposed approach generates ranking metrics specific to a given program. To this end, the characteristics of the given program must be retrieved and taken into consideration. The purpose of this phase is to collect this information, employing mutation testing. As illustrated in Fig. 2, this phase comprises two steps.

Fig. 2.
figure 2

Details of phase 1

Step 1: Mutant Generation. At this step, m mutants are generated for the given program p, subject to three criteria: (1) the test suite TS must be capable of killing them all; (2) the mutants must be free of infinite loops; (3) executing the mutants on TS must not result in any crashes or runtime errors. Mutants that do not satisfy these criteria are discarded, and new mutants are generated to replace them. The following mutation operators are randomly applied to seed a fault in a randomly selected statement (a minimal sketch of one such operator follows the list):

  • modifying a character or numerical literal

  • changing a relational operator (e.g., >)

  • changing a logical operator (e.g., &&)

  • replacing a function call by another one with the same signature

  • replacing a variable by another variable of the same type

  • inserting a statement

  • replacing a predicate with TRUE or FALSE.
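As referenced above, the following is a small, illustrative C++ sketch of the relational-operator mutation applied to a statement given as a string; a real implementation would operate on the parsed program rather than on raw text, and the operator list and random choice are our own simplifications.

```cpp
#include <iostream>
#include <random>
#include <string>
#include <vector>

// Replace the first relational operator found in a statement with a different one.
std::string mutateRelationalOperator(std::string stmt, std::mt19937& rng) {
    const std::vector<std::string> ops = {"<=", ">=", "==", "!=", "<", ">"};
    for (const std::string& from : ops) {
        std::size_t pos = stmt.find(from);
        if (pos == std::string::npos) continue;
        std::string to;
        do {  // pick a different operator at random
            to = ops[std::uniform_int_distribution<std::size_t>(0, ops.size() - 1)(rng)];
        } while (to == from);
        return stmt.replace(pos, from.size(), to);
    }
    return stmt;  // no relational operator found: statement left unchanged
}

int main() {
    std::mt19937 rng(42);
    std::cout << mutateRelationalOperator("if (i < n)", rng) << '\n';
}
```

A mutant produced this way would then be compiled and run against TS, and discarded if it violates any of the three criteria above.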

Step 2: Execution. At this step, each mutant produced at the previous step is executed against every test case in TS. As a result, a matrix known as the program spectra is produced for each mutant; we refer to it as the coverage matrix. The output produced by executing each mutant on TS is also collected. By comparing a mutant's output with the output produced for p, the execution results for that mutant are obtained. For instance, Fig. 3 shows an example of a coverage matrix collected for a mutant, along with its execution results. Column one lists the five test cases within the test suite. Columns two through eight show the coverage matrix, where 0s and 1s indicate that program entity \(e_{i}\) is covered and not covered, respectively, when executed by test case \(t_{i}\). Column nine contains the execution results for the mutant, where 0s and 1s indicate that test case \(t_{i}\) failed and passed, respectively.
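The execution-result column can be derived by a straightforward output comparison; the sketch below assumes, purely for illustration, that each test case's output is captured as a string.

```cpp
#include <string>
#include <vector>

// A test case "fails" on a mutant when the mutant's output differs from the
// original program's output for that test case.
std::vector<bool> executionResults(const std::vector<std::string>& originalOutputs,
                                   const std::vector<std::string>& mutantOutputs) {
    std::vector<bool> failed(originalOutputs.size());
    for (std::size_t t = 0; t < originalOutputs.size(); ++t)
        failed[t] = (mutantOutputs[t] != originalOutputs[t]);  // differs => failing
    return failed;
}
```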

Fig. 3.
figure 3

Example of a coverage matrix and execution results produced for a mutant.

3.2 Phase 2: Ranking Metric Generation

According to Fig. 4, phase 2 comprises three different steps. Following these steps, the n given ranking metrics are combined to produce a new ranking metric for p.

Step 1: Generating Ranked Ballots. At this step, for each of the m mutants, the effectiveness of the n ranking metrics is computed, using the coverage matrices and execution results produced in the previous phase. By doing so, m ranked ballots are produced, each of which lists the n ranking metrics according to their effectiveness at locating the fault within the corresponding mutant. Table 1 illustrates an example of 45 ranked ballots produced for a program, with \(m=45\) and \(n=5\). Columns 2 and 4 show the ranked ballots, and columns 1 and 3 indicate the number of instances of each ballot. For example, according to this table, the sequence "T\(_1\) > T\(_2\) > T\(_3\) > T\(_4\) > T\(_5\)" has been produced as the ranked ballot for five different mutants. This ballot states that T\(_1\) and T\(_5\) are, respectively, the most and the least effective ranking metrics at locating the faults within these five mutants.
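The sketch below illustrates how one ranked ballot could be produced for a single mutant. Since the exact per-mutant effectiveness measure is not spelled out here, the sketch assumes a plausible one: the number of entities ranked at least as suspicious as the mutated statement (fewer is better); the function and variable names are ours.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// scoresPerMetric[i][e]: suspiciousness that metric i assigns to entity e.
std::vector<int> rankedBallot(const std::vector<std::vector<double>>& scoresPerMetric,
                              std::size_t mutatedEntity) {
    const std::size_t n = scoresPerMetric.size();
    std::vector<std::size_t> effort(n);  // entities ranked at least as suspicious as the fault
    for (std::size_t i = 0; i < n; ++i) {
        const auto& s = scoresPerMetric[i];
        effort[i] = static_cast<std::size_t>(std::count_if(
            s.begin(), s.end(), [&](double v) { return v >= s[mutatedEntity]; }));
    }
    std::vector<int> ballot(n);
    std::iota(ballot.begin(), ballot.end(), 0);
    // Metrics that place the mutated statement nearer the top of their ranking come first.
    std::sort(ballot.begin(), ballot.end(),
              [&](int a, int b) { return effort[a] < effort[b]; });
    return ballot;  // e.g., {1, 2, 0, 3, 4} encodes "T2 > T3 > T1 > T4 > T5"
}
```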

Step 2: Selecting Ranking Metrics. At this step, the ranked ballots produced at the previous step are aggregated into an ordered list of ranking metrics, using one of the two preferential voting systems Instant Run-Off Voting [4] and Kemeny-Young [11]. For instance, applying Instant Run-Off Voting to Table 1 produces "T\(_2\) > T\(_3\) > T\(_1\) > T\(_4\) > T\(_5\)", and using Kemeny-Young results in "T\(_4\) > T\(_3\) > T\(_1\) > T\(_5\) > T\(_2\)". Then, as the output of this step, the k best ranking metrics are selected from the resulting list; we refer to this selection as B. For example, for \(k=4\), using Instant Run-Off Voting and Kemeny-Young results in B = [T\(_1\),T\(_2\),T\(_3\),T\(_4\)] and B = [T\(_1\),T\(_3\),T\(_4\),T\(_5\)], respectively.
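As an illustration of this aggregation step, the following C++ sketch orders the metrics with Instant Run-Off Voting by repeatedly eliminating the metric with the fewest first-place votes; ties are broken arbitrarily by the lowest index, and Kemeny-Young would be plugged in at the same point with a different aggregation rule.

```cpp
#include <algorithm>
#include <vector>

// Each ballot lists metric indices from most to least preferred.
std::vector<int> instantRunoffOrder(const std::vector<std::vector<int>>& ballots, int n) {
    std::vector<bool> eliminated(n, false);
    std::vector<int> order;                     // filled from worst to best
    for (int round = 0; round < n; ++round) {
        std::vector<int> firstChoiceVotes(n, 0);
        for (const auto& ballot : ballots)
            for (int c : ballot)
                if (!eliminated[c]) { ++firstChoiceVotes[c]; break; }
        int loser = -1;
        for (int c = 0; c < n; ++c)
            if (!eliminated[c] && (loser < 0 || firstChoiceVotes[c] < firstChoiceVotes[loser]))
                loser = c;
        eliminated[loser] = true;
        order.push_back(loser);
    }
    std::reverse(order.begin(), order.end());   // best metric first
    return order;
}
```

The selection B then simply consists of the first k entries of the returned order.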

Fig. 4.
figure 4

Details of phase 2

Table 1. Example of ranked ballots produced at step 1 of phase 2

Step 3: Combining Ranking Metrics. As illustrated in Fig. 4, this step receives B, which contains the k best ranking metrics selected at the second step, along with the ranked ballots produced at the first step. Then, using Eq. (2), a new ranking metric is generated, which is the output of the proposed approach.

$$\begin{aligned} NewScore(e_j) = \sum _{i=1}^{k}{w_{B_{i}} \times NormScore_{B_{i}}(e_j)} \end{aligned}$$
(2)

In Eq. (2), \(e_j\) represents program entities in p, for which suspiciousness scores are computed; the term \(w_{B_{i}}\) is the weight computed for ranking metric \(B_{i}\) according to its effectiveness at locating faults in p; the term \(NormScore_{B_{i}}(e_j)\) is the normalized suspiciousness score computed by ranking metric \(B_{i}\) for \(e_j\), employing the feature scaling method presented in Eq. (3).

$$\begin{aligned} NormScore_{T}(e_j) = \frac{Score_{T}(e_j) - min_{T}}{max_{T} - min_{T}} \end{aligned}$$
(3)

Equation (3) standardizes the range of suspiciousness scores computed by a given ranking metric T by scaling them to the range [0, 1]. The term \(Score_{T}(e_j)\) is the suspiciousness score computed by T for program entity \(e_j\); the terms \(min_{T}\) and \(max_{T}\) are, respectively, the minimum and maximum suspiciousness scores computed by T over all of the program entities in p.
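Putting Eqs. (2) and (3) together, a minimal sketch of the combination step might look as follows; the container layout is an assumption, and the weights \(w_{B_i}\) are supplied by Eq. (4), described next.

```cpp
#include <algorithm>
#include <vector>

// Eq. (3): min-max normalization of one metric's scores to [0, 1].
std::vector<double> normalize(std::vector<double> scores) {
    auto [mn, mx] = std::minmax_element(scores.begin(), scores.end());
    double lo = *mn, hi = *mx;
    for (double& s : scores) s = (hi > lo) ? (s - lo) / (hi - lo) : 0.0;
    return scores;
}

// Eq. (2): weighted sum of the normalized scores of the k selected metrics.
// scoresPerMetric[i][e]: score of selected metric B_i for entity e; w[i]: its weight.
std::vector<double> newScore(const std::vector<std::vector<double>>& scoresPerMetric,
                             const std::vector<double>& w) {
    std::vector<double> combined(scoresPerMetric[0].size(), 0.0);
    for (std::size_t i = 0; i < scoresPerMetric.size(); ++i) {
        std::vector<double> norm = normalize(scoresPerMetric[i]);
        for (std::size_t e = 0; e < combined.size(); ++e)
            combined[e] += w[i] * norm[e];
    }
    return combined;
}
```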

The terms \(w_{B_{i}}\) (\(1 \le i \le k\)) in Eq. (2) are determined by employing one of the two preferential voting systems Condorcet [2] and Schulze [18], using the ranked ballots produced at the first step. When Condorcet is used, first, the Condorcet pairwise matrix of the given ranked ballots is produced, which indicates, for each pair of ranking metrics, the number of times one has been more effective than the other. Figure 5a shows an example of a pairwise matrix computed for the ballots in Table 1. Afterward, the terms \(w_{B_{i}}\) (\(1 \le i \le k\)) are calculated using Eq. (4), where M is the Condorcet pairwise matrix. For instance, considering Fig. 5a as the pairwise matrix, \(w_{B_1}\), \(w_{B_2}\), \(w_{B_3}\), and \(w_{B_4}\) are \(\frac{68}{270}=0.251\), \(\frac{72}{270}=0.266\), \(\frac{59}{270}=0.218\), and \(\frac{71}{270}=0.262\), respectively.

$$\begin{aligned} w_{B_{i}} = \frac{1}{\sum _{a=1}^{k}\sum _{b=1, b \ne a}^{k} M[a,b]} \sum _{j=1, j \ne i}^{k}{M[i, j]} \end{aligned}$$
(4)
Fig. 5.
figure 5

Example of a pairwise and strength matrix produced for the lists in Table 1. (a) Pairwise matrix; (b) Strength matrix.

When Schulze is used, first, the Schulze strength matrix is computed for the given ranked ballots. This matrix contains the strengths of the strongest paths for each pair of ranking metrics; in other words, it indicates how effectively a ranking metric has performed compared to the other ranking metrics (for further details on strongest paths, refer to [18]). Then, the weights are computed employing Eq. (4), where M is the Schulze strength matrix. Figure 5b shows an example of a strength matrix calculated for the ballots in Table 1. Using this matrix, \(w_{B_1}\), \(w_{B_2}\), \(w_{B_3}\), and \(w_{B_4}\) are \(\frac{76}{305}=0.249\), \(\frac{78}{305}=0.255\), \(\frac{74}{305}=0.242\), and \(\frac{77}{305}=0.252\), respectively.
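For illustration, the sketch below builds a Condorcet pairwise matrix from ranked ballots (restricted to the k selected metrics) and applies Eq. (4); in the Schulze variant, the same weight function would instead receive the strength matrix, whose strongest-path computation is omitted here.

```cpp
#include <vector>

// M[i][j] counts ballots that rank selected metric i above selected metric j.
std::vector<std::vector<int>> pairwiseMatrix(const std::vector<std::vector<int>>& ballots,
                                             const std::vector<int>& selected) {
    const std::size_t k = selected.size();
    std::vector<std::vector<int>> M(k, std::vector<int>(k, 0));
    for (const auto& ballot : ballots) {
        std::vector<int> pos(ballot.size());
        for (std::size_t r = 0; r < ballot.size(); ++r) pos[ballot[r]] = r;  // rank of each metric
        for (std::size_t i = 0; i < k; ++i)
            for (std::size_t j = 0; j < k; ++j)
                if (i != j && pos[selected[i]] < pos[selected[j]]) ++M[i][j];
    }
    return M;
}

// Eq. (4): each weight is the metric's row sum divided by the sum of all off-diagonal entries.
std::vector<double> eq4Weights(const std::vector<std::vector<int>>& M) {
    const std::size_t k = M.size();
    double total = 0.0;
    std::vector<double> rowSums(k, 0.0);
    for (std::size_t i = 0; i < k; ++i)
        for (std::size_t j = 0; j < k; ++j)
            if (i != j) { rowSums[i] += M[i][j]; total += M[i][j]; }
    for (double& w : rowSums) w /= total;   // weights sum to 1
    return rowSums;
}
```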

4 Experiments

In this section, we present the evaluation of the proposed approach. Section 4.1 reviews the experiment setup; Sect. 4.2 provides the results of the experiments; Sect. 4.3 presents the discussion; Sect. 4.4 explains the threats to the validity of the experimental results.

4.1 Experiment Setup

Subject Programs. The proposed approach is evaluated on eight popular programs, the seven programs of the Siemens suite along with the Space program, provided by the Software-artifact Infrastructure Repository (SIR) [7], which have been employed in various fault localization studies. Table 2 shows the details of these programs. The first row shows each program's size. Row two indicates the number of faulty versions used in our experiments, each of which contains a single bug. Row three shows the size of each program's test suite, and row four gives the number of mutants generated for each program, which is the parameter m of the proposed approach. During the experiments, we made sure that the generated mutants were different from their corresponding faulty versions by analyzing them manually.

Table 2. Subject programs.

Evaluation Metrics. To evaluate the effectiveness of the proposed approach, we used three evaluation metrics, which are defined as follows:

1. Exam: The Exam score [23] indicates the percentage of code that needs to be inspected to locate the fault within a program. This metric compares AFL techniques on a single program, whereas our experiments involve 154 faulty versions of eight different programs (see Table 2). Therefore, for any ranking metric T, we computed T's Exam score on every faulty version and then reported the mean of these 154 scores as the Exam score of T. A lower value of this metric indicates higher effectiveness.

2. Proportion of Located Faults: This evaluation metric indicates the percentage of faults located when a specific percentage of program entities is inspected. To compute this metric for a ranking metric T, the top \(10\%\) of the program entities in each faulty version were inspected, and the percentage of located faults was reported. A higher value of this metric indicates higher effectiveness.

3. TOP-N: This metric is similar to the previous one, with the only difference that a certain number of program entities, instead of a specific percentage, is inspected. Considering that, regardless of program size, developers usually inspect only a few of the top-ranked program entities presented by AFL techniques [15], this metric is important in practice. In our experiments, to compute this metric for a ranking metric T, the top ten program entities in each faulty version were examined, and the number of located faults was reported as T's TOP-10 score. Note that a higher value of this metric indicates higher effectiveness. A small sketch of the per-version Exam and TOP-N checks is given after this list.
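The sketch below shows one plausible way to obtain the Exam score and the TOP-N membership for a single faulty version; it ignores ties among equally suspicious entities and per-program aggregation, which the actual experiments would need to handle.

```cpp
#include <vector>

struct FaultyVersionResult { double exam; bool inTopN; };

// rankedEntities lists entity indices from most to least suspicious;
// faultyEntity is the index of the known faulty entity.
FaultyVersionResult evaluate(const std::vector<int>& rankedEntities,
                             int faultyEntity, int N) {
    for (std::size_t pos = 0; pos < rankedEntities.size(); ++pos)
        if (rankedEntities[pos] == faultyEntity)
            return { 100.0 * (pos + 1) / rankedEntities.size(),   // Exam score (%)
                     pos + 1 <= static_cast<std::size_t>(N) };    // counted by TOP-N
    return { 100.0, false };   // fault not ranked (should not happen)
}
```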

Configuration and Implementation. We utilized the 40 state-of-the-art ranking metrics presented in Table 3 as the third input to the proposed approach; thus, in our experiments, n was always 40. As stated in Sect. 3, for the task of selecting the k best ranking metrics, the proposed approach can use either Instant Run-Off Voting or Kemeny-Young, and to combine these ranking metrics, it may employ either Condorcet or Schulze. As a result, four different instances of the proposed approach were implemented, each of which utilizes one of the two possible preferential voting systems for each of these two tasks. These four instances are "Instant Run-Off Voting + Condorcet," "Instant Run-Off Voting + Schulze," "Kemeny-Young + Condorcet," and "Kemeny-Young + Schulze," which we refer to as IRV-C, IRV-S, KY-C, and KY-S, respectively.

Table 3. Ranking metrics used in the experiments.

All four instances of the proposed approach were implemented in C++, and the experiments were conducted on a virtual machine with an Intel Core i5 CPU at 1.60 GHz, 2 GB of RAM, and the 64-bit version of Ubuntu 16.04. To instrument code and retrieve runtime data, we employed Gcov [8], the GNU coverage testing tool, which considers code lines as program entities.

4.2 Results

In this section, we present the results of comparing the four instances of the proposed approach, namely KY-S, IRV-S, KY-C, and IRV-C, with nine state-of-the-art ranking metrics: Naish2 [24], Zoltar [23], GP13 [25], Ochiai [1], Tarantula [10], Jaccard [23], GP03 [25], GP02 [25], and Wong [24]. Figure 6a shows the results of the effectiveness comparison with respect to the first evaluation metric presented in Sect. 4.1. According to the results, the four instances of the proposed approach perform better than the rest of the ranking metrics. Also, KY-S shows better effectiveness than the other three instances of the proposed approach. Furthermore, the results indicate that fault localization effectiveness can be increased by up to 62% using KY-S.

Fig. 6.
figure 6

Experimental results. (a) Exam scores; (b) proportion of located faults; (c) proportion of faults located with respect to inspected program entities; (d) TOP-10 scores.

Figure 6b compares the effectiveness of the proposed approach with the other ranking metrics with respect to the second evaluation metric presented in Sect. 4.1. The purpose of this experiment is to evaluate the proposed approach when only a small portion of program entities (in our case, \(10\%\) of them) is examined, which is an important perspective since developers tend not to examine every program entity presented by AFL techniques. Based on the results, the proposed approach has the best effectiveness compared to the other ranking metrics, and again, KY-S performs better than the other three instances of the proposed approach. To further investigate the effectiveness of the proposed approach, we also compared KY-S and KY-C with Naish2, Ochiai, and GP13 while the portion of inspected program entities varied from \(20\%\) to \(50\%\); the results are illustrated in Fig. 6c. As can be seen, regardless of how many program entities are inspected, KY-S is always superior.

Figure 6d shows the results of comparing the proposed approach with the other ranking metrics regarding the third evaluation metric presented in Sect. 4.1, which indicates the number of faults located by each ranking metric when only ten program entities are inspected. According to the results, KY-S is superior to the other ranking metrics.

4.3 Discussions

According to the experimental results presented in Sect. 4.2, the preferential voting system used at step 3 of phase 2, which combines the best ranking metrics, has a significant impact on the effectiveness of the generated ranking metric. Considering the experimental results, employing the Schulze method results in ranking metrics that are more effective than those produced by the Condorcet method. We believe that this advantage is rooted in the ability of the Schulze method to consider the transitive relations between the ranking metrics in the ranked ballots produced at step 1 of phase 2. In other words, compared to Condorcet, the Schulze method can more appropriately determine the effectiveness of different ranking metrics based on the given ranked ballots.

Another important factor for generating an effective ranking metric is the preferential voting system employed at step 2 of phase 2, which selects the k best ranking metrics among the n. To investigate the impact of this factor, we removed this step by setting k to n and then repeated the experiments. By doing so, the Exam score of KY-S grew from \(21.36\%\) to \(31.54\%\) (which indicates a decline in its effectiveness), and the effectiveness of KY-S with respect to the second and the third evaluation metrics presented in Sect. 4.1 decreased from \(42.3\%\) to \(16.8\%\) and from 63 bugs to 28 bugs, respectively. The parameter k also has a significant influence on the effectiveness of the proposed approach. To investigate the impact of this parameter, we repeated the experiments setting k to 5, 20, and 40. The results of this experiment are shown in Table 4, according to which KY-S has the best effectiveness for \(k=5\).

Table 4. Sensitivity analysis of the parameter k for KY-S.

4.4 Threats to Validity

The most critical threat to the validity of our experimental results is whether they generalize to other programs. We have evaluated the proposed approach using the Siemens suite, which comprises relatively small programs. However, these programs have been employed by many researchers in the field, and we tried to mitigate this issue by also using 35 faulty versions of the Space program, which is considerably larger than the programs in the Siemens suite.

In addition, the type of mutants generated at step 1 of phase 1 and the number of ranking metrics selected at step 2 of phase 2 (the parameter k) can also affect the experimental results and are therefore considered additional threats to the validity of our results.

5 Conclusions

In this paper, we presented an approach that generates SFL ranking metrics for programs by combining various existing ranking metrics. We implemented four instances of the proposed approach based on the preferential voting systems used for two different tasks within the approach. All four instances were evaluated using the Siemens suite and the Space program and compared with nine state-of-the-art ranking metrics. According to the results, using Kemeny-Young to select the best ranking metrics and employing Schulze to combine them leads to better ranking metrics than the other three instances of the proposed approach. Also, all four instances generate ranking metrics that are more effective than the baselines with respect to evaluation metrics such as the Exam score and TOP-N.

In this work, we used four preferential voting systems; there are many other such systems whose impact on our approach we plan to investigate. Also, to reduce the threats to the validity of our results, we intend to evaluate our approach on object-oriented, real-world, and large-sized programs. Finally, since each subject program used in our experiments contained only one bug, we plan to evaluate our approach on programs with multiple bugs as well.