We now present our experimental results. We first look at the dataset and RGT tests we have collected.
Patches
We have collected a total of 638 patches from 14 APR systems. All pass the sanity checks described in Section 3.6. Table 4 presents this dataset of patches for Defects4J. The first column specifies the dataset category and the second column gives the name of the automatic repair system. The number of patches collected per Defects4J project is given in the third to seventh columns, and the totals are given in the last column. There are 257 patches previously claimed as correct; they form Dcorrect. There are 381 patches that were considered overfitting by manual analysis in previous research; they form Doverfitting. We found that 160/257 patches from Dcorrect are syntactically equivalent to the human-written patches: the exact same code modulo formatting and comments. The remaining 97/257 patches are semantically equivalent to human-written patches. Overall, the 638 patches cover 117/357 different bugs of Defects4J. To our knowledge, this is the largest ever APR patch dataset with manual analysis labels by the researchers. The most related dataset is from Xin and Reiss (2017a), containing 89 patches from 4 repair tools, and the one from Xiong et al. (2018), containing 139 patches from 5 repair tools. Our dataset is 4 times larger than the latter.
Table 4 Dataset of collected Defects4J patches

Tests
Evosuite and Randoop have been invoked 30 times with random seeds for each of the 117 bugs covered by the patch dataset. In total, each tool has been invoked 117 bugs × 30 seeds = 3510 times.
Coverage
To better understand the generated RGT tests, we compute their coverage over the buggy classes. Figure 1 illustrates the code coverage distribution on the buggy classes by 3510 generated test suites in the five Defects4J projects. We use JaCoCo to measure the branch coverage on the buggy classes. The orange legend shows the code coverage distribution of the RGTEvosuite2019 test suites while the blue one represents the coverage of the RGTRandoop2019 test suites. For example, in the Chart project, the code coverage ratios of RGTEvosuite2019 are mostly over 80% while the coverage of RGTRandoop2019 is uniformly distributed between 0% and 100%. Therefore, the code coverage achieved by RGTEvosuite2019 is considered higher than that of RGTRandoop2019.
Over all five projects, we observe that the code coverage by the RGTEvosuite2019 test suites is higher than that by the RGTRandoop2019 test suites. For the Chart, Lang, Math and Time projects, RGTEvosuite2019 test suites achieve high code coverage on the buggy classes: the 90th percentile is higher than 80%. On the contrary, the code coverage by RGTRandoop2019 is clearly lower. The reasons are as follows. First, RGTEvosuite2019 suffers from fewer test generation failures: among the 3510 random test suite generations, Evosuite fails to produce RGT tests in 31 runs while Randoop fails in 1080 runs, which corresponds to failure rates of 0.9% and 30.8% respectively. Second, Evosuite applies a genetic algorithm to evolve test cases that maximize code coverage, an approach that has been consistently shown to achieve higher coverage than Randoop (Shamshiri et al. 2015; Kifetew et al. 2019).
Notably, the code coverage on the Closure project is significantly lower than for the other four projects, both for RGTEvosuite2019 and RGTRandoop2019 test suites. We found two reasons that explain this: 1) the Closure project requires test data with complex data structures, which is a known hard challenge for automatic test generators; 2) the Closure project has a majority of private methods, which are not well handled by the considered test generation tools.
Flaky Tests
Using a strict sanity check, we discard 2.2% and 2.4% of flaky tests from RGTEvosuite2019 and RGTRandoop2019 respectively. As a result, we obtain a total of 4,477,707 stable RGT tests: 199,871 in RGTEvosuite2019 and 4,277,836 in RGTRandoop2019.
We also collect the RGT tests generated by previous research, 15,136,567 tests in total: 141,170 in RGTEvosuiteASE15 (Shamshiri et al. 2015), 14,932,884 in RGTRandoopASE15 (Shamshiri et al. 2015), and 62,513 in RGTEvosuiteEMSE18 (Yu et al. 2018). By conducting a sanity check on those tests, we discard 2.7%, 4.7% and 1.1% of flaky tests respectively. Compared with the newly generated RGT tests, more flaky tests exist in the previously generated tests due to external factors such as version, date and time (Shamshiri et al. 2015) (#F9). To our knowledge, this is the largest ever curated dataset of generated tests for Defects4J.
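The flakiness check is, in essence, a repeated re-execution of every generated test on the unmodified ground-truth program, keeping only tests that pass in every run. The sketch below shows one way such a filter could be written with the JUnit 4 runner; it is an illustrative sketch, not the exact harness used in our pipeline, and the class name is hypothetical.

```java
import org.junit.runner.JUnitCore;
import org.junit.runner.Request;
import org.junit.runner.Result;

public class FlakinessFilter {
    /**
     * Re-runs a single generated test several times against the ground-truth program.
     * A test is kept as a stable RGT test only if every run passes; any failing run on the
     * correct program marks it as flaky and it is discarded (illustrative sketch only).
     */
    public static boolean isStable(Class<?> testClass, String testMethod, int reruns) {
        for (int i = 0; i < reruns; i++) {
            Result result = new JUnitCore().run(Request.method(testClass, testMethod));
            if (!result.wasSuccessful()) {
                return false; // failed at least once on the correct program: flaky
            }
        }
        return true; // consistently green: keep
    }
}
```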
Result of RQ1: RGT Patch Assessment Contradicts Previously Done Manual Analysis
We have executed 30 runs of RGT tests over the 257 patches from Dcorrect. For the 160 patches syntactically equivalent to the ground truth patches, the results are consistent: no RGT test fails. For the remaining 97 patches, the assessment of 16 patches contradicts the previously reported manual analysis (at least one RGT test fails on a patch considered correct in previous research). Of these 16 cases, 10 are true positives and 6 are false positives according to our manual analysis. Due to the potential risk of false negatives with RGT tests, we also manually analyze the remaining 81 semantically equivalent patches, which do not make any RGT test fail; the result is discussed in Section 7.1.
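The decision rule applied here can be summarized as: run every stable RGT test (generated from the ground-truth program) against the patched program, and flag the patch as overfitting as soon as one test fails. A minimal sketch of this rule is given below, under the assumption that the RGT test classes have been loaded with the patched program on the classpath; the class and method names are hypothetical.

```java
import java.util.List;
import org.junit.runner.JUnitCore;
import org.junit.runner.Result;

public class RgtAssessment {
    /**
     * Classifies a patched program as overfitting if at least one RGT test fails on it.
     * The test classes are assumed to have been generated from the ground-truth program
     * (illustrative sketch of the decision rule, not the authors' exact harness).
     */
    public static boolean isOverfitting(List<Class<?>> rgtTestClasses) {
        for (Class<?> testClass : rgtTestClasses) {
            Result result = new JUnitCore().run(testClass);
            if (result.getFailureCount() > 0) {
                return true; // a behavioral difference w.r.t. the ground truth was exposed
            }
        }
        return false; // no RGT test fails: no evidence of overfitting
    }
}
```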
The ten true positive cases are presented in Table 5. The first column gives the patch name, with the number of failing tests from each RGT category in the second and third columns. The fourth column shows the category of behavioral difference defined in Table 3. The last column gives the result of the conversation we had with the original authors about the actual correctness of the patch. For instance, the misclassified patch patch1-Lang-35-ACS is identified as overfitting by 10 tests from RGTEvosuite2019, and the behavioral difference belongs to category Dexc2: no exception is thrown by the ground truth program but exceptions are raised during the patched program execution. This result has been confirmed by the original authors.
Table 5 Misclassified patches found by RGT. The original authors agreed with the analysis error

RGTEvosuite2019 and RGTRandoop2019 identify 10 and 2 misclassified patches respectively. This means that Evosuite is better than Randoop on this task. We now look at the behavioral differences of those 10 misclassified patches, which are exposed by four categories of behavioral differences. This shows that the diversity of behavioral differences is important for RGT assessment.
Notably, the 10 misclassified patches come from 6/14 repair systems, which shows that misclassification in manual patch assessment is a common problem (#F1) and highlights the limitation of manual analysis of patch correctness. Overall, 10.3% (10/97) of the previously claimed correct, semantically equivalent patches are actually overfitting, which shows that manual assessment of semantically equivalent APR patches is hard and error-prone. Previous research (Wang et al. 2019) reported that over a quarter of correct APR patches are actually semantic patches, which warns us to pay careful attention when assessing their correctness. All 10 patches have been confirmed as misclassified by the original authors. Five researchers gave us feedback that the inputs sampled by the RGT technique were under-considered or missed in their previous manual assessment. RGT assessment samples corner-case inputs that assist researchers in manual assessment (#F2).
We now present a case study to illustrate how those patches are assessed by RGT tests.
Case Study of Lang-43
The CapGen repair tool generates three patches for bug Lang-43. Those three patches are all composed of a single inserted statement next(pos), but the insertion happens at three different positions in the program. Among them, one patch is identical to the ground truth patch (Listing 2a): it inserts the statement in an if-block. Patches patch1-Lang-43-Capgen (Listing 2b) and patch2-Lang-43-Capgen (Listing 2c) insert the correct statement but at different locations, respectively 1 line and 2 lines before the correct position of the ground truth patch. Both patches are classified as overfitting by RGT because 10 sampled inputs result in a heap space error; with the same inputs, the ground truth patch executes without error, which corresponds to the category Derror in Table 3. The original authors have confirmed the misclassification of these two patches. This case study illustrates the difficulty of APR patch assessment: a heap memory error is unlikely to be detected by only reading the source code of the patch.
Result of RQ2: False Positives of RGT Assessment
Per the protocol described in Section 3.5, we identify false positives of RGT assessment by manual analysis of the patches where at least one RGT test fails. Over the 257 patches from Dcorrect, RGT patch assessment yields 6 false positives. This means the false positive rate of RGT assessment is 6/257 = 2.3% (#F3).
We now discuss the 6 cases that are falsely classified as overfitting by RGT assessment. They are classified into four categories according to their root causes, given in the first column of Table 6. The second column presents the patch name, the third column shows the category of behavioral difference as defined in Table 3, the fourth column gives the RGT test set that contains the failing test, and the last column gives a short explanation.
Table 6 False positive cases by RGT assessment
PRECOND
The patch patch1-Math-73-Arja is falsely identified as overfitting because RGT samples inputs that violate implicit preconditions of the program (#F4). Listing 3 gives the ground truth patch, the Arja patch, and the RGT test that differentiates the behavior between the patches. In Listing 3c, we can see that RGT samples a negative number, −1397.1657558041848, to update the variable functionValueAccuracy. However, the value of functionValueAccuracy is used to compare absolute values (see the first three lines of Listing 3a). It is meaningless to compare absolute values against a negative number: an implicit precondition is that functionValueAccuracy must be positive, but there is no way for the test generator to infer this precondition.
This case study illustrates that RGT patch assessment may create false positives because the underlying test generation technique is not aware of preconditions or constraints on inputs. This confirms the known challenge that Evosuite may sample undesired inputs (Fraser and Arcuri 2013). On the contrary, human developers are able to guess the range of acceptable values based on variable names and common knowledge. This suggests that better support for precondition handling in test generation would help increase the reliability of RGT patch assessment.
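One lightweight way to encode such a precondition, if it were known, is a JUnit assumption that discards generated inputs violating it. The snippet below is a hypothetical illustration of this idea around the functionValueAccuracy example; it is not a feature of the test generators we used, and the test body is elided.

```java
import static org.junit.Assume.assumeTrue;
import org.junit.Test;

public class PreconditionAwareTest {
    @Test
    public void solveWithSampledAccuracy() {
        double sampledAccuracy = -1397.1657558041848; // value sampled by the generated test

        // Hypothetical precondition guard: a threshold used to compare absolute values only
        // makes sense when it is positive, so the test is skipped for such inputs instead of
        // reporting a spurious behavioral difference.
        assumeTrue(sampledAccuracy > 0);

        // ... the rest of the generated test would exercise the solver here ...
    }
}
```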
EXCEPTION
Both patch1-Lang-7-SimFix and patch1-Lang-7-ACS throw the same exception as the one expected in the ground truth program: fail (“Expecting exception: NumberFormatException”).
However, these two patches are still assessed as overfitting because the exceptions are thrown from different functions than in the ground truth program. Per the definition of behavioral difference \(D_{exc\_loc}\) in Table 3, exceptions thrown by different functions justify an overfitting assessment.
RGT assessment yields two false positives when verifying the position from which exceptions are thrown. This suggests that category \(D_{exc\_loc}\) may be skipped for RGT assessment, which is easy to do by configuring the corresponding options in the test generators.
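These false positives come from the exception-source check that Evosuite-generated tests typically emit through the verifyException helper of its runtime. The sketch below shows our rendering of the general shape of such a test; the input string is a placeholder, and the class under test and package are assumptions for the Lang-7 example. Relaxing or removing this check amounts to skipping \(D_{exc\_loc}\).

```java
import static org.evosuite.runtime.EvoAssertions.verifyException;
import static org.junit.Assert.fail;

import org.apache.commons.lang3.math.NumberUtils;
import org.junit.Test;

public class ExceptionLocationExample {
    @Test
    public void expectsNumberFormatExceptionFromASpecificClass() {
        try {
            NumberUtils.createNumber("not-a-number"); // placeholder input, not the generated one
            fail("Expecting exception: NumberFormatException");
        } catch (NumberFormatException e) {
            // This extra check pins the class that threw the exception. A patch that throws the
            // same exception from a different method or class fails here, producing the
            // D_exc_loc false positives discussed above; dropping the check avoids them.
            verifyException("org.apache.commons.lang3.math.NumberUtils", e);
        }
    }
}
```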
OPTIM
The patch patch1-Math-93-ACS is assessed as overfitting by RGTRandoop2019 tests because they detect a behavioral difference of category Dassert. Bug Math-93 deals with computing a value based on logarithms. The fix from ACS uses \(\ln n!\), which is mathematically equivalent to the human-written solution \(\sum_{k=1}^{n} \ln k\), so their behavior should be semantically equivalent. However, the human-written patch introduces an optimization for calculating \(\sum_{k=1}^{n} \ln k\) when n is less than 20, by returning a precalculated value. For instance, one of the sampled inputs is n = 10: the expected value from the ground truth patch is 15.104412573075516d (looked up in a list of hard-coded results), while the actual value of patch1-Math-93-ACS is 15.104412573075518d. Thus an assertion failure is triggered and RGT classifies this patch as overfitting because of this behavioral difference in output value. This false positive would have been avoided if no optimization had been introduced in the human-written patch taken as ground truth.
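To make the rounding issue concrete, the following minimal sketch computes ln(10!) = ln(3628800) in the two mathematically equivalent ways. Depending on rounding, the two results may differ in the last decimal place, as with the two values reported above, which is enough to break an exact equality assertion. The class name is illustrative.

```java
public class FactorialLogUlp {
    public static void main(String[] args) {
        // First way: accumulate ln(k) for k = 2..10, as one patch variant does.
        double viaSum = 0.0;
        for (int k = 2; k <= 10; k++) {
            viaSum += Math.log(k);
        }

        // Second way: take the logarithm of the precomputed factorial value 10! = 3628800.
        double viaFactorial = Math.log(3628800.0);

        // The two printed values may differ in the last decimal place due to floating-point
        // rounding; an exact assertEquals on doubles then fails even though both computations
        // are mathematically equivalent.
        System.out.println(viaSum);
        System.out.println(viaFactorial);
    }
}
```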
Our finding warns that reproducible bug benchmark work (e.g., Madeiral et al. 2019; Benton et al. 2019) should pay additional attention to distinguishing optimization code from repair code in the human-written (reference) patches (#F5).
IMPERFECT
Two cases are falsely classified as overfitting due to the imperfection of the human-written patches. They both fall into behavioral difference category Dexc2: no exception is thrown by the ground truth program while exceptions are thrown by the patched program. The patch patch1-Chart-5-Arja throws a null pointer exception because the variable item is null when executing the RGT tests. The code snippet is given at line 595 of Listing 4. The human-written patch returns earlier, before executing the problematic code snippet, while the fix of patch1-Chart-5-Arja happens later in the execution flow. Hence, an exception is thrown by patch1-Chart-5-Arja but not by the human-written patch for the illegal input. Another patch, patch1-Math-86-Arja, can actually be considered better than the human-written patch because it is able to signal the illegal value NaN by throwing an exception while the human-written patch silently ignores the error (#F5).
Is the human-written patch a perfect ground truth? RGT and related techniques are based on the assumption that the human-written patches are fully correct. Thus, when a test case differentiates the behavior between an APR patch and a human-written patch, the APR patch is considered overfitting. The experimental results we have presented show that human-written patches are not perfect. Our findings confirm that the human patch itself may be problematic (Gu et al. 2010; Yin et al. 2011). However, we are the first to reveal how the imperfection of human patches impacts automatic patch correctness assessment. Beyond that, as shown in this section, optimizations introduced in the same commit as the bug fix, as well as other limitations, influence the identification of overfitting patches by RGT assessment.
Result of RQ3: Effectiveness of RGT Assessment Compared to DiffTGen
The Effectiveness of RGT Assessment
We have executed 30 runs of DiffTGen over Dcorrect. DiffTGen identifies 2 patches as overfitting, which were both misclassified as correct (patch2-Lang-51-Jaid and patch1-Math-73-JGenProg2015). Recall that RGT patch assessment identifies in total 10 misclassified patches, including the 2 mentioned patches found by DiffTGen. This shows that RGT is more effective than DiffTGen.
Per the core algorithm of DiffTGen and its implementation, DiffTGen can only handle the Dassert category of behavioral difference (value difference in an assertion). However, DiffTGen fails to identify two other misclassified patches of the Dassert category that are found by RGT: patch1-Lang-58-Nopol2015 and patch1-Lang-41-SimFix. This is because DiffTGen fails to sample an input that differentiates the instrumented buggy and human-written patched programs, whereas our RGT assessment does not require such instrumented programs.
Further, we have performed 30 executions of RGT tests and DiffTGen over all 381 patches from Doverfitting. RGTEvosuite2019 yields 7,923 test failures and RGTRandoop2019 yields 65,819 test failures. Specifically, RGTEvosuite2019 identifies 248 overfitting patches and RGTRandoop2019 identifies 118 overfitting patches; together they identify 274 distinct overfitting patches (#F6), the union being smaller than 248 + 118 = 366 because 92 patches are detected by both tools. DiffTGen identifies 143/381 overfitting patches. Our experiment provides two implications: (1) RGT patch assessment improves over DiffTGen, and (2) for RGT patch assessment, Evosuite outperforms Randoop in sampling inputs that differentiate program behaviors, detecting 2.1 times as many overfitting patches (248 vs. 118), but considering the two techniques together maximizes the effectiveness of overfitting patch identification (#F7).
Figure 2 shows the number of overfitting patches in the Doverfitting dataset identified by RGT assessment and DiffTGen. RGT gives better results than DiffTGen for all Defects4J projects. An outlier case is the Closure project, for which the assessment effectiveness is low, both for RGT (9/37) and for DiffTGen (0/37). This is consistent with the results shown in Fig. 1: the RGT tests generated for the Closure project have the lowest coverage. As a result, the sampled RGT tests are less effective at exposing behavioral differences in the Closure project.
On Patch Assessment and Code Coverage
Figure 3 compares the code coverage obtained by the RGT test suites that detect overfitting patches against the ones that do not. It shows that the test suites that detect overfitting patches have higher code coverage: the average code coverage is 84% for the test suites that detect overfitting patches and 51% for the rest. In addition, we conduct a Mann-Whitney U test (Mann and Whitney 1947) to confirm that the difference between these two categories is significant, which is the case: the p-value is lower than 0.001. This shows that RGT tests with higher coverage are more likely to expose program behavioral differences and to detect overfitting patches for program repair.
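As a reference for how such a significance check can be reproduced, the sketch below runs the same kind of test with the MannWhitneyUTest class from Apache Commons Math 3; the coverage arrays are placeholders, not our measured data, and the class name is illustrative.

```java
import org.apache.commons.math3.stat.inference.MannWhitneyUTest;

public class CoverageSignificance {
    public static void main(String[] args) {
        // Placeholder branch-coverage ratios (in %) for the two groups of RGT test suites:
        // suites that detect at least one overfitting patch vs. suites that detect none.
        double[] coverageDetecting = {92.0, 88.5, 81.0, 95.2, 78.4};
        double[] coverageNotDetecting = {55.1, 47.9, 62.3, 38.0, 51.6};

        MannWhitneyUTest test = new MannWhitneyUTest();
        double pValue = test.mannWhitneyUTest(coverageDetecting, coverageNotDetecting);

        // A p-value below the chosen significance level (0.001 in our experiment) indicates
        // that the coverage difference between the two groups is statistically significant.
        System.out.println("p-value = " + pValue);
    }
}
```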
On the Difference Between RGT and DiffTGen
Figure 4 shows the proportion of behavioral differences detected by RGT tests and DiffTGen per the taxonomy presented in Table 3. The proportions are computed over the 7,923 test failures of RGTEvosuite2019, the 65,819 test failures of RGTRandoop2019, and the 143 behavioral differences detected by DiffTGen. RGTEvosuite2019 (top horizontal bar) detects six categories of behavioral differences and RGTRandoop2019 detects five categories. DiffTGen is only able to detect behavioral differences due to assertion failures between expected and actual values. For example, DiffTGen fails to produce a result for the two Lang-43 patches shown in Listing 2. The reason is that these two patches cause a Java heap space error, thus no values are produced for comparison in DiffTGen. On the contrary, RGT works on these cases: it successfully detects the behavioral difference and classifies these two patches as overfitting.
In all cases, we see that assertion failure is the most effective category for detecting behavioral differences of overfitting patches. Moreover, exceptions are also effective at detecting behavioral differences, and this is the key factor in RGT's effectiveness over DiffTGen (#F8). Notably, the two considered test generators are not equally good at generating exceptional cases, e.g., 31.9% of RGTEvosuite2019 failing tests expose differences of category Dexc1 while only 2.8% of RGTRandoop2019 tests do so. Similarly, we note that Randoop does not support exception assertions based on the throwing location (0% of \(D_{exc\_loc}\)).
Recall that both DiffTGen and RGTEvosuite2019 leverage Evosuite for test case generation; we now explain how the observed differences relate to their configuration differences. We present the Evosuite configurations in Table 7. The first column gives the parameter used to configure Evosuite, the second and third columns show the value set for this parameter by DiffTGen and RGT respectively, and the last column explains the meaning of the parameter for clarification. All other parameters are set to their default values.
Table 7 Configurations of Evosuite in DiffTGen and RGTEvosuite2019

As shown in Table 7, both DiffTGen and RGTEvosuite2019 set the search criterion to branch coverage to guide Evosuite's test generation, i.e., it maximizes branch coverage. The second row indicates that they both execute Evosuite for 30 trials with 30 different random seeds. DiffTGen uses a 60 s search budget (the best configuration of DiffTGen reported in Xin and Reiss 2017a), while RGTEvosuite2019 uses a 100 s search budget, which is heuristically the best value for RGT that we identified in our experiments. DiffTGen does not configure a timeout for executing the body of a single test; on the contrary, RGT configures such a timeout to bound the experimental time. As shown in Fig. 4, no overfitting patch is identified by RGTEvosuite2019 through the timeout category (Dtimeout). In other words, the difference in timeout settings has no influence on the experimental results, and the comparison can thus be considered fair.
DiffTGen and RGTEvosuite2019 differ in one key parameter: assertion generation. DiffTGen sets assertion generation to false in Evosuite because it does not compare behavior based on the oracles generated by Evosuite, but based on the variables observed via monitoring with code instrumentation. Recall that to determine a patch's correctness, DiffTGen compares the values of instrumented variables between the patched version and the human-written version. On the contrary, RGT fully leverages the oracles (i.e., assertions and exceptions) generated by Evosuite based on the human-written version. In summary, DiffTGen and RGTEvosuite2019 use the same search criterion, the same random seeds, and close search budgets to guide Evosuite for test generation. The key difference between them is assertion generation: RGTEvosuite2019 uses Evosuite to generate executable test cases with oracles while DiffTGen only considers differences in internal variables.
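For concreteness, the two configurations can be contrasted through the way Evosuite would be launched. The sketch below mirrors the Table 7 parameters as we understand them; flag and property names follow Evosuite's command-line interface and are assumptions that may need adjusting to the installed version, and the class and target names are illustrative.

```java
import java.util.List;

public class EvosuiteLaunch {
    // Builds an Evosuite command line reflecting the RGT configuration of Table 7.
    // For DiffTGen, the same invocation would use -Dsearch_budget=60, -Dassertions=false
    // and no per-test timeout (property names assumed from Evosuite's documentation).
    static List<String> rgtCommand(String targetClass, String classpath, int seed) {
        return List.of(
            "java", "-jar", "evosuite.jar",
            "-class", targetClass,          // ground-truth (human-patched) class under test
            "-projectCP", classpath,
            "-criterion", "branch",         // search criterion: branch coverage
            "-seed", String.valueOf(seed),  // 30 seeds -> 30 independent runs
            "-Dsearch_budget=100",          // 100 s search budget for RGT
            "-Dassertions=true");           // keep Evosuite-generated oracles (key difference)
    }
}
```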
We also compared the ability of DiffTGen and RGTEvosuite2019 to capture output differences. DiffTGen captures the result of the execution of each statement (if any) and then compares, for each statement, the result obtained from the human-written patch and that from the machine patch (per the design of DiffTGen, these oracles are usually manually constructed). Due to its design, DiffTGen requires the compared values to be present in both the human-written patch and the machine patch. In our experiments, DiffTGen fails to capture all output differences for two reasons: 1) there are no instrumented output values available, or 2) the output values are not comparable. For example, DiffTGen fails to capture 16 overfitting patches generated for bug Chart-1, because neither the faulty program line nor the patched program line is a value line, and thus no output values are captured. On the contrary, RGTEvosuite2019 tests consider all possible variables in the generated assertions. RGTEvosuite2019 captures more behavioral differences by exploring all possible variables as well as more properties of those variables.
We recapitulate the main novelty and advantages of RGT compared to DiffTGen. First, RGT provides reusable tests that can be executed in a lightweight manner on any machine patch. On the contrary, in DiffTGen, all tests are generated based on an instrumented patched program, and these tests are coupled with the specific instrumented variables; thus, the generated tests of DiffTGen are not reusable for future research. Second, RGT is a fully automated technique while DiffTGen requires manual work to identify change-related statements in the patched version and the human-written version (this has also been noted by Xiong et al. 2018).
Now we compare our findings against those of the closely related work by Le et al. (2019). First of all, both experiments find that the performance of DiffTGen and Randoop for detecting overfitting patches is similar. Since our experiment is done on a new and bigger benchmark (381 versus 135 overfitting patches), this significantly increases the external validity of this finding. Second, the key novelty of our experiment is that we consider Evosuite, which is not used in Le et al. (2019). In our experiment, DiffTGen and Randoop respectively achieve an effectiveness of 37.5% and 31%, while Evosuite reaches 65.1%. This is a major result compared to Le et al. (2019): it shows that automated patch assessment is actually effective, which is essential for future progress in program repair. Finally, we suggest that different test generation tools can be used in combination, which is a pragmatic approach for practitioners: our study shows that Evosuite and Randoop put together in RGT achieve a 72% effectiveness in identifying overfitting patches.
Result of RQ4: Time Cost of RGT Patch Assessment
Table 8 summarizes the time cost of RGT patch assessment. The first column gives the breakdown of the time cost as explained in Section 3.5. The second and third columns give the cost for the RGT tests we have generated for this study, while the fourth to sixth columns concern the three categories of RGT tests generated in previous research projects and shared by their respective authors. TCGen time is not given for the previously generated RGT tests: those times were reported by their original authors and re-measuring them is not our goal, thus we put a '–' in the corresponding cells. For example, the second column indicates that RGTEvosuite2019 required 136.3 h for test case generation, 2.9 h for performing the sanity check, 6.2 h for assessing the correctness of the patches in the Dcorrect dataset, and 9.1 h for those in the Doverfitting dataset.
Table 8 Time cost of RGT patch assessment

We observe that TCGen is the dominant time cost of RGT patch assessment. RGTEvosuite2019 and RGTRandoop2019 respectively spend 136.3/154.5 h (88.2%) and 109.7/125.1 h (87.7%) on test generation (#F10). For assessing 638 patches using the newly generated RGT tests, we need 14.5 min and 11.76 min per patch for Evosuite and Randoop respectively.
The three sets of previously generated RGT tests require 5.2, 15.3 and 5.1 h respectively for assessing patch correctness on the Dcorrect and Doverfitting datasets. Our experiment shows that reusing tests from previous research is a significant time saver: for assessing the 638 patches using previously generated RGT tests, the assessment time is 2.4 min per patch on average.
Note that the execution time of RGTEvosuiteASE15 is less than that of RGTEvosuite2019. This is because RGTEvosuiteASE15 contains only 10 runs of test generation while RGTEvosuite2019 contains 30 runs. With the same number of test generation runs, RGTEvosuiteEMSE18 is faster than RGTEvosuite2019 because it only contains tests for 42 bugs.
Now we take a look at the effectiveness of the RGT tests from previous research. Together, they identify 9 out of 10 misclassified patches from Dcorrect (the missing one is patch1-Lang-35-ACS). From Doverfitting, a total of 219 overfitting patches are found by the three sets of previously generated RGT tests together (#F11). Recall that RGTEvosuite2019 and RGTRandoop2019 together identify 274 overfitting patches for Doverfitting. Despite fewer tests, the RGT tests from previous research achieve 80% (219/274) of the effectiveness of our newly generated RGT tests. Therefore, RGT tests generated by previous research can be considered effective and efficient for patch correctness assessment.
Regression test generation is known to be costly, and indeed, over 87% of the time cost in our experiment is spent in test generation. Consequently, reusing previously generated RGT test cases is a significant time saver for patch assessment. By sharing a curated dataset of 4 million generated RGT tests, we save 246 computation hours for future researchers (not counting the associated time for configuration, cluster management, etc.). More importantly, reusing tests is essential for the scientific community: when experiments and papers are based on the same set of generated tests, the results can be reliably compared against one another. Consequently, our replication dataset helps the community to achieve well-founded results.
Result of RQ5: Trade-Off Between Test Generation and Effectiveness of RGT Assessment
Figure 5 shows the number of overfitting patches detected depending on the number of generated test suites. The X-axis shows the number of test suites generated with different seeds; the Y-axis is the average number of overfitting patches detected in Doverfitting, computed over 1000 random groups of test suites. Recall that the best results, obtained after all runs, are that RGTEvosuite2019 and RGTRandoop2019 identify 248 and 118 overfitting patches respectively from Doverfitting.
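The curve can be read as the output of a simple subsampling procedure; the sketch below reflects our understanding of the protocol behind Fig. 5 rather than the exact script used. Here detectedPerRun is assumed to map each of the 30 runs to the set of overfitting patches its test suite detects, and the names are illustrative.

```java
import java.util.*;

public class DetectionCurve {
    /**
     * Average number of distinct overfitting patches detected by a random group of k runs,
     * estimated over 1000 random groups (illustrative sketch of the subsampling protocol).
     */
    static double averageDetected(List<Set<String>> detectedPerRun, int k, Random random) {
        double total = 0;
        for (int sample = 0; sample < 1000; sample++) {
            List<Set<String>> shuffled = new ArrayList<>(detectedPerRun);
            Collections.shuffle(shuffled, random);
            Set<String> union = new HashSet<>();
            for (Set<String> oneRun : shuffled.subList(0, k)) {
                union.addAll(oneRun); // a patch counts once, however many runs detect it
            }
            total += union.size();
        }
        return total / 1000.0;
    }
}
```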
For both RGTEvosuite2019 and RGTRandoop2019, the more test generation runs, the better the effectiveness of RGT patch assessment. Nevertheless, even with a small number of test generation runs, e.g., 5, RGT is able to achieve more than 80% of its effectiveness. On average, 25 runs of RGTRandoop2019 achieve the same performance as 30 runs. On the contrary, RGTEvosuite2019 keeps identifying more overfitting patches, even after 25 runs. Due to the computational costs of this experiment, it is left to future work to identify when a plateau appears for RGTEvosuite2019. We observe that after 10 test suites of RGTEvosuite2019, the number of newly identified overfitting patches still increases but does not vary largely; thus, a pragmatic rule of thumb is to do 10 test generations. For RGTRandoop2019, the numbers of overfitting patches identified with different numbers of test generation runs are considerably close. In our experiment, we observe that after 15 test suites of RGTRandoop2019, the curve becomes steady. Thus, a pragmatic rule of thumb is to do 15 test generations with Randoop, which is equivalent to 93% of the effectiveness in overfitting patch classification.