An Empirical Study of Fault Localization in Python Programs

Despite its massive popularity as a programming language, especially in novel domains like data science, there is comparatively little research about fault localization that targets Python. Even though it is plausible that several findings about programming languages like C/C++ and Java -- the most common choices for fault localization research -- carry over to other languages, whether the dynamic nature of Python and how the language is used in practice affect the capabilities of classic fault localization approaches remains an open question. This paper is the first multi-family large-scale empirical study of fault localization on real-world Python programs and faults. Using Zou et al.'s recent large-scale empirical study of fault localization in Java as the basis of our study, we investigated the effectiveness (i.e., localization accuracy), efficiency (i.e., runtime performance), and other features (e.g., different entity granularities) of seven well-known fault-localization techniques in four families (spectrum-based, mutation-based, predicate switching, and stack-trace-based) on 135 faults from 13 open-source Python projects from the BugsInPy curated collection. The results replicate for Python several results known about Java, and shed light on whether Python's peculiarities affect the capabilities of fault localization. The replication package that accompanies this paper includes detailed data about our experiments, as well as the tool FauxPy that we implemented to conduct the study.


Introduction
It is commonplace that debugging is an activity that takes up a disproportionate amount of time and resources in software development [41]. This also explains the popularity of fault localization as a research subject in software engineering: identifying the locations in a program's source code that are implicated in some observed failures (such as crashes or other kinds of runtime errors) is a key step of debugging. This paper contributes to the empirical knowledge about the capabilities of fault localization techniques, targeting the Python programming language.
Despite the massive amount of work on fault localization (see Section 3) and the popularity of the Python programming language, most empirical studies of fault localization target languages like Java or C. This leaves open the question of whether Python's characteristics -- such as the fact that it is dynamically typed, or that it is dominant in certain application domains such as data science -- affect the capabilities of classic fault localization techniques, which were developed and tested primarily on different kinds of languages and programs.
This paper fills this knowledge gap: to our knowledge, it is the first multi-family large-scale empirical study of fault localization in real-world Python programs. The starting point is Zou et al.'s recent extensive study [78] of fault localization for Java. This paper's main contribution is a differentiated conceptual replication [30] of Zou et al.'s study, sharing several of its features: i) it experiments with several different families (spectrum-based, mutation-based, predicate switching, and stack-trace-based) of fault localization techniques; ii) it targets a large number of faults in real-world projects (135 faults in 13 projects); iii) it studies fault localization effectiveness at different granularities (statement, function, and module); iv) it considers combinations of complementary fault localization techniques. The fundamental novelty of our replication is that it targets the Python programming language; furthermore, i) it analyzes fault localization effectiveness on different kinds of faults and different categories of projects; ii) it estimates the contributions of different fault localization features by means of statistical regression models; iii) it compares its main findings for Python to Zou et al.'s [78] for Java.
The main findings of our Python fault localization study are as follows:
1. Spectrum-based fault localization techniques are the most effective, followed by mutation-based fault localization techniques.
2. Predicate switching and stack-trace fault localization are considerably less effective, but they can work well on small sets of faults that match their characteristics.
3. Stack-trace is by far the fastest fault localization technique; predicate switching and mutation-based fault localization techniques are the most time consuming.
4. Bugs in data-science related projects tend to be harder to localize than those in other categories of projects.
5. Combining fault localization techniques boosts their effectiveness with only a modest hit on efficiency.
6. The main findings about relative effectiveness still hold at all granularity levels.
7. Most of Zou et al.'s [78] findings about fault localization in Java carry over to Python.
A practical challenge to carrying out a large-scale fault localization study of Python projects was that, at the time of writing, there were no open-source tools supporting a variety of fault localization techniques for this programming language. Thus, to perform this study, we implemented FAUXPY: a fault-localization tool for Python that supports seven fault localization techniques in four families, is highly configurable, and works with the most common Python unit testing frameworks (such as Pytest and Unittest). The present paper is not a paper about FAUXPY, which we plan to present in detail in a separate publication. Nevertheless, we briefly discuss the key features of FAUXPY, and make the tool available as part of this paper's replication package -- which also includes all the detailed experimental artifacts and data that support further independent analysis and replicability.
The rest of the paper is organized as follows. Section 2 presents the fault localization techniques that fall within the scope of the empirical study, and outlines FAUXPY's features. Section 3 summarizes the most relevant related work in fault localization, demonstrating how Python is underrepresented in this area. Section 4 presents in detail the paper's research questions, and the experimental methodology that we followed to answer them. Section 5 details the experimental results for each investigated research question, and presents the limitations and threats to the validity of the findings. Section 6 concludes with a high-level discussion of the main results, and of possible avenues for future work.
Replication package. For reproducibility, all experimental artifacts of this paper's empirical study, and the implementation of the FAUXPY tool, are available: https://doi.org/10.6084/m9.figshare.23254688

Fault Localization and FAUXPY
Fault localization techniques [73,71] relate program failures (such as crashes or assertion violations) to faulty locations in the program's source code that are responsible for the failures. Concretely, a fault localization technique L assigns a suspiciousness score L_T(e) to any program entity e -- usually, a location, function, or module -- given test inputs T that trigger a failure in the program. The suspiciousness score L_T(e) should be higher the more likely it is that e is the location of a fault that is ultimately responsible for the failure. Thus, a list of all program entities e1, e2, ... ordered by decreasing suspiciousness score L_T(e1) ≥ L_T(e2) ≥ ... is fault localization technique L's overall output.
Let T = P ∪ F be a set of tests partitioned into passing tests P and failing tests F, such that F ≠ ∅ -- there is at least one failing test -- and all failing tests originate in the same fault. Tests T and a program p are thus the target of a single fault localization run. Fault localization techniques then differ in what kind of information they extract from T and p to compute suspiciousness scores. A fault localization family is a group of techniques that combine the same kind of information according to different formulas. Sections 2.1-2.4 describe four common FL families, which comprise a total of seven FL techniques. As Section 2.5 further explains, an FL technique's granularity denotes the kind of program entities it analyzes for suspiciousness -- from individual program locations to functions or files/modules. Some FL techniques are only defined for a certain granularity level, whereas others can be applied at different granularities.
While FL techniques are usually applicable to any programming language, we could not find any comprehensive implementation of the most common fault localization techniques for Python at the time of writing. Therefore, we implemented FAUXPY -- an automated fault localization tool for Python implementing several widely used techniques -- and used it to perform the empirical study described in the rest of the paper. Section 2.6 outlines FAUXPY's main features and some details of its implementation.

Tarantula_T(e) = (F+(e) / |F|) / (F+(e) / |F| + P+(e) / |P|)   (1)

Ochiai_T(e) = F+(e) / √(|F| × (F+(e) + P+(e)))   (2)

DStar_T(e) = (F+(e))² / (P+(e) + F−(e))   (3)

Figure 1: SBFL formulas to compute the suspiciousness score of an entity e given tests T = P ∪ F partitioned into passing P and failing F. All formulas compute a score that is higher the more failing tests F+(e) cover e, and lower the more passing tests P+(e) cover e -- capturing the basic heuristics of SBFL.
Metallaxis_T(m) = F_k∼(m) / √(|F| × (F_k∼(m) + P_k(m)))   (4)        Metallaxis_T(e) = max_{m∈M, m mutates e} Metallaxis_T(m)

Muse_T(m) = F_k(m) / Σ_{n∈M} F_k(n) − P_k(m) / Σ_{n∈M} P_k(n)   (5)        Muse_T(e) = mean_{m∈M, m mutates e} Muse_T(m)

Figure 2: MBFL formulas (left) to compute the suspiciousness score of a mutant m given tests T = P ∪ F partitioned into passing P and failing F. All formulas compute a score that is higher the more failing tests F_k(m) kill m, and lower the more passing tests P_k(m) kill m -- capturing the basic heuristics of mutation analysis. On the right, MBFL formulas to compute the suspiciousness score of a program entity e by aggregating the suspiciousness scores of all mutants m ∈ M that modified e in the original program.

Spectrum-Based Fault Localization
Techniques in the spectrum-based fault localization (SBFL) family compute suspiciousness scores based on a program's spectra [52] -- in other words, its concrete execution traces. The key heuristic of SBFL techniques is that a program entity's suspiciousness is higher the more often the entity is covered (reached) by failing tests and the less often it is covered by passing tests. The various techniques in the SBFL family differ in what formula they use to assign suspiciousness scores based on an entity's coverage in passing and failing tests.
Given tests T = P ∪ F as above, and a program entity e: i) P+(e) is the number of passing tests that cover e; ii) P−(e) is the number of passing tests that do not cover e; iii) F+(e) is the number of failing tests that cover e; iv) and F−(e) is the number of failing tests that do not cover e. Figure 1 shows how Tarantula [26], Ochiai [1], and DStar [70] -- three widely used SBFL techniques [49] -- compute suspiciousness scores given this coverage information. DStar's formula (3), in particular, takes the second power of the numerator, as recommended by other empirical studies [78,70].
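To make the formulas concrete, here is a minimal Python sketch of the three SBFL scores (illustrative code of our own, not FauxPy's API; the parameters follow the counters defined above):

```python
import math

def tarantula(f_cov, p_cov, total_f, total_p):
    """Tarantula: normalized failing coverage over total normalized coverage."""
    if f_cov == 0:
        return 0.0
    fail_ratio, pass_ratio = f_cov / total_f, p_cov / total_p
    return fail_ratio / (fail_ratio + pass_ratio)

def ochiai(f_cov, p_cov, total_f):
    """Ochiai: failing coverage over the geometric mean of |F| and total coverage."""
    denom = math.sqrt(total_f * (f_cov + p_cov))
    return f_cov / denom if denom else 0.0

def dstar(f_cov, p_cov, f_uncov, star=2):
    """DStar with exponent * = 2, the variant used in the study."""
    denom = p_cov + f_uncov
    return (f_cov ** star) / denom if denom else math.inf

# An entity covered by 3 of 4 failing tests and 1 of 10 passing tests:
print(ochiai(3, 1, 4))  # 0.75
```

All three functions reward coverage by failing tests and penalize coverage by passing tests, matching the SBFL heuristic.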

Mutation-Based Fault Localization
Techniques in the mutation-based fault localization (MBFL) family compute suspiciousness scores based on mutation analysis [25], which generates many mutants of a program p by applying random transformations to it (for example, changing a comparison operator < to ≤ in an expression). A mutant m of p is thus a variant of p whose behavior differs from p's at, or after, the location where m differs from p. The key idea of mutation analysis is to collect information about p's runtime behavior based on how it differs from its mutants'. Accordingly, when a test t behaves differently on p than on m (for example, p passes t but m fails it), we say that t kills m.
To perform fault localization on a program p, MBFL techniques first generate a large number of mutants M = {m1, m2, ...} of p by systematically applying each mutation operator to each statement in p that is executed in any failing test F. Then, given tests T = P ∪ F as above, and a mutant m ∈ M: i) P_k(m) is the number of tests that p passes but m fails (that is, the tests in P that kill m); ii) F_k(m) is the number of tests that p fails but m passes (that is, the tests in F that kill m); iii) and F_k∼(m) is the number of tests that p fails and that behave differently on m, either because they pass on m or because they still fail but lead to a different stack trace (this is a weaker notion of tests that kill m [45]). Figure 2 shows how Metallaxis [45] and Muse [42] -- two widely used MBFL techniques -- compute suspiciousness scores of each mutant in M. Metallaxis's formula (4) is formally equivalent to Ochiai's -- except that it is computed for each mutant and measures killing tests instead of covering tests. In Muse's formula (5), Σ_{n∈M} F_k(n) is the total number of failing tests in F that kill any mutant in M, and Σ_{n∈M} P_k(n) is the total number of passing tests in P that kill any mutant in M (these are called f2p and p2f in Muse's paper [42]).
Finally, MBFL computes a suspiciousness score for a program entity e by aggregating the suspiciousness scores of all mutants that modified e in the original program p; when this is the case, we say that a mutant m mutates e. The right-hand side of Figure 2 shows Metallaxis's and Muse's suspiciousness formulas for entities: Metallaxis takes the largest (maximum) mutant score, whereas Muse takes the average (mean) of the mutant scores.
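The two aggregation rules can be sketched as follows (illustrative code of our own, assuming the per-mutant scores of an entity's mutants have already been computed):

```python
from statistics import mean

def metallaxis_entity(mutant_scores):
    """Metallaxis: an entity is as suspicious as its most suspicious mutant."""
    return max(mutant_scores, default=0.0)

def muse_entity(mutant_scores):
    """Muse: an entity's suspiciousness is the mean over its mutants' scores."""
    return mean(mutant_scores) if mutant_scores else 0.0

# Suppose three mutants of entity e obtained scores 0.2, 0.9, and 0.4:
scores = [0.2, 0.9, 0.4]
print(metallaxis_entity(scores))  # 0.9
print(muse_entity(scores))        # 0.5
```

The max rule makes Metallaxis sensitive to a single strongly implicated mutant, whereas Muse's mean smooths over all mutants of the same entity.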

Predicate Switching
The predicate switching (PS) technique [74] localizes faults by dynamically analyzing the executions of failing tests: it re-executes a failing test several times, each time forcibly flipping the outcome of one evaluation of one branch predicate. If flipping a predicate evaluation makes the failing test pass, that predicate is called critical; the locations of critical predicates are reported as suspicious, ranked according to how close to the failure they were evaluated. For example, the most suspicious program entity e will be the location of the last critical predicate evaluated before any test failure.
PS has some distinctive features compared to other FL techniques. First, it only uses failing tests for its dynamic analysis; any passing tests P are ignored. Second, the only program entities it can report as suspicious are locations of predicates; thus, it usually reports a shorter list of suspicious locations than SBFL and MBFL techniques. Third, while MBFL mutates program code, PS dynamically mutates individual program executions. For example, suppose that a loop while c: B executes its body B twice -- and hence, the loop condition c is evaluated three times -- in a failing test. Then, PS will generate three variants of this test execution: i) one where the loop body never executes (c is false the first time it is evaluated); ii) one where the loop body executes once (c is false the second time it is evaluated); iii) one where the loop body executes three or more times (c is true the third time it is evaluated).
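The single-flip re-execution idea can be sketched in plain Python (a toy model of our own: real PS implementations instrument execution transparently rather than requiring the program to call an `evaluate` hook):

```python
def run_with_flip(program, test_input, flip_at):
    """Run `program` on `test_input`, negating only the flip_at-th
    dynamic evaluation of a branch predicate (a toy model of PS)."""
    count = {"n": 0}
    def evaluate(cond):
        count["n"] += 1
        return (not cond) if count["n"] == flip_at else cond
    return program(test_input, evaluate)

# A toy faulty program: the loop condition uses < instead of <=.
def buggy_sum_to(n, evaluate):
    total, i = 0, 1
    while evaluate(i < n):  # bug: should be i <= n
        total += i
        i += 1
    return total

# The failing test expects sum(1..3) == 6. Flipping the third predicate
# evaluation (the one that normally exits the loop) makes the test pass,
# so that evaluation is critical:
critical = [flip for flip in range(1, 5)
            if run_with_flip(buggy_sum_to, 3, flip) == 6]
print(critical)  # [3]
```

Here the flipped evaluation that repairs the run sits exactly at the buggy predicate, illustrating why critical predicates are good fault-location candidates.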

Stack Trace Fault Localization
When a program execution fails with a crash (for example, an uncaught exception), the language runtime usually prints its stack trace (the chain of methods active when the crash occurred) as debugging information for the user. In fact, it is known that stack trace information helps developers debug failing programs [5]; and a bug is more likely to be fixed if it is close to the top of a stack trace [59]. Based on these empirical findings, Zou et al. [78] proposed the stack trace fault localization technique (ST), which uses the simple heuristic of assigning suspiciousness based on how close a program entity is to the top of a stack trace.
Concretely, given a failing test t ∈ F, its stack trace is a sequence f1 f2 ... of the stack frames of all functions that were executing when t terminated with a failure, listed in reverse order of invocation; thus, f1 is the most recently called function, which was directly called by f2, and so on. ST assigns a (positive) suspiciousness score to any program entity e that belongs to a function f_k in t's stack trace: ST_t(e) = 1/k, so that e's suspiciousness is higher the closer to the failure e's function was called. In particular, the most suspicious program entities will be all those in the function f1 called in the top stack frame. Then, the overall suspiciousness score of e is the maximum over all failing tests F: ST_F(e) = max_{t∈F} ST_t(e).
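A sketch of ST scoring, with stack traces modeled as lists of function names, most recent call first (an illustration of our own; the function names are hypothetical):

```python
def stack_trace_scores(failing_traces):
    """ST suspiciousness: 1/k for entities in the k-th frame from the top,
    maximized over all failing tests. Each trace is a list of function
    names, most recent call first."""
    scores = {}
    for trace in failing_traces:
        for k, func in enumerate(trace, start=1):
            scores[func] = max(scores.get(func, 0.0), 1.0 / k)
    return scores

traces = [["parse", "load", "main"],   # 'parse' raised the exception
          ["load", "main"]]            # a second failure, raised in 'load'
print(stack_trace_scores(traces))
# {'parse': 1.0, 'load': 1.0, 'main': 0.5}
```

Note how `load` reaches the maximum score 1.0 because it tops the second trace, even though it is only second in the first one.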

Granularities
Fault localization granularity refers to the kinds of program entities that an FL technique ranks. The most widely studied granularity is statement-level, where each statement in a program may receive a different suspiciousness score [49,70]. However, coarser granularities have also been considered, such as function-level (also called method-level) [3,72] and module-level (also called file-level) [55,77].
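A simple way to obtain coarse-grained scores from statement-level ones is max-aggregation, sketched below (our own illustration; the statement ids and the parent mapping are hypothetical):

```python
def coarsen(statement_scores, parent_of):
    """Lift statement-level suspiciousness to a coarser granularity:
    each coarse entity (function or module) receives the maximum score
    of the statements it contains."""
    coarse = {}
    for stmt, score in statement_scores.items():
        parent = parent_of[stmt]
        coarse[parent] = max(coarse.get(parent, 0.0), score)
    return coarse

stmts = {"f.py:3": 0.2, "f.py:4": 0.8, "g.py:7": 0.5}
parents = {"f.py:3": "f.py", "f.py:4": "f.py", "g.py:7": "g.py"}
print(coarsen(stmts, parents))  # {'f.py': 0.8, 'g.py': 0.5}
```

Applying the same function twice (statements to functions, then functions to modules) yields module-level rankings from a single statement-level analysis.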
In practice, implementations of FL techniques that support different levels of granularity focus on the finest granularity (usually, statement-level granularity), whose information they use to perform FL at coarser granularities. Namely, the suspiciousness of a function is the maximum suspiciousness of any statement in its definition; and the suspiciousness of a module is the maximum suspiciousness of any function belonging to it.

FAUXPY: Features and Implementation
Despite its popularity as a programming language, we could not find off-the-shelf implementations of fault localization techniques for Python at the time of writing [57]. The only exception is CharmFL [21] -- a plugin for the PyCharm IDE -- which only implements SBFL techniques. Therefore, to conduct an extensive empirical study of FL in Python, we implemented FAUXPY: a fault localization tool for Python programs.
FAUXPY supports all seven FL techniques described in Sections 2.1-2.4; it can localize faults at the level of statements, functions, or modules (Section 2.5). To make FAUXPY a flexible and extensible tool, easy to use with a variety of other commonly used Python development tools, we implemented it as a stand-alone command-line tool that works with tests in the formats supported by Pytest, Unittest, and Hypothesis [40] -- three popular Python testing frameworks.
While running, FAUXPY stores intermediate analysis data in an SQLite database; upon completing an FL run, it returns to the user a human-readable summary, including suspiciousness scores and the ranking of program entities. The database improves performance (for example, by caching intermediate results) but also facilitates incremental analyses -- for example, where different batches of tests are provided in different runs.
FAUXPY's implementation uses Coverage.py [4] -- a popular code-coverage measurement library -- to collect the execution traces needed for SBFL and MBFL. It also uses the state-of-the-art mutation-testing framework Cosmic Ray [8] to generate mutants for MBFL; since Cosmic Ray is easily configurable to use some or all of its mutation operators -- or even to add new user-defined mutation operators -- FAUXPY's MBFL implementation is also fully configurable. To implement PS in FAUXPY, we developed an instrumentation library that can selectively change the runtime value of predicates in different runs, as required by the PS technique. The implementation of FAUXPY is available as open source (see this paper's replication package).

Related Work
Fault localization has been an intensely researched topic for over two decades, and its popularity does not seem to wane [71]. This section summarizes a selection of studies that are directly relevant to the paper; Wong's recent survey [71] provides a broader summary for interested readers.
Spectrum-based fault localization. The Tarantula SBFL technique [26] was one of the earliest and most influential FL techniques, thanks also to its empirical evaluation showing it to be more effective than other competing techniques [51,7]. The Ochiai SBFL technique [1] improved over Tarantula, and it is often still considered the "standard" SBFL technique.
These earlier empirical studies [26,1], as well as other contemporary and later studies of FL [45], used the Siemens suite [20]: a set of seven small C programs with seeded bugs. Since then, the scale and realism of FL empirical studies have significantly improved, targeting real-world bugs affecting projects of realistic size. For example, Ochiai's effectiveness was confirmed [33] on a collection of more realistic C and Java programs [12]. When Wong et al. [70] proposed DStar, a new SBFL technique, they demonstrated its capabilities in a sweeping comparison involving 38 other SBFL techniques (including the "classic" Tarantula and Ochiai). In contrast, numerous empirical results about fault localization in Java based on experiments with artificial faults were found not to hold in experiments with real-world faults [49] using the Defects4J curated collection [28].
Mutation-based fault localization. With the introduction of novel fault localization families -- most notably, MBFL -- empirical comparisons of techniques belonging to different families became more common [42,45,49,78]. The Muse MBFL technique was introduced to overcome a specific limitation of SBFL techniques: the so-called "tie set problem". This occurs when SBFL assigns the same suspiciousness score to different program entities, simply because they belong to the same simple control-flow block (see Section 2.1 for details on how SBFL works). Metallaxis-FL [45] (which we simply call "Metallaxis" in this paper) is another take on MBFL that can improve over SBFL techniques.
The comparison between MBFL and SBFL is especially delicate given how MBFL works. As demonstrated by Pearson et al. [49], MBFL's effectiveness crucially depends on whether it is applied to bugs that are "similar" to those introduced by its mutation operators. This explains why the MBFL studies targeting artificially seeded faults [42,45] found MBFL to outperform SBFL, whereas studies targeting real-world faults [49,78] found the opposite to be the case -- a result also confirmed by the present paper in Section 5.1.
Mutation testing. MBFL techniques rely on mutation testing to generate mutants of a faulty program that may help locate the fault. Therefore, the selection of mutation operators used for mutation testing impacts the effectiveness of MBFL techniques. Research in mutation testing has grown considerably in the last decade, developing a large variety of mutation operators tailored to specific programming languages, applications, and faults [46]. Despite these recent developments, the fundamental set of mutation operators introduced in Offutt et al.'s seminal work [44] remains the basis of virtually every application of mutation testing. These fundamental operators generate mutants by modifying or removing arithmetic, logical, and relational operators, as well as constants and variables in a program, and hence are widely applicable and domain-agnostic. Notably, the Cosmic Ray [8] Python mutation testing framework (used in our implementation of FAUXPY), the two other popular Python mutation testing frameworks MutPy [11] and mutmut, as well as the popular Java mutation testing frameworks Pitest, MuJava [39], and Major [27] (the latter used in Zou et al.'s MBFL experiments [78]) all offer Offutt et al.'s fundamental operators. This helps make experiments with mutation testing techniques meaningfully comparable.
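As an illustration of one such fundamental operator, the following sketch uses Python's standard `ast` module to mutate a relational operator (this is our own minimal example of the idea, not how Cosmic Ray is implemented; the sample source is hypothetical):

```python
import ast

class RelationalMutator(ast.NodeTransformer):
    """A classic relational-operator mutation: replace `<` with `<=`.
    Real frameworks systematically apply many such operators, one at
    a time, producing one mutant per application site."""
    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [ast.LtE() if isinstance(op, ast.Lt) else op
                    for op in node.ops]
        return node

source = "def ready(x, limit):\n    return x < limit\n"
tree = RelationalMutator().visit(ast.parse(source))
mutant = ast.unparse(ast.fix_missing_locations(tree))
print(mutant)  # the return statement now uses <=
```

A test suite that distinguishes the original `<` from the mutated `<=` (e.g., a test with `x == limit`) kills this mutant, which is exactly the signal MBFL aggregates into suspiciousness scores.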
Empirical comparisons. This paper's study design is based on Zou et al.'s empirical comparison of fault localization on Java programs [78]. We chose their study because it is fairly recent (it was published in 2021), it is comprehensive (it targets 11 fault localization techniques in seven families, as well as combinations of some of these techniques), and it targets realistic programs and faults (357 bugs in five projects from the Defects4J curated collection).
Ours is a differentiated conceptual replication [30] of Zou et al.'s study [78]. We target a comparable number of subjects (135 BUGSINPY [67] bugs vs. 357 Defects4J [28] bugs) from a wider selection of projects (13 real-world Python projects vs. five real-world Java projects). We study [78]'s four main fault localization families SBFL, MBFL, PS, and ST, but we exclude three other families that featured in their study: DS (dynamic slicing [18]), IRBFL (information retrieval-based fault localization [77]), and HBFL (history-based fault localization [50]). IRBFL and HBFL were shown to be scarcely effective by Zou et al. [78], and they rely on kinds of artifacts that may not always be available when dynamically analyzing a program as done by the other "mainstream" fault localization techniques. Namely, IRBFL analyzes bug reports, which may not be available for all bugs; HBFL mines the commit histories of programs. In contrast, our study only includes techniques that solely rely on tests to perform fault localization; this helps make the comparison between techniques consistent. Finally, we excluded DS for practical reasons: implementing it requires accurate data- and control-dependency static analyses [73]. These are available in languages like Java through widely used frameworks like Soot [64,32]; in contrast, Python currently offers few mature static analysis tools (e.g., Scalpel [34]), none with the features required to implement DS. Dynamic slicing has been implemented for Python in the past [6] but, unfortunately, no implementation is publicly available; and building one from scratch is outside the present paper's scope.
Python fault localization. Despite Python's popularity as a programming language, the vast majority of fault localization empirical studies target other languages -- mostly C, C++, and Java. To our knowledge, CharmFL [63,21] is the only available implementation of fault localization techniques for Python; the tool is limited to SBFL techniques. We could not find any realistic-size empirical study of fault localization on Python programs comparing techniques of different families. This gap in both the availability of tools [57] and the empirical knowledge about fault localization in Python motivated the present work.
Note that numerous recent empirical studies looked into fault localization for deep-learning models implemented in Python [13,17,76,75,58,65].This is a very different problem, using very different techniques, than "classic" program-based fault localization, which is the topic of our paper.
Deep learning-based fault localization. Deep learning models have recently been applied to the software fault localization problem. The key idea of techniques such as DeepFL [36], GRACE [38], and DEEPRL4FL [37] is to train a deep learning model to identify suspicious locations, giving it as input coverage information, as well as other encoded information about the source code of the faulty programs (such as data- and control-flow dependencies). While these approaches are promising, we could not include them in our empirical study since they do not have the same level of maturity as the other "classic" FL techniques we considered. First, DeepFL and GRACE only work at function-level granularity, whereas the bulk of FL research targets statement-level granularity. Second, there are no reference implementations of techniques such as DEEPRL4FL that we could use for our experiments. Third, the performance of a deep learning-based technique usually depends on the training set. Fourth, training a deep learning model is usually a time-consuming process; how to account for this overhead when comparing efficiency is tricky.
Nevertheless, our empirical study does feature one FL technique that is based on machine learning: CombineFL, which is Zou et al.'s application of learning to rank to fault localization [78]. The same paper also discusses how CombineFL outperforms other state-of-the-art machine learning-based fault localization techniques such as MULTRIC [35], Savant [3], TraPT [35], and FLUCCS [61]. Therefore, CombineFL is a valid representative of the capabilities of pre-deep-learning machine learning FL techniques.
Python vs. Java SBFL comparison. To our knowledge, Widyasari et al.'s recent empirical study of spectrum-based fault localization [68] is the only currently available large-scale study targeting real-world Python projects. Like our work, they use the bugs in the BUGSINPY curated collection as experimental subjects [67]; and they compare their results to those obtained by others for Java [49]. Besides these high-level similarities, the scopes of our study and Widyasari et al.'s are fundamentally different: i) We are especially interested in comparing fault localization techniques in different families; they consider exclusively five spectrum-based techniques, and drill down into the relative performance of these techniques. ii) Accordingly, we consider an orthogonal categorization of bugs: we classify bugs (see Section 4.3) according to characteristics that match the capabilities of different fault-localization families (e.g., stack-trace fault localization works for bugs that result in a crash); they classify bugs according to syntactic characteristics (e.g., multi-line vs. single-line patch). iii) Most importantly, even though both our paper and Widyasari et al.'s compare Python to Java, the framing of our comparisons is quite different: in Section 5.6, we compare our findings about fault localization in Python to Zou et al.'s [78] findings about fault localization in Java; for example, we confirm that SBFL techniques are generally more effective than MBFL techniques in Python, as they were found to be in Java. In contrast, Widyasari et al. directly compare various SBFL effectiveness metrics they collected on Python programs against the same metrics Pearson et al. [49] collected on Java programs; for example, Widyasari et al. report that the percentage of bugs in BUGSINPY that their implementation of the Ochiai SBFL technique correctly localized within the top-5 positions is considerably lower than the percentage of bugs in Defects4J that Pearson et al.'s implementation of the Ochiai SBFL technique correctly localized within the top-5.
It is also important to note that there are several technical differences between our methodology and Widyasari et al.'s. First, we handle ties between suspiciousness scores by computing the E_inspect rank (described in Section 4.5), whereas they use the average rank (as well as other effectiveness metrics). Second, even though we also take our subjects from BUGSINPY, we carefully selected a subset of bugs that are fully analyzable on our infrastructure with all the fault localization techniques we consider (Section 4.1, Section 4.7), whereas they use all available BUGSINPY bugs. The selection of subjects is likely to impact the value of some metrics more than others (see Section 4.5); for example, the exam score is undefined for bugs that a fault localization technique cannot localize, whereas the top-k counts are lower the more faults cannot be localized. These and numerous other differences make our results and Widyasari et al.'s incomparable and mostly complementary. A replication of their comparison following our methodology is an interesting direction for future work, but clearly outside the present paper's scope. In Section 6.1 we present some additional data, and outline a few directions for future work that are directly inspired by Widyasari et al.'s study [68].
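To illustrate the tie-handling difference, here is a sketch of the average-rank metric used by Widyasari et al.: tied entities all receive the mean of the rank positions they jointly occupy (code and entity names are our own; the E_inspect alternative is defined in Section 4.5):

```python
def average_ranks(scores):
    """Average rank with ties: entities sharing a suspiciousness score
    all receive the mean of the positions they jointly occupy in the
    descending ranking."""
    ordered = sorted(scores.items(), key=lambda kv: -kv[1])
    ranks, i = {}, 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j][1] == ordered[i][1]:
            j += 1  # find the end of the tie group
        avg = (i + 1 + j) / 2  # mean of positions i+1 .. j
        for k in range(i, j):
            ranks[ordered[k][0]] = avg
        i = j
    return ranks

print(average_ranks({"s1": 0.9, "s2": 0.5, "s3": 0.5, "s4": 0.1}))
# {'s1': 1.0, 's2': 2.5, 's3': 2.5, 's4': 4.0}
```

Under this metric, the two entities tied at score 0.5 share rank 2.5 (the mean of positions 2 and 3), rather than being arbitrarily ordered.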

Experimental Design
Our experiments assess and compare the effectiveness and efficiency of the seven FL techniques described in Section 2, as well as of their combinations, on real-world Python programs and faults. To this end, we target the following research questions; RQ6, in particular, compares our overall results to Zou et al.'s [78], exploring similarities and differences between Java and Python programs.

Subjects
To have a representative collection of realistic Python bugs, we used BUGSINPY [67], a curated dataset of real bugs collected from real-world Python projects, with all the information needed to reproduce the bugs in controlled experiments. We classified the projects according to their description in their respective repositories, as well as how they are presented in BUGSINPY. Like any classification, the boundaries between categories may be somewhat fuzzy, but the main focus of most projects is quite obvious (such as DS for keras and pandas, or CL for youtube-dl).

Bug selection. Despite BUGSINPY's careful curation, several of its bugs cannot be reproduced because their dependencies are missing or no longer available; this is a well-known problem that plagues the reproducibility of experiments involving Python programs [43]. In order to identify which BUGSINPY bugs were reproducible at the time of our experiments on our infrastructure, we took the following steps for each bug b: i) Using BUGSINPY's scripts, we generated and executed the faulty version p_b− and checked that the tests in F_b fail whereas the tests in P_b pass on it; and we generated and executed the fixed version p_b+ and checked that all tests in F_b ∪ P_b pass on it. Out of all of BUGSINPY's bugs, 120 failed this step; we did not include them in our experiments. ii) Python projects often have two sets of dependencies (requirements): one for users and one for developers; both are needed to run fault localization experiments, which require instrumenting the project code. Another 39 bugs in BUGSINPY lack some development dependencies; we did not include them in our experiments. iii) Two bugs resulted in an empty ground truth (Section 4.2): essentially, there is no way of localizing the fault in p_b−; we did not include these bugs in our experiments. This resulted in 501 − 120 − 39 − 2 = 340 bugs in 13 projects (all but ansible, matplotlib, PySnooper, and scrapy) that we could reproduce in our experiments.
However, this is still an impractically large number: just reproducing the bugs in BUGSINPY takes, in all, nearly a full week of running time, and each FL experiment may require rerunning the same tests several times (hundreds of times in the case of MBFL). Thus, we first discarded 27 bugs that each take more than 48 hours to reproduce. We estimate that including these 27 bugs in the experiments would have taken over 14 CPU-months just for the MBFL experiments, not counting other FL techniques, nor the time for setup and dealing with unexpected failures.
Running all the fault localization experiments for each of the remaining 313 = 340 − 27 bugs takes approximately eleven CPU-hours, for a total of nearly five CPU-months. We therefore selected 135 bugs out of the 313 using stratified random sampling with the four project categories as the "strata", picking: 43 bugs in category CL, 30 … (following [67], the project statistics reported here refer to the latest version of the projects on 2020-06-19). In all, we used this selection of 135 bugs as our empirical study's subjects. Table 2 gives some details about the selected projects and their bugs. As a side comment, note that our experiments with BUGSINPY were generally more time consuming than Zou et al.'s experiments with Defects4J. For example, the average per-bug running time of MBFL in our experiments (15 774 seconds in Table 6) was 3.3 times larger than in Zou et al.'s (4 800 seconds in [78, Table 9]). Even more strikingly, running all fault localization experiments on the 357 Defects4J bugs took less than one CPU-month; in contrast, running MBFL on just the 27 "time consuming" bugs in BUGSINPY would take over 14 CPU-months. This difference may be partly due to the different characteristics of the projects in Defects4J vs. BUGSINPY, and partly to the dynamic nature of Python (which is run by an interpreter).

Faulty Locations: Ground Truth
A fault localization technique's effectiveness measures how accurately the technique's list of suspicious entities matches the actual fault locations in a program: fault localization's ground truth. It is customary to use programmer-written patches as ground truth [78, 49]: the program locations modified by the patch that fixes a certain bug correspond to the bug's actual fault locations.
Concretely, here is how to determine the ground truth of a bug b. First of all, ignore any blank or comment lines, since these do not affect a program's behavior and hence cannot be responsible for a fault. Then, finding the ground truth locations corresponding to removes and modifies is straightforward: a location ℓ that is removed or modified in p_b^+ exists by definition also in p_b^-, and hence it is part of the ground truth. In Figure 3, line 10 is modified and line 17 is removed by the edit that transforms Figure 3a into Figure 3b; thus, 10 and 17 are part of the example's ground truth. Finding the ground truth locations corresponding to adds is more involved [57], because a location ℓ that is added to p_b^+ does not exist in p_b^-: b is a fault of omission [49]. A common solution [78, 49] is to take as ground truth the location in p_b^- that immediately follows ℓ. In Figure 3, line 6 corresponds to the first non-blank line that follows the assignment statement added at line 22 in Figure 3b; thus, 6 is part of the example's ground truth. However, an add at ℓ is actually a modification between two other locations; therefore, the location that immediately precedes ℓ should also be part of the ground truth, since it identifies the same insertion location. In Figure 3, line 1 precedes the assignment statement added at line 22 in Figure 3b; thus, 1 is also part of the example's ground truth.
A location's scope poses a final complication in determining the ground truth of adds. Consider line 31, added in Figure 3b at the very end of function foo's body. The (non-blank, non-comment) location that follows it in Figure 3a is line 16; however, line 16 marks the beginning of another function bar's definition. Function bar cannot be the location of a fault in foo, since the two functions are independent; indeed, that bar's declaration follows foo's is immaterial. Therefore, we only include a location in the ground truth if it is within the same scope as the added location ℓ. If ℓ is part of a function body (including methods), its scope is the function declaration; if ℓ is part of a class outside any function (e.g., an attribute), its scope is the class declaration; otherwise, ℓ's scope is the module it belongs to. In Figure 3, both lines 1 and 6 are within the same module as the statement added at line 22 in Figure 3b. In contrast, line 16 is within a different scope than the statement added at line 31 in Figure 3b. Therefore, lines 1 and 6, as well as line 12 (which precedes the add at line 31 within foo's scope), are part of the ground truth, but not line 16.
Our definition of ground truth refines the one used in related work [78, 49] by including the location that precedes an add, and by considering only locations within scope. We found that this definition better captures the programmer's intent and the fix's corrective impact on a program's behavior.
How to best characterize bugs of omission (fixed by an add) in fault localization remains an open issue [57]. Pearson et al.'s study [49] proposed the first viable solution: including the location following an add. Zou et al. [78] followed the same approach, and hence we also include the location following an add in our ground truth computation. By also including the location preceding an add, and by taking scope into account, our ground truth computation becomes more comprehensive; in particular, it also works for statements added at the very end of a file, a location that has no following lines. While our approach is usually more precise, it is not necessarily preferable in all cases. Consider again, for instance, the add at line 31 in Figure 3; if we ignored the scope (and the preceding statement), only line 16 would be included in its ground truth. If this fault localization information were consumed by a developer, it could still be useful and actionable even though it reports a line outside the scope of the actual add location: the developer would use the location as a starting point for inspecting the nearby code, and they may prefer a smaller, if slightly imprecise, ground truth to a larger, redundant one. However, this paper's focus is strictly evaluating the effectiveness of FL techniques as rigorously as possible, for which our stricter ground truth computation is more appropriate.
Kinds of bugs. The notion of crashing and predicate bugs is from Zou et al. [78].
We introduced the notion of mutable bug to capture scenarios where MBFL techniques have a fighting chance of correctly localizing a bug. Since MBFL uses mutant analysis for fault localization, its capabilities depend on the mutation operators used to generate the mutants; therefore, the notion of mutable bug is somewhat dependent on the applied mutation operators. Our implementation of FAUXPY uses the standard operators offered by Cosmic Ray [8], a popular Python mutation testing framework. As we discussed in Section 3, Cosmic Ray features a set of mutation operators that is largely similar to those of several other general-purpose mutation testing frameworks, all based on Offutt et al.'s well-known work [44]. These strong similarities between the mutation operators offered by most widely used mutation testing frameworks suggest that our definition of "mutable bug" does not strongly depend on the specific mutation testing framework that is used. Correspondingly, bugs that we classify as "mutable" are likely to remain amenable to localization with MBFL, provided one uses (at least) this standard set of core mutation operators. Conversely, we expect that devising new, specialized mutation operators may extend the number of bugs that classify as "mutable", and hence that are likely to be amenable to localization with MBFL techniques.
Figure 4 shows the kinds of the 135 BUGSINPY bugs we used in the experiments: 49 crashing bugs, 52 predicate bugs, 74 mutable bugs, and 34 bugs that do not belong to any of these categories (the kinds are not mutually exclusive, so a bug may belong to more than one).
Project category. Another, orthogonal classification of bugs is according to the project category they belong to. We classify a bug b as a CL, DEV, DS, or WEB bug according to the category (Table 2) of the project b belongs to.

Ranking Program Entities
Running a fault localization technique L on a bug b returns a list of program entities ℓ_1, ℓ_2, …, sorted by decreasing suspiciousness scores s_1 ≥ s_2 ≥ …. The programmer (or, more realistically, a tool [48, 16]) goes through the entities in this order until a faulty entity (that is, an ℓ ∈ F(b) that matches b's ground truth) is found. In this idealized process, the earlier a faulty entity appears in the list, the less time the programmer spends going through it, and hence the more effective fault localization technique L is on bug b. Thus, a program entity's rank in the sorted list of suspicious entities is a key measure of fault localization effectiveness.
Computing a program entity ℓ's rank is trivial if there are no ties between scores. For example, consider Table 3's first two program entities ℓ_1 and ℓ_2, with suspiciousness scores s_1 = 10 and s_2 = 7. Obviously, ℓ_1's rank is 1 and ℓ_2's is 2; since ℓ_2 is faulty (ℓ_2 ∈ F(b)), its rank is also a measure of how many entities need to be inspected in the aforementioned debugging process.
When several program entities tie with the same suspiciousness score, their relative order in a ranking is immaterial [10]. Thus, it is common practice to give all of them the same average rank [57, 62], capturing the average-case number of program entities inspected while going through the fault localization output list. For example, consider Table 3's first five program entities ℓ_1, …, ℓ_5: ℓ_3, ℓ_4, and ℓ_5 all have the same suspiciousness score s = 4. Thus, they all get the same average rank 4 = (3 + 4 + 5)/3, which is a proxy for how many entities need to be inspected if ℓ_4 were faulty but ℓ_2 were not.
Capturing the "average number of inspected entities" is trickier still if more than one entity is faulty within a group of tied entities. Consider now all ten of Table 3's program entities: ℓ_8, ℓ_9, and ℓ_10 all have suspiciousness score s = 2; ℓ_8 and ℓ_9 are faulty, whereas ℓ_10 is not. Their average rank 9 = (8 + 9 + 10)/3 overestimates the number of entities to be inspected (assuming now that these are the only faulty entities in the output): since two entities out of three are faulty, it is likely that a faulty entity appears before rank 9.
To properly account for such scenarios, Zou et al. [78] introduced the E_inspect metric, which ranks a program entity ℓ within a list ⟨ℓ_1, s_1⟩ … ⟨ℓ_n, s_n⟩ of program entities ℓ_1, …, ℓ_n with suspiciousness scores s_1 ≥ … ≥ s_n as:

  I_b(ℓ, ⟨ℓ_1, s_1⟩ … ⟨ℓ_n, s_n⟩) = (start(ℓ) − 1) + (ties(ℓ) + 1) / (faulty(ℓ) + 1)   (6)

In (6), start(ℓ) is the position k of the first entity among those with the same score as ℓ's; ties(ℓ) is the number of entities (including ℓ itself) whose score is the same as ℓ's; and faulty(ℓ) is the number of entities (including ℓ itself) that tie ℓ's score and are faulty (that is, are in F(b)). Intuitively, the E_inspect rank I_b(ℓ, ⟨ℓ_1, s_1⟩ … ⟨ℓ_n, s_n⟩) is an average of all possible ranks where tied and faulty entities are shuffled randomly. When there are no ties, or only one entity among a group of ties is faulty, (6) coincides with the average rank. Henceforth, we refer to a location's E_inspect rank I_b(ℓ, ⟨ℓ_1, s_1⟩ … ⟨ℓ_n, s_n⟩) simply as its rank.
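As a concrete illustration, the following sketch computes the E_inspect rank of the first faulty entity in a ranked list, using the expected position (m + 1)/(f + 1) of the first of f faulty entities among m uniformly shuffled ties; this coincides with the average rank when f ≤ 1. The entity names, and the scores of ℓ_6 and ℓ_7 (which the text does not specify), are placeholders.

```python
def e_inspect(entities, faulty):
    """E_inspect rank of the first faulty entity in a ranked list.

    entities: list of (name, score) pairs, sorted by non-increasing score.
    faulty: set of entity names that match the ground truth.
    Within a block of m tied entities containing f faulty ones, the expected
    position of the first faulty entity under a uniform shuffle of the ties
    is (m + 1) / (f + 1), which reduces to the average rank when f <= 1.
    """
    start, i, n = 1, 0, len(entities)
    while i < n:
        # Find the block of entities tied with entities[i]'s score.
        j = i
        while j < n and entities[j][1] == entities[i][1]:
            j += 1
        m = j - i  # number of ties in this block
        f = sum(1 for name, _ in entities[i:j] if name in faulty)
        if f > 0:
            return (start - 1) + (m + 1) / (f + 1)
        start += m
        i = j
    return None  # no faulty entity appears in the list

# Table 3's example: l8, l9, l10 tie with score 2; l8 and l9 are faulty.
# Scores of l6 and l7 are placeholders consistent with the ordering.
ranked = [("l1", 10), ("l2", 7), ("l3", 4), ("l4", 4), ("l5", 4),
          ("l6", 3), ("l7", 3), ("l8", 2), ("l9", 2), ("l10", 2)]
```

With `faulty = {"l8", "l9"}` this yields 7 + 4/3 ≈ 8.33, below the average rank 9, as the text's example discusses; with `faulty = {"l4"}` it yields the average rank 4.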
Better vs. worse ranks. A clarification about terminology: a high rank is a rank close to the top-1 rank (the first rank), whereas a low rank is further away from the top-1 rank. Correspondingly, a high rank corresponds to a small numerical ordinal value, and a low rank to a large one. Consistently with this standard usage, the rest of the paper uses "better" ranks to mean "higher" ranks (smaller ordinals), and "worse" ranks to mean "lower" ranks (larger ordinals).
Generalized E_inspect effectiveness. What happens if a FL technique L cannot localize a bug b, that is, b's faulty entities F(b) do not appear at all in L's output? According to (6) and (7), I_b(L) is undefined in these cases. This is not ideal, as it fails to measure the effort wasted going through the location list when using L to localize b, which is the original intuition behind all rank metrics. Thus, we introduce a generalization of L's E_inspect rank on bug b as follows. Given the list L(b) = ⟨ℓ_1, s_1⟩ … ⟨ℓ_n, s_n⟩ of entities and suspiciousness scores returned by L running on b, let L_∞(b) = ⟨ℓ_1, s_1⟩ … ⟨ℓ_n, s_n⟩ ⟨ℓ_{n+1}, s_0⟩ ⟨ℓ_{n+2}, s_0⟩ … be L(b) followed by all other entities ℓ_{n+1}, ℓ_{n+2}, … in program p_b^- that are not returned by L, each given a suspiciousness score s_0 < s_n lower than any suspiciousness score assigned by L.

Fault Localization Effectiveness Metrics
With this definition, a technique is penalized for returning a longer list: all else being equal, a technique that returns a shorter list is "better" than one that returns a longer list, since it requires less of the user's time to inspect the output. Accordingly, I_b(L) denotes L's generalized E_inspect rank on bug b, defined as in (7).
Exam score effectiveness. Another commonly used effectiveness metric is the exam score E_b(L) [69], which is simply FL technique L's E_inspect rank on bug b (as in (7)) divided by the number |p_b^-| of program entities of the analyzed buggy program p_b^-.
Effectiveness of a technique. To assess the overall effectiveness of a FL technique over a set B of bugs, we aggregate the previously introduced metrics in different ways, as in (8). The L@B n metric counts the number of bugs in B that L could localize within the top-n positions (according to their E_inspect rank); n = 1, 3, 5, 10 are common choices, reflecting a "feasible" number of entities to inspect. Then, the L@B n% = 100 · L@B n / |B| metric is simply L@B n expressed as a percentage of all bugs in B.
Different FL families use different kinds of information to compute suspiciousness scores; this is also reflected by the entities that may appear in their output location lists. SBFL techniques include all locations executed by any tests in T_b (passing or failing), even those whose suspiciousness is zero; conversely, they omit all locations that are not executed by the tests. MBFL techniques include all locations executed by any failing tests F_b, since these locations are the targets of the mutation operators. PS includes the locations of all predicates (branching conditions) that are executed by any failing tests F_b and that are critical (as defined in Section 2.3). ST includes the locations of all functions that appear in the stack trace of any crashing test in F_b.
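A minimal sketch of how these aggregate metrics can be computed from a collection of E_inspect ranks; the function names and sample ranks are illustrative, not part of FAUXPY.

```python
def exam_score(rank, num_entities):
    # Exam score: fraction of the program's entities inspected before
    # reaching the first faulty one (lower is better).
    return rank / num_entities

def at_n(ranks, n):
    # L@n: number of bugs whose first faulty entity is ranked within
    # the top-n positions (None marks a bug the technique cannot localize).
    return sum(1 for r in ranks if r is not None and r <= n)

# Hypothetical E_inspect ranks of one technique on five bugs:
ranks = [1, 2.5, 4, 12, None]
top1 = at_n(ranks, 1)                        # 1 bug localized at top-1
top3 = at_n(ranks, 3)                        # 2 bugs within top-3
top5 = at_n(ranks, 5)                        # 3 bugs within top-5
pct10 = 100 * at_n(ranks, 10) / len(ranks)   # L@10% = 60.0
```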
Effectiveness metrics: limitations. Despite being commonly used in fault localization research, the effectiveness metrics presented in this section rely on assumptions that may not realistically capture the debugging work of developers. First, they assume that a developer can understand the characteristics of a bug and devise a suitable fix by examining just one buggy entity; in contrast, debugging often involves disparate activities, such as analyzing control and data dependencies and inspecting program states with different inputs [47]. Second, debugging is often not a linear sequence of activities [31] as simple as going through the ranked list of entities produced by fault localization techniques. Despite these limitations, we still rely on this section's effectiveness metrics: on the one hand, they are used in practically all related work on fault localization (in particular, Zou et al. [77]); thus, they make our results comparable to others'. On the other hand, there are no viable, easy-to-measure alternative metrics that are also fully realistic; devising such metrics is outside this paper's scope and belongs to future work.
Statistical comparisons. A small value of p is commonly taken as evidence against the "null hypothesis" that the distributions underlying M_B(F_1) and M_B(F_2) have the same median: usually, p ≤ 0.05, p ≤ 0.01, and p ≤ 0.001 are three conventional thresholds of increasing strength. We also report Cliff's δ effect size: a nonparametric measure of how often the values in M_B(F_1) are larger than those in M_B(F_2).
The absolute value |δ| of the effect size δ measures how much the values of metric M differ, on the same bugs, between F_1 and F_2 [54]: if 0 ≤ |δ| < 0.147 the differences are negligible; if 0.147 ≤ |δ| < 0.33 they are small; if 0.33 ≤ |δ| < 0.474 they are medium; and if 0.474 ≤ |δ| ≤ 1 they are large.
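For concreteness, Cliff's δ and its conventional magnitude thresholds can be computed as follows (a plain sketch; the function names are ours):

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all pairs, in [-1, 1]."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

def magnitude(delta):
    # Conventional interpretation thresholds for |delta| [54].
    d = abs(delta)
    if d < 0.147:
        return "negligible"
    if d < 0.33:
        return "small"
    if d < 0.474:
        return "medium"
    return "large"
```

For example, two identical samples give δ = 0 ("negligible"), while a sample that always dominates another gives δ = 1 ("large").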
Regression models. To ferret out the individual impact of several factors (fault localization family, project category, and bug kind) on the capabilities of fault localization, we introduce two varying-effects regression models with normal likelihood and logarithmic link function.
Model (9) is multivariate, as it simultaneously captures the effectiveness and runtime cost of fault localization. For each fault localization experiment on a bug b, (9) expresses the vector [E_b, T_b] of standardized E_inspect rank and running time in terms of varying effects for the fault localization family and the project category. Model (10) instead includes coefficients (c_F, p_F, and m_F) for the interactions between each family and crashing bugs, predicate bugs, and bugs with different mutability. Variables crashing and predicate are indicator variables, equal to 1 respectively for crashing or predicate-related bugs, and 0 otherwise; variable mutability is instead the mutability percentage defined in Section 4.3.
After completing regression models (9) and (10) with suitable priors, fitting them on our experimental data gives a (sampled) distribution of values for the coefficients α's, c, p, m, and β's, which we can analyze to infer the effects of the various predictors on the outcomes. For example, if the 95% probability interval of α_F's distribution lies entirely below zero, it suggests that FL family F is consistently associated with below-average values of the E_inspect metric I; in other words, F tends to be more effective than techniques in other families. As another example, if the 95% probability interval of β_C's distribution includes zero, it suggests that bugs in projects of category C are not consistently associated with different-than-average running times; in other words, bugs in these projects do not seem either faster or slower to analyze than those in other projects.

Experimental Methodology
To answer Section 4's research questions, we ran FAUXPY with each of the 7 fault localization techniques described in Section 2 on all 135 selected bugs (described in Section 4.1) from BUGSINPY v. b4bfe91, for a total of 945 = 7 × 135 FL experiments. Henceforth, the term "standalone techniques" refers to the 7 classic FL techniques described in Section 2, whereas "combined techniques" refers to the four techniques introduced for RQ4.
Test selection. The test suites of projects such as keras (included in BUGSINPY) are very large and can take more than 24 hours to run even once. Without a suitable test selection strategy, large-scale FL experiments would be prohibitively time consuming (especially for MBFL techniques, which rerun the same test suite hundreds of times). Therefore, we applied a simple test selection strategy that only includes tests directly targeting the parts of a program that contribute to the failures. As we mentioned in Section 4.1, each bug b in BUGSINPY comes with a selection of failing tests F_b and passing tests P_b. The failing tests are usually just a few, and specifically trigger bug b. The passing tests, in contrast, are much more numerous, as they usually include all non-failing tests available in the project. To cull the passing tests down to those that expressly target the failing code, we applied a simple dependency analysis: for each BUGSINPY bug b used in our experiments, we built the module-level call graph G(b) for the whole of b's project; each node in G(b) is a module of the project (including its tests), and each edge x_m → y_m means that module x_m directly uses some entities defined in module y_m. Consider any test module t_m of b's project; we run the tests in t_m in our experiments if and only if: i) t_m includes at least one of the failing tests in F_b; or ii) G(b) includes an edge t_m → f_m, where f_m is a module that includes at least one of b's faulty locations F(b) (see Section 4.2). In other words, we include all failing tests for b, as well as the passing tests that directly exercise the parts of the project that are faulty. This simple heuristic substantially reduced the number of tests that we had to run for the largest projects, without meaningfully affecting the fault localization's scope.
Our test selection strategy does not include test modules that only indirectly involve faulty locations (unless they include some failing tests). Suppose that the tests in a module t_m directly call only an application module x_m, that some parts of x_m call another application module y_m (i.e., t_m → x_m → y_m in the module-level call graph), that x_m does not include any faulty locations, and that y_m does. Then, we do not include the tests in t_m in our test suite; instead, we include other test modules u_m that directly call y_m (i.e., u_m → y_m).
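The selection rule can be sketched as follows; this is a simplified illustration with hypothetical module names, not FAUXPY's actual implementation, which works on the projects' real call graphs.

```python
def select_test_modules(call_graph, test_modules, failing_test_modules,
                        faulty_modules):
    """Select the test modules to run for one bug.

    call_graph: dict mapping a module to the set of modules it directly uses.
    A test module is selected iff it contains a failing test, or it directly
    uses a module containing a faulty location; indirect uses are ignored.
    """
    selected = set()
    for t in test_modules:
        if t in failing_test_modules:
            selected.add(t)                              # rule i)
        elif call_graph.get(t, set()) & faulty_modules:
            selected.add(t)                              # rule ii)
    return selected

# Hypothetical module-level call graph: t1 -> x -> y, t2 -> y; y is faulty.
g = {"t1": {"x"}, "t2": {"y"}, "x": {"y"}}
chosen = select_test_modules(g, {"t1", "t2"}, set(), {"y"})
# t2 is selected (it directly uses y); t1 is not (only an indirect dependency).
```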
To demonstrate that our more aggressive test selection strategy does not exclude any relevant tests, and is unlikely to affect the quantitative fault localization results, we first computed, for each bug b used in our experiments: i) the set S0_b of tests selected using the strategy described above; and ii) the set S+_b ⊇ S0_b of tests selected by also including indirect dependencies (i.e., by taking the transitive closure of the module-level use relation). For 48% of the 135 bugs used in our experiments, S+_b = S0_b, that is, both test selection strategies select the same tests. However, there remains a long tail of bugs for which including indirect dependencies leads to many more tests being selected; for example, for 40 bugs in 7 projects, considering indirect dependencies leads to selecting more than 50 additional tests, which would significantly increase the experiments' running time. Thus, among those 40 bugs, we randomly selected one bug per project. For each bug b in this sample, we performed an additional run of our fault localization experiments with SBFL and MBFL techniques using all tests in S+_b, for a total of 35 new experiments. We found that none of the key fault localization effectiveness metrics significantly changed compared to the same experiments using only the tests in S0_b. This confirms that our test selection strategy does not alter the general effectiveness of fault localization, and hence we adopted it for the rest of the paper's experiments.
Table 4 shows statistics about the fraction of tests selected for our experiments according to this test selection strategy. These data indicate that test selection has a disproportionate impact on projects with very large test suites, such as those in the DS category. In these projects, the vast majority of tests are often irrelevant for the portion of the project where a failure occurred; therefore, excluding these tests from our experiments is instrumental in drastically bringing down execution times without sacrificing experimental accuracy.
Experimental setup. Each experiment ran on a node of USI's HPC cluster, with each node equipped with a 20-core Intel Xeon E5-2650 processor and 64 GB of DDR4 RAM, accessing a shared 15 TB RAID 10 SAS3 drive, and running CentOS 8.2.2004.x86_64. We provisioned three CPython virtualenvs with Python 3.6, 3.7, and 3.8; our scripts chose a version according to the requirements of each BUGSINPY subject. The experiments took more than two CPU-months to complete, not counting the additional time to set up the infrastructure, fix the execution scripts, and repeat any experiments that failed due to incorrect configuration. This paper's detailed replication package includes all scripts used to run these experiments, as well as all raw data that we collected by running them. The rest of this section details how we analyzed and summarized the data to answer the various research questions.

RQ1. Effectiveness
To answer RQ1 (fault localization effectiveness), we report the L@B 1%, L@B 3%, L@B 5%, and L@B 10% counts, the average generalized E_inspect rank I_B(L), the average exam score E_B(L), and the average location list length |L_B| for each technique L among Section 2's seven standalone fault localization techniques, as well as the same metrics averaged over each of the four fault localization families. These metrics measure the effectiveness of fault localization from different angles. We report these measures for all 135 BUGSINPY bugs B selected for our experiments. To qualitatively summarize the effectiveness comparison between two FL techniques A and B, we consider their counts A@1% ≤ A@3% ≤ A@5% ≤ A@10% and B@1% ≤ B@3% ≤ B@5% ≤ B@10% and compare them pairwise (A@k% vs. B@k%, for each k among 1, 3, 5, 10). We say that:
• A ≫ B: "A is much more effective than B", if A@k% > B@k% for all ks, and A@k% − B@k% ≥ 10 for at least three ks out of four;
• A > B: "A is more effective than B", if A@k% > B@k% for all ks, and A@k% − B@k% ≥ 5 for at least one k out of four;
• A ≥ B: "A tends to be more effective than B", if A@k% ≥ B@k% for all ks, and A@k% > B@k% for at least three ks out of four;
• A ≃ B: "A is about as effective as B", if none of A ≫ B, A > B, A ≥ B, B ≫ A, B > A, and B ≥ A holds.
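These four relations can be checked mechanically; the following sketch (with our own function name and return-value encoding) mirrors the definitions above:

```python
def compare_effectiveness(a, b):
    """Classify technique A vs. B from their (@1%, @3%, @5%, @10%) values.

    a, b: 4-tuples of percentages. Returns '>>', '>', '>=', their mirror
    images '<<', '<', '<=', or '~' ("about as effective").
    """
    def rel(p, q):
        diffs = [pi - qi for pi, qi in zip(p, q)]
        # ">>": strictly better everywhere, and >= 10 points better at
        # least three times out of four (checked first, as it subsumes ">").
        if all(d > 0 for d in diffs) and sum(d >= 10 for d in diffs) >= 3:
            return ">>"
        if all(d > 0 for d in diffs) and any(d >= 5 for d in diffs):
            return ">"
        if all(d >= 0 for d in diffs) and sum(d > 0 for d in diffs) >= 3:
            return ">="
        return None

    r = rel(a, b)
    if r:
        return r
    r = rel(b, a)
    if r:
        return {">>": "<<", ">": "<", ">=": "<="}[r]
    return "~"
```

For example, `compare_effectiveness((30, 45, 60, 70), (15, 30, 45, 60))` yields `">>"`, since the first technique is at least 10 points ahead at three of the four cutoffs.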
To visually compare the effectiveness of different FL families, we use scatterplots, one for each pair F_1, F_2 of families. The scatterplot comparing F_1 to F_2 displays one point at coordinates (x, y) for each bug b analyzed in our experiments: x = I_b(F_1) is the average generalized E_inspect rank that techniques in family F_1 achieved on b, and y = I_b(F_2) is the same average for family F_2. Thus, points lying below the diagonal line x = y (that is, with x > y) correspond to bugs for which family F_2 performed better than family F_1 (remember that a smaller E_inspect rank means more effective fault localization); the opposite holds for points above the diagonal. The location of the points relative to the diagonal thus gives a clear idea of which family performed better in most cases.
To analytically compare the effectiveness of different FL families, we report the estimates and the 95% probability intervals of the coefficients α_F in the fitted regression model (9), for each FL family F. If the interval of values lies entirely below zero, family F's effectiveness tends to be better than the other families' on average; if it lies entirely above zero, family F's effectiveness tends to be worse; and if it includes zero, there is no consistent association with above- or below-average effectiveness.

RQ2. Efficiency
To answer RQ2 (fault localization efficiency), we report the average wall-clock running time T_B(L) for each technique L among Section 2's seven standalone fault localization techniques on the bugs in B, as well as the same metric averaged over each of the four fault localization families. This basic metric measures how long the various FL techniques take to perform their analysis. We report these measures for all 135 BUGSINPY bugs B selected for our experiments.
To qualitatively summarize the efficiency comparison between two FL techniques A and B, we compare pairwise their average running times T(A) and T(B), and say that:
• A ≫ B ("A is much more efficient than B") and A > B ("A is more efficient than B") hold when T(A) is sufficiently smaller than T(B);
• A ≃ B: "A is about as efficient as B", if none of A ≫ B, A > B, B ≫ A, and B > A holds.
To visually compare the efficiency of different FL families, we use scatterplots, one for each pair F_1, F_2 of families. The scatterplot comparing F_1 to F_2 displays one point at coordinates (x, y) for each bug b analyzed in our experiments: x = T_b(F_1) is the average running time of techniques in family F_1 on b, and y = T_b(F_2) is the same average for family F_2. These scatterplots are interpreted like those for RQ1.
To analytically compare the efficiency of different FL families, we report the estimates and the 95% probability intervals of the coefficients β_F in the fitted regression model (9), for each FL family F. The interpretation of the regression coefficients' intervals is similar to RQ1's: β_F's interval lies entirely above zero when F tends to be slower (less efficient) than the other families; it lies entirely below zero when F tends to be faster; and it includes zero when there is no consistent association with above- or below-average efficiency.

RQ3. Kinds of Faults and Projects
To answer RQ3 (fault localization behavior for different kinds of faults and projects), we report the same effectiveness metrics considered in RQ1 (the F@X 1%, F@X 3%, F@X 5%, and F@X 10% percentages, the average generalized E_inspect ranks I_X(F), the average exam scores E_X(F), and the average location list lengths |F_X|), as well as the same efficiency metric considered in RQ2 (the average wall-clock running time T_X(F)), for each standalone fault localization family F and separately for: i) bugs X of different kinds: crashing bugs, predicate bugs, and mutable bugs (see Figure 4); ii) bugs X from projects of different categories: CL, DEV, DS, and WEB (see Section 4.3).
To visually compare the effectiveness and efficiency of fault localization families on bugs from projects of different category, we color the points in the scatterplots used to answer RQ1 and RQ2 according to the bug's project category.
To analytically compare the effectiveness of different FL families on bugs of different kinds, we report the estimates and the 95% probability intervals of the coefficients c_F, p_F, and m_F in the fitted regression model (10), for each FL family F. The interpretation of the regression coefficients' intervals is similar to RQ1's and RQ2's: c_F, p_F, and m_F characterize the effectiveness of family F respectively on crashing, predicate, and mutable bugs, relative to the average effectiveness of the same family F on other kinds of bugs.
Finally, to understand whether bugs from projects of certain categories are intrinsically harder or easier to localize, we report the estimates and the 95% probability intervals of the coefficients α_C and β_C in the fitted regression model (9), for each project category C. The interpretation of these regression coefficients' intervals is like those considered for RQ1 and RQ2; for example, if α_C's interval is entirely below zero, bugs of projects in category C are easier to localize (higher effectiveness) than the average bug in any project. This sets a useful baseline for interpreting the other data that answer RQ3.

RQ4. Combining Techniques
To answer RQ4 (the effectiveness of combining FL techniques), we consider two additional fault localization techniques, CombineFL and AvgFL, both of which combine the information collected by some of Section 2's standalone techniques from different families.
CombineFL was introduced by Zou et al. [78]; it uses a learning-to-rank model to learn how to combine the ranked location lists produced by different FL techniques. After fitting the model on labeled training data, one can use it like any other fault localization technique as follows: i) run any combination of techniques L_1, …, L_n on a bug b; ii) feed the ranked location lists output by each technique into the fitted learning-to-rank model; iii) the model's output is a list ℓ_1, ℓ_2, … of locations, which is taken as the FL output of technique CombineFL. We used Zou et al. [78]'s replication package to run CombineFL on the Python bugs that we analyzed with FAUXPY.
To see whether a simpler combination algorithm can still be effective, we introduced the combined FL technique AvgFL, which works as follows: i) each basic technique L_k returns a list ⟨ℓ^k_1, s^k_1⟩ ... ⟨ℓ^k_{n_k}, s^k_{n_k}⟩ of locations with normalized suspiciousness scores 0 ≤ s^k_j ≤ 1; ii) AvgFL assigns to location ℓ_x the weighted average ∑_k w_k s^k_x, where k ranges over all FL techniques supported by FAUXPY but Tarantula, and w_k is an integer weight that depends on the FL family of k: 3 for SBFL, 2 for MBFL, and 1 for PS and ST; iii) the list of locations ranked by their weighted average suspiciousness is taken as the FL output of technique AvgFL.
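As a concrete illustration, AvgFL's scoring step can be sketched in a few lines of Python. This is a simplified sketch, not FAUXPY's actual implementation: the input format is our own, and the treatment of locations missing from a technique's list (implicitly scored 0) is our assumption.

```python
from collections import defaultdict

# Integer family weights used by AvgFL: 3 for SBFL, 2 for MBFL, 1 for PS and ST.
FAMILY_WEIGHT = {"SBFL": 3, "MBFL": 2, "PS": 1, "ST": 1}

def avg_fl(technique_outputs):
    """Rank locations by weighted-average suspiciousness.

    `technique_outputs` maps each technique name to a pair
    (family, {location: normalized_score}), with scores in [0, 1].
    A location absent from a technique's list implicitly scores 0.
    """
    totals = defaultdict(float)
    for family, scores in technique_outputs.values():
        weight = FAMILY_WEIGHT[family]
        for location, score in scores.items():
            totals[location] += weight * score
    # Dividing by the total weight would not change the ranking,
    # so summing weighted scores suffices to order locations.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

For example, a location scored 0.5 by an SBFL technique and 1.0 by an MBFL technique gets combined score 3 · 0.5 + 2 · 1.0 = 3.5.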
Finally, we answer RQ4 by reporting the same effectiveness metrics considered in RQ1 (the L@_B 1%, L@_B 3%, L@_B 5%, and L@_B 10% counts, the average generalized E_inspect rank I_B(L), the average exam score E_B(L), and the average location list length |L_B|) for techniques CombineFL and AvgFL. Precisely, we consider two variants A and S of CombineFL and of AvgFL, giving a total of four combined fault localization techniques: variants A (CombineFL_A and AvgFL_A) use the output of all FL techniques supported by FAUXPY but Tarantula, which was not considered in [78]; variants S (CombineFL_S and AvgFL_S) only use the Ochiai, DStar, and ST FL techniques (excluding the more time-consuming MBFL and PS families).
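To make the metrics concrete, here is a minimal sketch of the top-n counts and the exam score. This is our own simplification: the paper's generalized E_inspect (Section 4.5) additionally handles ties and bugs that a technique fails to localize.

```python
def first_faulty_rank(ranked_locations, ground_truth):
    """1-based rank of the first faulty location, or None if not localized."""
    for rank, location in enumerate(ranked_locations, start=1):
        if location in ground_truth:
            return rank
    return None

def top_n_count(results, ground_truths, n):
    """Number of bugs whose first faulty location is within the top-n positions."""
    return sum(
        1 for bug, ranked in results.items()
        if (r := first_faulty_rank(ranked, ground_truths[bug])) is not None and r <= n
    )

def exam_score(ranked_locations, ground_truth, num_locations):
    """Fraction of program locations inspected before reaching the fault."""
    rank = first_faulty_rank(ranked_locations, ground_truth)
    return None if rank is None else rank / num_locations
```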

RQ5. Granularity
To answer RQ5 (how fault localization effectiveness changes with granularity), we report the same effectiveness metrics considered in RQ1 (the L@_B 1, L@_B 3, L@_B 5, and L@_B 10 counts, the average generalized E_inspect rank I_B(L), the average exam score E_B(L), and the average location list length |L_B|) for all seven standalone techniques, and for all four combined techniques, but targeting functions and modules as suspicious entities. Similar to Zou et al. [78], for function-level and module-level granularities, we define the suspiciousness score of an entity as the maximum suspiciousness score computed for the statements it contains.
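The aggregation from statement-level scores to coarser entities can be sketched as follows. This is a minimal sketch; the mapping from statements to their enclosing function or module is assumed to be available.

```python
def lift_granularity(statement_scores, container_of):
    """Aggregate statement-level suspiciousness to a coarser granularity.

    `statement_scores` maps a statement (e.g. "mod.py:42") to its score;
    `container_of` maps each statement to its enclosing entity (a function
    or a module).  An entity's score is the maximum over its statements.
    """
    entity_scores = {}
    for statement, score in statement_scores.items():
        entity = container_of[statement]
        if entity not in entity_scores or score > entity_scores[entity]:
            entity_scores[entity] = score
    return entity_scores
```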

RQ6. Comparison to Java
To answer RQ6 (comparison between Python and Java), we quantitatively and qualitatively compare the main findings of Zou et al. [78], whose empirical study of fault localization in Java was the basis for our Python replication, against our findings for Python.
For the quantitative comparison of effectiveness, we consider the metrics that are available in both studies: the percentage of all bugs each technique localized within the top-1, top-3, top-5, and top-10 positions of its output (L@1%, L@3%, L@5%, and L@10%); and the average exam score. For Python, we consider all 135 BUGSINPY bugs we selected for our experiments; the data for Java is about Zou et al.'s experiments on 357 bugs in Defects4J [28]. We consider all standalone techniques that feature in both studies: Ochiai and DStar (SBFL), Metallaxis and Muse (MBFL), predicate switching (PS), and stack-trace fault localization (ST).
We also consider the combined techniques CombineFL_A and CombineFL_S. The original idea of the CombineFL technique was introduced by Zou et al.; however, the variants used in their experiments combine all eleven FL techniques they consider, some of which we did not include in our replication (see Section 3 for details). Therefore, we modified [78]'s replication package to extract from their Java experimental data the rankings obtained by CombineFL_A and CombineFL_S combining the same techniques as in Python (see Section 4.7.4). This way, the quantitative comparison between Python and Java involves exactly the same techniques and combinations thereof.
Since we did not re-run Zou et al.'s experiments on the same machines used for our experiments, we cannot compare efficiency quantitatively. In any case, a comparison of this kind between Java and Python would be outside the scope of our studies, since any difference would likely merely reflect the different performance of Java and Python, largely independent of fault localization efficiency.
For the qualitative comparison between Java and Python, we consider the union of all findings presented in this paper or in Zou et al. [78]; we discard all findings from one paper that are outside the scope of the other paper (for example, Java findings about history-based fault localization, a standalone technique that we did not implement for Python; or Python findings about AvgFL, a combined technique that Zou et al. did not implement for Java); for each within-scope finding, we determine whether it is confirmed (there is evidence corroborating it) or refuted (there is evidence against it) for Python and for Java.

Experimental Results
This section summarizes the experimental results that answer the research questions detailed in Section 4.7. All results except for Section 5.5's refer to experiments with statement-level granularity; results in Sections 5.1-5.3 only consider standalone techniques. To keep the discussion focused, we mostly comment on the @n% metrics of effectiveness, whereas we only touch upon the exam score, E_inspect, and location list length when they complement other results.

RQ1. Effectiveness
Family effectiveness. Among standalone techniques, the SBFL fault localization family achieves the best effectiveness according to several metrics. Table 5 reports, for each technique L, its average generalized E_inspect rank I_B(L); the percentage of all bugs it localized within the top-1, top-3, top-5, and top-10 positions of its output (L@_B 1%, L@_B 3%, L@_B 5%, and L@_B 10%); its average exam score E_B(L); and its average suspicious-location list length |L_B|. Columns F report the same metrics averaged over all techniques that belong to the same FAMILY; highlighted numbers denote the best technique according to each metric.

The table shows that all SBFL techniques have a better average E_inspect rank I and higher percentages of faulty locations in the top-1, top-3, top-5, and top-10. The advantage over MBFL, the second most-effective family, is consistent and conspicuous. According to the same metrics, the MBFL fault localization family achieves clearly better effectiveness than PS and ST. Then, PS tends to do better than ST, but only according to some metrics: PS has better @1%, @3%, @5%, and location list length, whereas ST has better E_inspect and @10%. Contrary to these general trends, PS achieves the best (lowest) exam score and location list length of all families; and ST is second-best according to these metrics. As Section 5.3 will discuss in more detail, PS and ST are techniques with a narrower scope than SBFL and MBFL: they can perform very well on a subset of bugs, but they fail spectacularly on several others. They also tend to return shorter lists of suspicious locations, which is also conducive to achieving a better exam score: since the exam score is undefined when a technique fails to localize a bug at all (as explained in Section 4.5), the average exam scores of ST and, especially, PS are computed over the small set of bugs on which they work fairly well. Figure 6's scatterplots confirm SBFL's general advantage: in each scatterplot involving SBFL, all points are on a straight line corresponding to low ranks for SBFL but increasingly high ranks for the other family. The plots also indicate that MBFL is often better than PS and ST, although there are a few hard bugs for which the latter are just as effective (points on the diagonal line). The PS-vs-ST scatterplot suggests that these two techniques are largely complementary: on several bugs, PS and ST are equally effective (points on the diagonal); on several others, PS is more effective (points above the diagonal); and on others still, ST is more effective (points below the diagonal).
Figure 7a confirms these results based on the statistical model (9): the intervals of coefficients α_SBFL and α_MBFL are clearly below zero, indicating that SBFL and MBFL have better-than-average effectiveness; conversely, those of coefficients α_PS and α_ST are clearly above zero, indicating that PS and ST have worse-than-average effectiveness. Figure 7a's estimate of α_SBFL is below that of α_MBFL, confirming that SBFL is the most effective family overall. The bottom-left plot in Figure 6 confirms that SBFL's advantage can be conspicuous but is observed only on a minority of bugs, whereas SBFL and MBFL achieve similar effectiveness on the majority of bugs. In fact, the effect size comparing SBFL and MBFL is −0.18, weakly in favor of SBFL.
Finding 1.4: SBFL and MBFL often achieve similar effectiveness; however, SBFL is strictly better than MBFL on a minority of bugs.
Technique effectiveness. FL techniques of the same family achieve very similar effectiveness. Table 5 shows nearly identical results for the 3 SBFL techniques Tarantula, Ochiai, and DStar. The plots and statistics in Figure 8 confirm this: points lie along the diagonal lines in the scatterplots, and E_inspect ranks for the same bugs are strongly correlated and differ by a vanishing effect size. The 2 MBFL techniques also behave similarly, but not quite as closely as the SBFL ones. Metallaxis has a modest but consistent advantage over Muse according to Table 5. Figure 9 corroborates this observation: the cloud of points in the scatterplot is centered slightly above the diagonal line; the correlation between Muse's and Metallaxis's data is medium (not strong); and the effect size suggests that Metallaxis is more effective on around 11% of subjects.
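The closeness of the SBFL techniques is unsurprising given that all three formulas are computed from the same four per-entity counters. This sketch uses the standard textbook definitions (with the common D* exponent of 2, and a zero-denominator convention of our choosing, which FAUXPY may handle differently):

```python
import math

# ef/ep: failing/passing tests that execute the entity;
# nf/np_: failing/passing tests that do not execute it.

def tarantula(ef, ep, nf, np_):
    fail_ratio = ef / (ef + nf) if ef + nf else 0.0
    pass_ratio = ep / (ep + np_) if ep + np_ else 0.0
    total = fail_ratio + pass_ratio
    return fail_ratio / total if total else 0.0

def ochiai(ef, ep, nf, np_):
    denominator = math.sqrt((ef + nf) * (ef + ep))
    return ef / denominator if denominator else 0.0

def dstar(ef, ep, nf, np_, star=2):
    denominator = ep + nf
    return ef ** star / denominator if denominator else math.inf
```

An entity covered by every failing test and by no passing test receives the maximum score under all three formulas, which is one intuition for why their rankings tend to agree.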
Muse's lower effectiveness can be traced back to its stricter definition of "mutant killing", which requires that a failing test become passing when run on a mutant (see Section 2.2). As observed elsewhere [49], this requirement may be too demanding for fault localization of real-world bugs, where it is essentially tantamount to generating a mutant that is similar to a patch.

Finding 1.6: The techniques in the MBFL family achieve generally similar effectiveness, but Metallaxis tends to be better than Muse.
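The difference between the two kill criteria can be sketched directly. This is our simplified formulation: test outcomes are abstracted to strings, and Metallaxis's criterion (any behavior change of a failing test) follows the common description in the literature rather than FAUXPY's exact code.

```python
def muse_kills(original_outcomes, mutant_outcomes):
    """Muse's strict criterion: some originally failing test passes on the mutant."""
    return any(
        original_outcomes[test] == "fail" and mutant_outcomes[test] == "pass"
        for test in original_outcomes
    )

def metallaxis_kills(original_outcomes, mutant_outcomes):
    """Metallaxis's looser criterion: some originally failing test changes
    its observed behavior on the mutant (e.g., fails differently or errors)."""
    return any(
        original_outcomes[test] == "fail"
        and mutant_outcomes[test] != original_outcomes[test]
        for test in original_outcomes
    )
```

A mutant that merely changes how a failing test fails is killed for Metallaxis but not for Muse, which is why Muse's requirement is, on real bugs, close to demanding a patch-like mutant.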

RQ2. Efficiency
As Table 6 shows, the four FL families differ greatly in their efficiency, measured as their wall-clock running time. ST is by far the fastest, taking a mere 2 seconds per bug on average; SBFL is second-fastest, taking around 10 minutes on average; PS is one order of magnitude slower, taking approximately 2.7 hours on average; and MBFL is slower still, taking over 4 hours per bug on average. In the scatterplots comparing per-bug running times, many points favor PS over MBFL, but there are also several points on the opposite side of the diagonal; their effect size (0.34) is medium, lower than all other pairwise effect sizes in the comparison of efficiency.
Finding 2.2: PS is more efficient than MBFL on average; however, the two families tend to be faster or slower on different bugs.
Based on the statistical model (9), Figure 7a clearly confirms the differences in efficiency: the intervals of coefficients β_ST and β_SBFL are well below zero, indicating that ST and SBFL are faster than average (with ST the fastest, as its estimated β_ST is lower); conversely, the intervals of coefficients β_MBFL and β_PS are entirely above zero, indicating that MBFL and PS stand out as slower than average compared to the other families. These major differences in efficiency are unsurprising if one remembers that the various FL families differ in what kind of information they collect for localization. ST only needs the stack-trace information, which requires running the failing tests just once; SBFL compares the traces of passing and failing runs, which involves running all tests once. PS dynamically tries out a large number of different branch changes in a program, each of which reruns the failing tests; in our experiments, PS tried 4 588 different "switches" on average for each bug, up to a whopping 101 454 switches for project black's bug #6. MBFL generates hundreds of different mutations of the program under analysis, each of which has to be run against all tests; in our experiments, MBFL generated 461 mutants on average for each bug, up to 2 718 mutants for project black's bug #6. After collecting this information, the additional running time to compute suspiciousness scores (using the formulas presented in Section 2) is negligible for all techniques, which explains why the running times of techniques of the same family are practically indistinguishable.
[Caption fragment: effect sizes quantitatively compare the per-bug average running times of two techniques C and R; negative values mean that R tends to be better, and positive values that C tends to be better.]
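A toy sketch of predicate switching's core loop follows. This is entirely our own illustration: the program, the predicate hook, and the single-predicate search are made up for exposition, whereas a real implementation instruments branch predicates dynamically.

```python
def run_program(x, flip=None):
    """Toy buggy program whose branch predicates go through a hook,
    so that a named predicate instance can be forcibly negated."""
    def pred(name, value):
        return (not value) if name == flip else value

    if pred("p1", x > 10):  # buggy guard: the intended condition is x >= 10
        return "big"
    return "small"

def find_critical_predicates(failing_input, expected_output, predicates=("p1",)):
    """Flip each predicate in turn on the failing run; predicates whose flip
    makes the run produce the expected output are critical, and PS reports
    their locations as the most suspicious ones."""
    return [
        name for name in predicates
        if run_program(failing_input, flip=name) == expected_output
    ]
```

Here input 10 should yield "big" but the buggy guard yields "small"; flipping p1 turns the failing run into a passing one, so p1 is critical.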

RQ3. Kinds of Faults and Projects
Project category: effectiveness. Figure 7's intervals of the coefficients α_C in model (9) indicate that fault localization tends to be more accurate on projects in categories DEV and WEB, and less accurate on projects in categories CL and DS.
This finding is consistent with the observations that data science programs, their bugs, and their fixes are often different compared to traditional programs [22, 23]. For instance, bug #38 in project keras is an example of what Islam et al. call "structural data flow" bugs [22]: its root cause is passing an incorrect input shape setting to a neural network layer. These characteristics also determine long spectra (i.e., execution traces) that span several functions, which are required to construct the various layer objects; as a result, SBFL techniques struggle to effectively localize this bug. Bugs #68 and #137 in project pandas are instead examples of API bugs, whose root causes are incorrect import statements. While such bugs may occur in any kind of project, they are common in data science programs [22] due to their complex dependencies. In Python, import statements are usually top-level declarations; therefore, FL techniques that can only target locations inside functions end up being ineffective at localizing these API bugs. As yet another example, the overall mutability of bugs in DS projects is 0.7%, whereas it is 1.3% for bugs in other categories of projects. This indicates that the standard mutation operators, used by MBFL, are a poor fit for the kinds of bugs that are most commonly found in data science projects.

The data in Table 7's bottom section confirm that SBFL remains the most effective FL family, largely independent of the category of projects it analyzes. MBFL ranks second for effectiveness in every project category; it is not that far from SBFL for projects in categories DEV and CL (for example, MBFL and SBFL both localize 9% of CL bugs in the first position; and both localize over 40% of DEV bugs in the top-10 positions). In contrast, SBFL's advantage over MBFL is more conspicuous for projects in categories DS and WEB. Given that bugs in category CL are generally harder to localize, this suggests that the characteristics of bugs in these projects seem to be a good fit for MBFL.

Table 7: Effectiveness of fault localization families at the statement-level granularity on different kinds of bugs and categories of projects. Each row reports a FAMILY F's average generalized E_inspect rank I_X(F); the percentage of all bugs it localized within the top-1, top-3, top-5, and top-10 positions of its output (F@_X 1%, F@_X 3%, F@_X 5%, and F@_X 10%); its average exam score E_X(F); and the length |F_X| of the output list of locations on different groups X of bugs: ALL bugs selected for the experiments (same results as in Table 5); bugs of different kinds (CRASHING, PREDICATE-related, and MUTABLE bugs); and bugs from projects of different categories (CL, DEV, DS, and WEB). Highlighted numbers denote the best family on each group of bugs according to each metric.
As we have seen in Section 5.2, MBFL is the slowest FL family by far; since it reruns the available tests hundreds, or even thousands, of times, projects with a large number of tests are nearly impossible to analyze efficiently with MBFL. As we discuss below, MBFL is considerably faster on projects in category CL than on projects in other categories; this is probably the main reason why MBFL is also more effective on these projects: it simply generates a more manageable number of mutants, which sharpens the dynamic analysis. Figure 6's plots confirm some of these trends. In most plots, we see that the points positioned far from the diagonal line correspond to projects in the CL and DS categories, confirming that these "harder" bugs exacerbate the differences in effectiveness among the various FL families.
Project category: efficiency. Figure 7's intervals of the coefficients β_C in model (9) indicate that fault localization tends to be more efficient (i.e., faster) on projects in category CL, and less efficient (i.e., slower) on projects in category DS (β_DS barely touches zero). In contrast, projects in categories DEV and WEB do not have a consistent association with faster or slower fault localization. Table 2 shows that projects in category DS have the largest number of tests by far (mostly because of outlier project pandas); furthermore, some of their tests involve training and testing different machine learning models, or other kinds of time-consuming tasks. Since FL invariably requires running tests, this explains why bugs in DS projects tend to take longer to localize.

Figure 11: Point estimates (boxes) and 95% probability intervals (lines) for the regression coefficients of model (10). The scale of the vertical axes is in standard-deviation log units.
Finding 3.3: Bugs in data science projects challenge fault localization's efficiency (that is, they take longer to localize) more than bugs in other categories of projects.
The data in Table 6's right-hand side generally confirm the same rankings of efficiency among FL families, largely regardless of which category of projects we consider: ST is by far the most efficient, followed by SBFL, and then, at a distance, PS and MBFL. The difference in performance between SBFL and ST is largest for projects in category DS (three orders of magnitude), large for projects in category WEB (two orders of magnitude), and more moderate for projects in categories CL and DEV (one order of magnitude). PS is slower than MBFL only for projects in category DEV, although their absolute difference in running times is not very big (around 7.5%); in contrast, it is one order of magnitude faster for projects in categories CL and WEB.

Finding 3.4: The difference in efficiency between MBFL and SBFL is largest for data science projects.
In most of Figure 10's plots, we see that the points most frequently positioned far apart from the diagonal line correspond to projects in category DS, confirming that these bugs take longer to analyze and aggravate performance differences among techniques.In the scatterplot comparing MBFL to PS, points corresponding to projects in categories WEB and CL are mostly below the diagonal line, which corroborates the advantage of PS over MBFL for bugs of projects in these two categories.
Crashing bugs: effectiveness. According to Figure 11a, both FL families ST and MBFL are more effective on crashing bugs than on other kinds of bugs. Still, their absolute effectiveness on crashing bugs remains limited compared to SBFL's, as shown by the results in Table 7's middle part; for example, @10% on crashing bugs is 37% for ST, 34% for MBFL, and 53% for SBFL, and ST localizes zero (crashing) bugs in the top rank. Remember that ST assigns the same suspiciousness to all statements within the same function (see Section 2.4); thus, it cannot be as accurate as SBFL even on crashing bugs.
Finding 3.5: ST and MBFL are more effective on crashing bugs than on other kinds of bugs (but they remain overall less effective than SBFL even on crashing bugs).
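ST's mechanics can be sketched as follows. This is a minimal sketch, assuming the common convention of ranking stack frames from the innermost outwards; the paper's precise formulation is in Section 2.4.

```python
import traceback

def stack_trace_fl(exc):
    """Rank functions by their position in the failure's stack trace:
    the frame closest to the point of failure comes first.  Every
    statement of a ranked function then shares that function's
    suspiciousness, which bounds ST's statement-level accuracy."""
    frames = traceback.extract_tb(exc.__traceback__)
    ranked = []
    for frame in reversed(frames):  # innermost (crashing) frame first
        entry = (frame.filename, frame.name)
        if entry not in ranked:
            ranked.append(entry)
    return ranked
```

For a crash raised in `inner()` called from `outer()`, the sketch ranks `inner` above `outer`.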
On the other hand, PS is less effective on crashing bugs than on other kinds of bugs; in fact, it localizes zero crashing bugs among the top-10 ranks. PS has a chance to work only if it can find a so-called critical predicate (see Section 2.3); only three of the crashing bugs included critical predicates, and hence PS was largely ineffective on them.

Predicate-related bugs: effectiveness. Figure 11b indicates that no FL family achieves consistently better or worse effectiveness on predicate-related bugs. Table 7 complements this observation; the ranking of families by effectiveness is different for predicate-related bugs than it is for all bugs: MBFL is about as effective as SBFL, whereas PS is clearly more effective than ST.
Finding 3.7: On predicate-related bugs, MBFL is about as effective as SBFL, and PS is more effective than ST.

This outcome is somewhat unexpected for PS: predicate-related bugs are bugs whose ground truth includes at least a branching predicate (see Section 4.3), and yet PS is still clearly less effective than SBFL or MBFL. Indeed, the presence of a faulty predicate is not sufficient for PS to work: the predicate must also be critical, which means that flipping its value turns a failing test into a passing one. When a program has no critical predicates, PS simply returns an empty list of locations. In contrast, when a program has a critical predicate, PS is highly effective: PS@_χ 1% = 14%, PS@_χ 3% = 24%, and PS@_χ 5% = 31% on the 29 bugs χ with a critical predicate, even better than SBFL's results for the same bugs (SBFL@_χ 1% = 13%, SBFL@_χ 3% = 16%, and SBFL@_χ 5% = 20%). In all, PS is a highly specialized FL technique, which works quite well for a narrow category of bugs, but is inapplicable in many other cases.

Mutable bugs: effectiveness. According to Figure 11c, the MBFL family tends to be more effective on mutable bugs than on other kinds of bugs: m_MBFL's 95% probability interval is mostly below zero (and the 87% probability interval would be entirely below zero). Furthermore, Table 7 shows that MBFL is the most effective technique on mutable bugs, where it tends to outperform even SBFL. Intuitively, a bug is mutable if the syntactic mutation operators used for MBFL "match" the fault in a way that affects program behavior. Thus, the capabilities of MBFL ultimately depend on the nature of the faults it analyzes and on the selection of mutation operators it employs.

Finding 3.9: MBFL is more effective on mutable bugs than on other kinds of bugs; in fact, it is the most effective standalone fault localization family on these bugs.
Figure 11c also suggests that PS and ST are less effective on mutable bugs than on other kinds of bugs. Possibly, this is because mutable bugs tend to be more complex, "semantic" bugs, whereas ST works well only for "simple" crashing bugs, and PS is highly specialized to work on a narrow group of bugs.

Bug kind: efficiency. Table 6 does not suggest any consistent changes in the efficiency of FL families when they work on crashing, predicate-related, or mutable bugs, as opposed to all bugs. In other words, for every kind of bug: ST is orders of magnitude faster than SBFL, which is one order of magnitude faster than PS, which is 14-37% faster than MBFL. As discussed above, the kind of information that an FL technique collects is the main determinant of its overall efficiency; in contrast, different kinds of bugs do not seem to have any significant impact.

RQ4. Combining Techniques
Effectiveness. Table 8 clearly indicates that the combined FL techniques AvgFL and CombineFL achieve high effectiveness, especially according to the fundamental @n% metrics. CombineFL_A and AvgFL_A, combining the information from all other FL techniques, beat every other technique. For example, AvgFL_A localizes 18% of all bugs in the top position, CombineFL_A localizes 20% of all bugs, whereas the next-best technique is SBFL, which localizes 12% of all bugs (Table 5). CombineFL_S and AvgFL_S, combining the information from only SBFL and ST techniques, do at least as well as every other standalone technique. While CombineFL_A is strictly more effective than AvgFL_A, their difference is usually modest (at most three percentage points). Similarly, the difference between CombineFL_S, AvgFL_S, and SBFL is generally limited.

Comparisons between granularities. It is apparent that fault localization's absolute effectiveness strictly increases as we target coarser granularities, from statements, to functions, to modules. This happens simply because the number of entities at a coarser granularity is considerably smaller than the number of entities at a finer granularity: each function consists of several statements, and each module consists of several functions. Therefore, it does not make sense to directly compare the same effectiveness metric measured at two different granularity levels, since each granularity level refers to different entities, and inspecting different entities involves incomparable effort. We do not discuss efficiency (i.e., running time) in relation to granularity: the running time of our fault localization techniques does not depend on the chosen level of granularity, which only affects how the collected information is combined (see Section 2).

RQ6. Comparison to Java
Table 11 collects the main quantitative results for Python fault localization effectiveness that we presented in detail in previous parts of the paper, and displays them next to the corresponding results for Java. The results are selected so that they can be directly compared: they exclude any technique (e.g., Tarantula) or family (e.g., history-based fault localization) that was not experimented with in both our paper and Zou et al. [78]; and the rows about CombineFL were computed using [78]'s replication package so that they combine exactly the same techniques (DStar, Ochiai, Metallaxis, Muse, PS, and ST for CombineFL_A; and DStar, Ochiai, and ST for CombineFL_S). Table 11 reports, for each technique L, the percentage of all bugs it localized within the top-1, top-3, top-5, and top-10 positions of its output (L@1%, L@3%, L@5%, and L@10%), and its average exam score E(L); Python's data corresponds to the experiments discussed in the rest of the paper on the 135 bugs from BUGSINPY, Java's data is taken from Zou et al.'s empirical study [78] or computed from its replication package, and highlighted numbers denote each language's best technique according to each metric. Then, Table 12 lists all claims about fault localization made in our paper or in [78] that are within the scope of both papers, and shows which were confirmed or refuted for Python and for Java. Most of the findings (25/28) were confirmed consistently for both Python and Java. Thus, the big picture about the effectiveness and efficiency of fault localization is the same for Python programs and bugs as it is for Java programs and bugs.
There are, however, a few interesting discrepancies; let's discuss possible explanations for them. The most marked difference is about the effectiveness of ST, which was mediocre on Python programs but competitive on Java programs (row 3 in Table 12). We think the main reason for these differences is that there were more Java experimental subjects that were an ideal target for ST: 20 out of the 357 Defects4J bugs used in [78]'s experiments consisted of short failing methods whose programmer-written fixes entirely replaced or removed the method body. In these cases, the ground truth consists of all locations within the method; thus, ST easily ranks the fault location at the top by simply reporting all lines of the crashing method with the same suspiciousness. As a result, Table 11 shows that ST was consistently more effective than PS in the Java experiments, whereas there was no consistent difference between ST and PS in our Python experiments. For the same reason, the difference between Java and Python is even more evident on crashing bugs: ST outperformed all other techniques on such bugs in Java but not in Python (row 19 in Table 12). We still confirmed that ST works better on crashing bugs than on other kinds of bugs in Python as well, but the nature of our experimental subjects did not allow ST to reach an overall competitive effectiveness on crashing bugs.
Other findings about MBFL were different in Python compared to Java, but the differences were more nuanced in this case. In particular, Zou et al. found that the correlation between the effectiveness of SBFL and MBFL techniques is negligible, whereas we found a medium correlation (τ = 0.54). It is plausible that the discrepancy (reflected in Table 12's row 23) is simply a result of several details of how this correlation was measured: we use Kendall's τ, whereas they use the coefficient of determination r²; we use a generalized E_inspect measure I that applies to all bugs, whereas they exclude experiments where a technique completely fails to localize the bug; we compare the average effectiveness of SBFL vs. MBFL techniques, whereas they pairwise compare individual SBFL and MBFL techniques. Even if the correlation patterns were actually different between Python and Java, this would still have limited practical consequences: MBFL and SBFL techniques have clearly different characteristics, and hence they remain largely complementary. The same analysis applies to the other correlation discrepancy (reflected in Table 12's row 25): in Python, we found a medium correlation between the effectiveness of the Metallaxis and Muse MBFL techniques (τ = 0.62); in Java, Zou et al. found a negligible correlation.
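For reference, Kendall's τ can be computed as follows. This sketch implements the τ-a variant, which ignores ties; our analysis scripts may instead rely on a library implementation (such as SciPy's, which handles ties via τ-b).

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall rank correlation (tau-a) between two paired samples:
    the normalized difference between concordant and discordant pairs."""
    assert len(xs) == len(ys) and len(xs) > 1
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        product = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return (concordant - discordant) / (len(xs) * (len(xs) - 1) // 2)
```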
Finally, a clarification about the finding that "on predicate-related bugs, MBFL is about as effective as SBFL", which Table 12 reports as confirmed for both Python and Java. This claim hinges on the definition of "about as effective", which we rigorously introduced in Section 4.7.1. To clarify the comparison, Table 13 displays the Python and Java data about the effectiveness of MBFL and SBFL on predicate-related bugs. On Python predicate-related bugs (left part of Table 13), MBFL achieves better @3%, @5%, and @10% than SBFL but a worse @1% (by only one percentage point); similarly, on Java predicate-related bugs (right part of Table 13), MBFL achieves better @1%, @3%, and @5% than SBFL but a worse @10% (by three percentage points). In both cases, MBFL is not strictly better than SBFL, but one could argue that a clear tendency exists. Regardless of the definition of "more effective" (which can be arbitrary), the conclusions we can draw remain very similar in Python as in Java.

Threats to Validity
Construct validity refers to whether the experimental metrics adequately operationalize the quantities of interest. Since we generally used widely adopted and well-understood metrics of effectiveness and efficiency, threats of this kind are limited. The metrics of effectiveness are all based on the assumption that users of a fault localization technique process its output list of program entities in the order in which the technique ranked them. This model has been criticized as unrealistic [48]; nevertheless, these metrics remain the standard for fault localization studies, and hence are at least adequate to compare the capabilities of different techniques on different programs.
Using BUGSINPY's curated collection of Python bugs helps reduce the risks involved with our selection of subjects; as we detail in Section 4.1, we did not blindly reuse BUGSINPY's bugs, but first verified which bugs we could reliably reproduce on our machines.
Internal validity can be threatened by factors such as implementation bugs or inadequate statistics, which may jeopardize the reliability of our findings. We implemented the tool FAUXPY to enable large-scale experimentation with Python fault localization; we applied the usual best practices of software development (testing, incremental development, refactoring to improve performance and design, and so on) to reduce the chance that it contains fundamental bugs that affect our overall experimental results. To make it a robust and scalable tool, FAUXPY's implementation uses external libraries for tasks, such as coverage collection and mutant generation, for which high-quality open-source implementations are available.
The scripts that we used to process and summarize the experimental results may also include mistakes; we checked the scripts several times, and validated the consistency between different data representations.
We did our best to validate the test-selection process (described in Section 4.7), which was necessary to make the experiments with the largest projects feasible; in particular, we ran fault localization experiments on about 30 bugs without test selection, and checked that the results did not change after we applied test selection.
Our statistical analysis (Section 4.6) follows best practices [15], including validations and comparisons of the chosen statistical models (detailed in the replication package).To further help future replications and internal validity, we make available all our experimental artifacts and data in a detailed replication package.
External validity is about the generalizability of our findings. Using bugs from real-world open-source projects substantially mitigates the threat that our findings do not apply to realistic scenarios. Specifically, we analyzed 135 bugs in 13 projects from the curated BUGSINPY collection, which ensures a variety of bugs and project types.
As usual, we cannot make strong claims that our findings generalize to different application scenarios, or to different programming languages. Nevertheless, our study successfully confirmed a number of findings about fault localization in Java [78] (see Section 5.6), which further mitigates any major threats to external validity.
Zou et al.'s study used the Defects4J [28] curated collection of real-world Java faults as their experimental subjects; we used the BUGSINPY [67] curated collection of real-world Python faults. This invariably limits the generalizability of our findings to all Python programs, and the generalizability of our comparison to all Python vs. Java programs: the two curated collections of bugs may not represent all programs and faults in Python or Java. While there is always a risk that any selection of experimental subjects is not fully representative of the whole population, choosing standard well-known benchmarks such as Defects4J and BUGSINPY helps mitigate this threat. First, BUGSINPY was explicitly inspired by Defects4J, and was built following a very similar approach but applied to real-world open-source Python programs. Second, BUGSINPY projects were "selected as they represent the diverse domains [. . .] that Python is used for" [67, Sec. 1], which bodes well for generalizability. Third, BUGSINPY and Defects4J are extensible frameworks, which have been and will be extended with new projects and bugs; thus, using them as the basis of FL studies helps make future research in this area comparable to previous results. While BUGSINPY and Defects4J are only imperfect proxies for a fully general comparison of FL in Java and Python, they are a sensible basis given the current state of the art.

Conclusions
This paper described an extensive empirical study of fault localization in Python, based on a differentiated conceptual replication of Zou et al.'s recent Java empirical study [78]. Besides replicating for Python several of their results for Java, we shed light on some nuances, and released detailed experimental data that can support further replications and analyses.
As a concluding discussion, let us highlight a few points relevant for possible follow-up work. Section 6.1 discusses a different angle for a comparison with other studies, suggested by Widyasari et al.'s recent work [68]. Section 6.2 describes broader ideas to improve the capabilities of fault localization in Python.

Other Fault Localization Studies
As we discussed in Section 3, Widyasari et al.'s recent work [68] is the only other large-scale study targeting fault localization in real-world Python projects. We also explained how our study's goals and methodology are quite different from theirs; as a result, we cannot directly compare most of their findings to ours. Now that we have presented our results in detail, we are in a better position to discuss how Widyasari et al.'s methodology suggests future work that complements our own. Widyasari et al. directly compare FL effectiveness metrics (such as the exam score) between their experiments on Python subjects from BUGSINPY and Pearson et al.'s experiments on Java subjects from Defects4J [49]. Table 14a displays the key results of their comparison, alongside a roughly similar comparison between our experiments on Python subjects from BUGSINPY and Zou et al.'s experiments on Java subjects from Defects4J [78]. The picture that emerges from these comparisons is somewhat inconclusive: in our comparison, there is a significant difference, with large effect size, between Python and Java with respect to exam scores, but not with respect to the E_inspect metric; conversely, in their comparison, there is a significant difference, with large/medium effect size, between Python and Java with respect to the top-k ranks in the best-case debugging scenarios (roughly analogous to the E_inspect ranking metric), whereas the differences with respect to exam scores are significant but with small effect sizes. Furthermore, the signs of the effect sizes are opposite: in our comparison, fault localization is more effective on Python programs (negative effect sizes); in their comparison, it is more effective on Java programs (positive effect sizes). It is plausible to surmise that these inconsistencies reflect differences between the effectiveness metrics, how they are measured in each study, and, most important, differences between the experimental subjects; the exam score metric, in particular, also depends on the size of the programs under analysis. As we discussed in Section 5.7, even though both benchmarks BUGSINPY and Defects4J are carefully curated and of significant size, there is a risk that they do not represent all Python and Java real-world projects and their faults. This suggests that follow-up studies targeting different projects in Python and Java (or different selections of projects from BUGSINPY and Defects4J) could help validate the generalizability of any results. Conversely, applying stricter project and bug selection criteria could also be useful not to generalize findings, but to strengthen their validity in more specific settings (for example, with projects of certain characteristics). Without stricter experimental controls, directly comparing fault localization effectiveness metrics on heterogeneous programs in two different programming languages, as we did in Table 14a for the sake of illustration, is unlikely to lead to clear-cut, robust findings.
Even though Widyasari et al.'s study found some statistically significant differences in effectiveness between SBFL techniques, those differences tend to be modest or insignificant. As shown in Table 14b, this is largely consistent with our findings: even though we found some weakly statistically significant differences between SBFL techniques (between DStar and Tarantula for p < 0.1, and between Ochiai and Tarantula for p < 0.06), these have little practical consequence, as the effect sizes of the differences are vanishingly small.
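Effect sizes of this kind (the paper uses Cliff's delta; see the captions of Table 14) are straightforward to compute. The following sketch implements Cliff's delta and applies it to hypothetical exam scores, not the paper's data:

```python
def cliffs_delta(xs, ys):
    """Cliff's delta effect size: (#{x > y} - #{x < y}) / (|xs| * |ys|),
    over all pairs (x, y); ranges from -1 to +1, 0 meaning full overlap."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

# Hypothetical exam scores of two SBFL techniques on the same four bugs:
tarantula_scores = [0.02, 0.05, 0.10, 0.01]
ochiai_scores = [0.02, 0.04, 0.10, 0.01]
d = cliffs_delta(ochiai_scores, tarantula_scores)
# by the usual convention, |d| < 0.147 is a negligible effect
```

With such nearly identical score vectors, |d| is close to zero, which is the "vanishingly small" situation described above.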
Our study did not consider two dimensions of analysis that play an important role in Widyasari et al.'s study: different debugging scenarios, and a classification of faults according to their syntactic characteristics. Debugging scenarios determine how we classify a fault as localized when it affects multiple lines. In our paper, we only considered the "best-case" scenario: as long as any of the ground-truth locations is localized, we consider the fault localized. Widyasari et al. also consider other scenarios, such as the worst-case scenario (all ground-truth locations must be localized). While they did not find any significant differences in their various findings under different debugging scenarios, investigating the robustness of our empirical findings in different scenarios remains a viable direction for future work.

Table 14: A summary of some data presented in Widyasari et al.'s fault localization study [68] vis-à-vis analogous data presented in this paper. The data for this paper is taken from our experiments and from Zou et al.'s replication package [78]; to reflect the behavior on all bugs in these statistics, bugs that were not localized are assigned an E_inspect rank and an exam score of −1 (unlike in the rest of the paper, where these values are undefined). The statistics of [68] (in the rightmost columns) are taken from its Table 5 (exam scores, which they compute based on their top-k ranks) and from its Table 14.

Future Work
One of the dimensions of analysis that we included in our empirical study was the classification of projects (and their bugs) into categories, which led to the finding that faults in data science projects tend to be harder and take longer to localize. This is not a surprising finding if we consider the sheer size of some of these projects (and of their test suites). However, it also highlights an important category of projects that are much more popular in Python than in more "traditional" languages like Java. In fact, much of the exploding popularity of Python in the last decade has been connected to its many usages for statistics, data analysis, and machine learning. Furthermore, there is growing evidence that these applications have distinctive characteristics, especially when it comes to faults [22, 19, 53]. Thus, investigating how fault localization can be made more effective for certain categories of projects is an interesting direction for future work (related approaches are briefly discussed in Section 3).
It is remarkable that SBFL techniques, proposed nearly two decades ago [26], still remain formidable in terms of both effectiveness and efficiency. As we discussed in Section 3, MBFL was introduced expressly to overcome some limitations of SBFL. In our experiments (similarly to experiments on Java projects [78]), MBFL performed generally well but not always on par with SBFL; furthermore, MBFL is much more expensive to run than SBFL, which may put its practical applicability into question. Our empirical analysis of "mutable" bugs (Section 5.3) indicated that MBFL usually loses to SBFL when its mutation operators are not applicable to the faulty statements (which happened for nearly half of the bugs we used in our experiments); in these cases, the mutation analysis does not bring relevant information about the faulty parts of the program. These observations raise the questions of whether it is possible to predict the effectiveness of MBFL based on preliminary information about a failure, and whether one can develop new mutation operators that extend the practical capabilities of MBFL to new kinds of bugs. More generally, one could try to relate the various kinds of source-code edits (add, remove, modify) [60] introduced to fix a fault to the effectiveness of different fault localization algorithms. We leave answering these questions to future research in this area.
Tarantula's suspiciousness formula is:

T(e) = (F⁺(e)/|F|) / (F⁺(e)/|F| + P⁺(e)/|P|)    (1)

Unique bugs. Each bug b = ⟨p⁻_b, p⁺_b, F_b, P_b⟩ in BUGSINPY consists of: i) a faulty version p⁻_b of the project, such that the tests in F_b all fail on it (all due to the same root cause); ii) a fixed version p⁺_b of the project, such that all tests in F_b ∪ P_b pass on it; iii) a collection of failing tests F_b and passing tests P_b, such that the tests in P_b pass on both the faulty version p⁻_b and the fixed version p⁺_b, whereas the tests in F_b fail on the faulty version p⁻_b and pass on the fixed version p⁺_b.
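Formula (1) can be computed directly from per-entity coverage counts; a minimal sketch (the function and parameter names are ours, purely illustrative):

```python
def tarantula(cov_failed, cov_passed, total_failed, total_passed):
    """Tarantula suspiciousness of a program entity e, per formula (1):
    cov_failed / cov_passed are the numbers of failing / passing tests
    covering e; total_failed / total_passed are |F| and |P|.
    Returns a score in [0, 1]."""
    fail_ratio = cov_failed / total_failed
    pass_ratio = cov_passed / total_passed
    if fail_ratio + pass_ratio == 0.0:
        return 0.0  # entity covered by no test: not suspicious
    return fail_ratio / (fail_ratio + pass_ratio)
```

An entity covered by every failing test and no passing test gets the maximal score 1.0; one covered by the same fraction of failing and passing tests gets the neutral score 0.5.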
The programmer-written fix p⁺_b consists of a series of edits to the faulty program p⁻_b. Each edit can be of three kinds: i) add, which inserts into p⁺_b a new program location; ii) remove, which deletes a program location of p⁻_b; iii) modify, which takes a program location of p⁻_b and changes parts of it, without changing its location, in p⁺_b. Take, for instance, the program in Figure 3b, which modifies the program in Figure 3a; the edited program includes two adds (lines 22, 31), one remove (line 35), and one modify (line 28). Bug b's ground truth F(b) is the set of locations in p⁻_b that are affected by the edits, determined as follows.
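The three kinds of edits can be recovered mechanically from a line diff of the two program versions. A sketch using the standard library's difflib (the paper's actual ground-truth extraction may differ in details):

```python
import difflib

def classify_edits(faulty_lines, fixed_lines):
    """Classify the edits turning the faulty version into the fixed one
    as add / remove / modify; returns (kind, start, end) tuples, where
    start/end index lines of the fixed version for adds and of the
    faulty version for removes and modifies."""
    edits = []
    sm = difflib.SequenceMatcher(a=faulty_lines, b=fixed_lines)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "insert":
            edits.append(("add", j1, j2))     # new locations in the fix
        elif tag == "delete":
            edits.append(("remove", i1, i2))  # locations deleted from p-_b
        elif tag == "replace":
            edits.append(("modify", i1, i2))  # locations changed in place
    return edits
```

For example, changing the middle line of a three-line program is reported as a single modify, while inserting a line is reported as an add.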
(a) Faulty program version; lines with colored background are the ground-truth locations (extra blank lines are added for readability). (b) Fixed program version, which edits Figure 3a's program with two adds, one modify, and one remove.

Figure 3: An example of a program edit, and the corresponding ground-truth faulty locations.

Figure 5: Definitions of common FL effectiveness metrics. The top row shows two variants I, Ī of the E_inspect metric, and the exam score E, for a generic bug b and fault localization technique L. The bottom row shows cumulative metrics for a set B of bugs: the "at n" metric L@_B n, and the average Ī and Ē metrics.

E_inspect effectiveness. Building on the notion of rank, defined in Section 4.4, we measure the effectiveness of a fault localization technique L on a bug b as the rank of the first faulty program entity in the list L(b) = ⟨ℓ_1, s_1⟩ … ⟨ℓ_n, s_n⟩ of entities and suspiciousness scores returned by L running on b; this is defined as I_b(L) in (7). I_b(L) is L's E_inspect rank on bug b, which estimates the number of entities in L's output list one has to inspect to correctly localize b.

4.6 Comparison: Statistical Models

To quantitatively compare the capabilities of different fault localization techniques, we consider several standard statistics.

Pairwise comparisons. Let M_b(L) be any metric M measuring the capabilities of fault-localization technique L on bug b; M can be any of Section 4.5's effectiveness metrics, or L's wall-clock running time T_b(L) on bug b as performance metric. Similarly, for a fault-localization family F, M_b(F) denotes the average value Σ_{k∈F} M_b(k)/|F| of M_b over all techniques in family F. Given a set B = {b_1, …, b_n} of bugs, we compare the two vectors M_B(F_1) and M_B(F_2) using three statistics: Correlation τ between M_B(F_1) and M_B(F_2), computed using Kendall's τ statistic. The absolute value |τ| of the correlation measures how closely changes in the value of metric M for F_1 over different bugs are associated with changes for F_2 over the same bugs: if 0 ≤ |τ| ≤ 0.3 the correlation is negligible; if 0.3 < |τ| ≤ 0.5 it is weak; if 0.5 < |τ| ≤ 0.7 it is medium; and if 0.7 < |τ| ≤ 1 it is strong. P-value p of a paired Wilcoxon signed-rank test, a nonparametric statistical test comparing M_B(F_1) and M_B(F_2). Effect size of the difference between M_B(F_1) and M_B(F_2), measured using Cliff's delta.
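The Kendall correlation used in these pairwise comparisons can be sketched in a few lines. The following implements the basic tau-a variant (a simplification: a statistics library would typically apply tie corrections, as in tau-b):

```python
def kendall_tau(xs, ys):
    """Kendall's tau-a over paired observations (no tie correction):
    (#concordant - #discordant) / (n choose 2)."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1    # pair ordered the same way in both
            elif s < 0:
                discordant += 1    # pair ordered oppositely
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Two families whose per-bug metrics always move together yield τ = 1 (strong correlation); opposite movements yield τ = −1.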
Model (9) expresses the standardized E_inspect metric E_b and running time T_b as drawn from a multivariate normal distribution whose means e_b and t_b are log-linear functions of various predictors. Namely, log(e_b) is the sum of a base intercept α; a family-specific intercept α_family[b], for each fault-localization family SBFL, MBFL, PS, and ST; and a category-specific intercept α_category[b], for each project category CL, DEV, DS, and WEB. The other model component log(t_b) follows the same log-linear relation. Model (10) is univariate, since it only captures the relation between bug kinds and effectiveness. For each fault localization experiment on a bug b, (10) expresses the standardized E_inspect metric E_b as drawn from a normal distribution whose mean e_b is a log-linear function of: a base intercept α; a family-specific intercept α_family[b]; a category-specific intercept α_category[b]; a varying intercept c_family[b] · crashing_b, for the interactions between each family and crashing bugs; a varying intercept p_family[b] · predicate_b, for the interactions between each family and predicate bugs; and a varying slope m_family[b] · log(1 + mutability_b).

Finding 1.3: PS and ST are specialized fault localization techniques, which may work well only on a small subset of bugs, and thus often return short lists of suspicious locations.

Figure 6: Pairwise visual comparison of the four FL families for effectiveness. Each point in the scatterplot at the row labeled R and column labeled C has coordinates (x, y), where x is the generalized E_inspect rank I_b(C) of FL techniques in family C and y is the rank I_b(R) of FL techniques in family R on the same bug b. Thus, points below (resp. above) the diagonal line denote bugs on which R had better (resp. worse) E_inspect ranks. Points are colored according to the bug's project category. The opposite box at the row labeled C and column labeled R displays three statistics (correlation, p-value, and effect size; see Section 4.6) quantitatively comparing the same average generalized E_inspect ranks of C and R; negative values of the effect size mean that R tends to be better, and positive values mean that C tends to be better. Each bar plot on the diagonal at row F, column F is a histogram of the distribution of I_b(F) for all bugs. The horizontal axes of all diagonal plots have the same E_inspect scale as the bottom-right plot's (SBFL); their vertical axes have the same 0-100% scale as the top-left plot's (MBFL).

Finding 1.5: All techniques in the SBFL family achieve very similar effectiveness.

Figure 7: Point estimates (boxes) and 95% probability intervals (lines) for the regression coefficients of model (9). The scale of the vertical axes is in standard-deviation log-units.

Finding 2.1: Standalone fault localization families ordered by efficiency: ST ≫ SBFL ≫ PS > MBFL, where > means faster, and ≫ much faster. (As we discuss at the end of Section 5.2, these results are largely expected given how the different fault localization techniques work algorithmically.)

Figure 10's scatterplots confirm that ST outperforms all other techniques, and that SBFL is generally the second fastest. They also show that MBFL and PS have similar overall performance, but each can be slower or faster on different bugs: a narrow majority of points lies below the diagonal line in the scatterplot (meaning that PS is faster than MBFL).

Figure 10: Pairwise visual comparison of the four FL families for efficiency. Each point in the scatterplot at the row labeled R and column labeled C has coordinates (x, y), where x is the average per-bug wall-clock running time of FL techniques in family C and y is the average per-bug wall-clock running time of FL techniques in family R. Points are colored according to the bug's project category. The opposite box at the row labeled C and column labeled R displays three statistics (correlation, p-value, and effect size; see Section 4.6) quantitatively comparing the same per-bug average running times of C and R; negative values of the effect size mean that R tends to be better, and positive values that C tends to be better.

Finding 3.1: Bugs in data science projects challenge fault localization's effectiveness (that is, they are harder to localize correctly) more than bugs in other categories of projects.

Finding 3.2: SBFL remains the most effective standalone fault localization family on all categories of projects.
Estimates and 95% probability intervals for the coefficients c_family, p_family, and m_family in model (10), for each FL family MBFL, PS, SBFL, and ST.

Finding 3.6: PS is the least effective on crashing bugs.

Finding 3.8: On the few bugs that it can analyze successfully, PS is the most effective standalone fault localization technique.

Finding 3.10: PS and ST are less effective on mutable bugs than on other kinds of bugs.

Finding 3.11: The relative efficiency of each fault localization family does not depend on the kinds of bugs that are analyzed.

Finding 4.1: Combined fault localization techniques AvgFL_A and CombineFL_A, which combine all baseline techniques, achieve better effectiveness than any other technique.

Finding 6.1: Our experiments confirmed for Python programs most of Zou et al. [78]'s findings about fault localization techniques on Java programs.
This fault localization technique identifies critical predicates: branching conditions (such as those of if and while statements) that are related to a program's failure. PS's key idea is that if forcibly changing a predicate's value turns a failing test into a passing one, the predicate's location is a suspicious program entity.
For each failing test t ∈ F, PS starts from t's execution trace (the sequence of all statements executed by t), and finds its subsequence b^t_1 b^t_2 … of branching statements. Then, by instrumenting the program p under analysis, it generates, for each branching statement b^t_k, a new execution of t where the predicate (branching condition) c^t_k evaluated by statement b^t_k is forcibly switched (negated) at runtime (that is, the new execution takes the other branch at b^t_k). If switching predicate c^t_k makes the test execution pass, then c^t_k is a critical predicate. Finally, PS assigns a (positive) suspiciousness score to all critical predicates in all tests F: PS_F(c^t_k) is higher the fewer critical predicates are evaluated between c^t_k and the failure location when executing t ∈ F [78].
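The per-test switching loop just described can be sketched in a few lines; here a stub run_test stands in for the instrumented re-execution, and all names are ours, purely illustrative:

```python
def find_critical_predicates(run_test, num_evaluations):
    """Predicate switching, sketched. run_test(switch_at) re-runs one
    failing test with the switch_at-th predicate evaluation forcibly
    negated (None = run unmodified) and returns True iff the test passes.
    Evaluations whose switch makes the failing test pass are critical."""
    assert run_test(None) is False  # PS only analyzes failing tests
    return [k for k in range(num_evaluations) if run_test(k)]

def make_run_test():
    """A tiny instrumented program: classify(0) should return "pos"
    (the correct predicate is x >= 0), so the test fails as written."""
    def run_test(switch_at):
        counter = [0]  # index of the next predicate evaluation

        def pred(value):
            k = counter[0]
            counter[0] += 1
            return (not value) if k == switch_at else value

        def classify(x):  # faulty: the predicate should be x >= 0
            return "pos" if pred(x > 0) else "nonpos"

        return classify(0) == "pos"  # the failing test
    return run_test

critical = find_critical_predicates(make_run_test(), num_evaluations=1)
```

Switching the single predicate evaluation makes the failing test pass, so that evaluation is reported as critical, which matches the faulty location.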
Are fault localization techniques as effective on Python programs as they are on Java programs?

… bugs in category DEV, 42 bugs in category DS, and 20 bugs in category WEB. This gives us a still sizable, balanced, and representative sample of all bugs in BUGSINPY, which we could exhaustively analyze in around two CPU-months.

Table 1: Overview of projects in BUGSINPY. For each PROJECT, the table reports the project's overall size in KLOC (thousands of non-empty, non-comment lines of code, excluding tests), the number |F| of functions (excluding test functions), the number |M| of modules (excluding test modules), the number of BUGS included in BUGSINPY, how many we selected as SUBJECTS for our experiments, the corresponding number of TESTS (i.e., test functions), their size in TEST KLOC (thousands of non-empty, non-comment lines of test code), the CATEGORY the project belongs to (CL: command line; DEV: development tools; DS: data science; WEB: web tools), and a brief DESCRIPTION of the project. Consistently with what was done by the authors of BUGSINPY

Table 2: Selected BUGSINPY bugs used in the paper's experiments. The PROJECTs are grouped by CATEGORY; the table reports, for each project individually (column P) as well as for all projects in the category (column C), the number of BUGS selected as SUBJECTS for our experiments, the corresponding number of TESTS (i.e., test functions), and the total number of program locations that make up the GROUND TRUTH (described in Section 4.2).

Table 3: An example of calculating the E_inspect metric I_b(ℓ, ⟨ℓ_1, s_1⟩ … ⟨ℓ_n, s_n⟩) for a list of 10 suspicious locations ℓ_1, …, ℓ_10 ordered by their decreasing suspiciousness scores s_1, …, s_10. For each location ℓ, the table reports its suspiciousness score s, and whether ℓ is a faulty location ℓ ∈ F(b); based on this ranking of locations, it also shows the lowest rank start(ℓ) of the first location whose score is equal to ℓ's, the number ties(ℓ) of locations whose score is equal to ℓ's, the number of faulty locations among these, and the corresponding E_inspect value I_b(ℓ, L), computed according to (6).

Figure 4: Classification of the 135 BUGSINPY bugs used in our experiments into three categories.

4.3 Classification of Faults

Bug kind. The information used by each fault localization technique naturally captures the behavior of different kinds of faults. Stack-trace fault localization analyzes the call stack after a program terminates with a crash; predicate switching targets branching conditions as program entities to perform fault localization; and MBFL crucially relies on the analysis of mutants to track suspicious locations. Correspondingly, we classify a bug b = ⟨p⁻_b, p⁺_b, F_b, P_b⟩ as: Crashing bug if any failing test in F_b terminates abruptly with an unexpected uncaught exception. Predicate bug if any faulty entity in the ground truth F(b) includes a branching predicate (such as an if or while condition). Mutable bug if any of the mutants generated by MBFL's mutation operators mutates any location in the ground truth F(b). Precisely, a bug b's mutability is the percentage of all mutants of p⁻_b that mutate locations in F(b); and b is mutable if its mutability is greater than zero.
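The tie-handling just described (ranks start(ℓ) and ties(ℓ), with faulty and non-faulty locations sharing a suspiciousness score) corresponds to the expected number of inspected entities under uniformly random tie-breaking. A minimal sketch of one common closed form (our naming; it assumes the list contains at least one faulty entity):

```python
def e_inspect(scores, faulty):
    """Expected number of entities inspected before reaching a faulty one,
    breaking ties uniformly at random. scores: suspiciousness per entity
    (higher = more suspicious); faulty: set of faulty entity indices."""
    s_star = max(scores[i] for i in faulty)         # best faulty score
    start = sum(1 for s in scores if s > s_star) + 1  # rank of first tie
    m = sum(1 for s in scores if s == s_star)       # entities tied at s*
    f = sum(1 for i in faulty if scores[i] == s_star)  # faulty among ties
    # expected non-faulty ties inspected before the first faulty: (m-f)/(f+1)
    return start + (m - f) / (f + 1)
```

For instance, with scores [0.9, 0.5, 0.5, 0.1] and the single faulty entity at index 1, the faulty location is tied with one other at ranks 2-3, so the expected rank is 2.5.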
… the number |B| of bugs in B. Ī_B(L) is L's average generalized E_inspect rank over the bugs in B; and Ē_B(L) is L's average exam score over the bugs in B (thus ignoring bugs that L cannot localize).

List length. The |L_b| metric is simply the number of suspicious locations output by FL technique L when run on bug b; and |L_B| is the average of |L_b| over all bugs in B. The location list length metric is not, strictly speaking, a measure of effectiveness; rather, it complements the information provided by other measures of effectiveness, as it gives an idea of how much output a technique produces for the user. All else being equal, a shorter location list is preferable, provided it is not empty. In practice, we compare the location list length to other metrics of effectiveness, in order to better understand the trade-offs offered by each FL technique.

Table 4: Tests used in the fault localization experiments with the bugs of Table 2. Following the procedure described in Section 4.7, we selected s_b tests out of the t_b BUGSINPY tests for each bug b among the 135 bugs used in our experiments. For each PROJECT, the table reports the MINimum, MEDIAN, MEAN, and MAXimum percentage 100 · s_b/t_b % of selected tests among the bugs b in the project (columns P); similarly, columns # report the same statistics (MINimum, MEDIAN, MEAN, and MAXimum) of the number s_b of selected tests among all bugs b in the project. Finally, columns C report the same statistics among all bugs in projects of the same CATEGORY; and the bottom row reports the overall statistics among all 135 bugs.

Table 5: Effectiveness of standalone fault localization techniques at the statement-level granularity on all 135 selected bugs B. Each row reports a TECHNIQUE L's average generalized E_inspect rank Ī_B(L)

Table 6: Efficiency of fault localization techniques at the statement-level granularity. Each row reports a TECHNIQUE L's per-bug average wall-clock running time T_X(L) in seconds on: ALL 135 bugs selected for the experiments (X = B); CRASHING, PREDICATE-related, and MUTABLE bugs; and bugs in projects of category CL, DEV, DS, and WEB (see Section 4.3). The running time is the same for all techniques of the same FAMILY. Highlighted numbers denote the fastest technique for bugs in each group.

Table 9: Effectiveness of fault localization techniques at the function-level granularity on all 135 selected bugs B. The table reports the same metrics as Table 5 and Table 8, but targeting functions as suspicious entities. Highlighted numbers denote the best technique according to each metric.

Table 10: Effectiveness of fault localization techniques at the module-level granularity on all 135 selected bugs B. The table reports the same metrics as Table 5 and Table 8, but targeting modules (files in Python) as suspicious entities. Highlighted numbers denote the best technique according to each metric.

ST is more effective than PS both at the function-level and at the module-level granularity; however, it remains considerably less effective than the other fault localization techniques even at these coarser granularities.

Table 11: Effectiveness of fault localization techniques in Python and Java. Each row reports a TECHNIQUE L's effectiveness in Python and in Java.

CombineFL_A > CombineFL_S > SBFL > MBFL ≫ PS, ST
SBFL ≥ CombineFL_S > PS > MBFL > CombineFL_A

Table 12: Comparison of findings about fault localization in Python vs. Java. Each row lists a FINDING discussed in the present paper or in Zou et al. [78], whether the finding was confirmed or refuted for PYTHON and for JAVA, and the reported evidence that confirms or refutes it (a reference to a numbered finding, figure, or table in our paper or in [78]).

Table 13: Comparison of MBFL's and SBFL's effectiveness on Python and Java predicate-related bugs. The left part of the table reports a portion of the same data as Table 7: each column @k% reports the average percentage of the 52 predicate bugs in BUGSINPY Python projects used in our experiments that techniques in the MBFL or SBFL family ranked within the top-k. The right part of the table averages some of the data in [78, Table 5] by family: each column @k% reports the average percentage of the 115 predicate bugs in Defects4J Java projects used in Zou et al.'s experiments that techniques in the MBFL or SBFL family ranked within the top-k. Highlighted numbers denote each language's best family according to each metric.
(a) Comparison of SBFL techniques on Python vs. Java programs. Each row compares the same SBFL TECHNIQUE L applied to PYTHON and to JAVA programs, reporting the p-value of a Wilcoxon rank-sum test and Cliff's delta EFFECT size; a letter gives a qualitative assessment of the effect size: N for negligible, S for small, M for medium, and L for large. The data for THIS PAPER is each technique L's exam score E(L) and E_inspect rank I(L) for each bug among all 135 Python bugs used in the rest of the paper's experiments, and for each Java bug in Zou et al.'s replication package data.

The data for [68] is taken from its Table 3 (best-case debugging scenario top-k ranks). (b) Pairwise comparison of SBFL techniques according to exam score. Each row compares the exam scores of two TECHNIQUEs L_1 and L_2 for significant differences, reporting the p-value of a Wilcoxon signed-rank test and Cliff's delta EFFECT size; a letter gives a qualitative assessment of the effect size: N for negligible, S for small, M for medium, and L for large. The data for THIS PAPER is each technique L's exam score E(L) for each bug among all 135 Python bugs used in the rest of the paper's experiments; to reflect the behavior on all bugs in these statistics, bugs that were not localized are assigned an exam score of −1.