1 Introduction

It is commonplace that debugging is an activity that takes up a disproportionate amount of time and resources in software development (McConnell 2004). This also explains the popularity of fault localization as a research subject in software engineering: identifying locations in a program’s source code that are implicated in some observed failures (such as crashes or other kinds of runtime errors) is a key step of debugging. This paper contributes to the empirical knowledge about the capabilities of fault localization techniques, targeting the Python programming language.

Despite the massive amount of work on fault localization (see Section 3) and the popularity of the Python programming language,Footnote 1Footnote 2 most empirical studies of fault localization target languages like Java or C. This leaves open the question of whether Python’s characteristics—such as the fact that it is dynamically typed, or that it is dominant in certain application domains such as data science—affect the capabilities of classic fault localization techniques—developed and tested primarily on different kinds of languages and programs.

This paper fills this knowledge gap: to our knowledge, it is the first multi-family large-scale empirical study of fault localization in real-world Python programs. The starting point is Zou et al.’s recent extensive study (Zou et al. 2021) of fault localization for Java. This paper’s main contribution is a differentiated conceptual replication (Juzgado and Gómez 2010) of Zou et al.’s study, sharing several of its features: i) it experiments with several different families (spectrum-based, mutation-based, predicate switching, and stack-trace-based) of fault localization techniques; ii) it targets a large number of faults in real-world projects (135 faults in 13 projects); iii) it studies fault localization effectiveness at different granularities (statement, function, and module); iv) it considers combinations of complementary fault localization techniques. The fundamental novelty of our replication is that it targets the Python programming language; furthermore, i) it analyzes fault localization effectiveness on different kinds of faults and different categories of projects; ii) it estimates the contributions of different fault localization features by means of regression statistical models; iii) it compares its main findings for Python to Zou et al.’s (2021) for Java.

The main findings of our Python fault localization study are as follows:

  1. Spectrum-based fault localization techniques are the most effective, followed by mutation-based fault localization techniques.

  2. Predicate switching and stack-trace fault localization are considerably less effective, but they can work well on small sets of faults that match their characteristics.

  3. Stack-trace is by far the fastest fault localization technique; predicate switching and mutation-based fault localization are the most time consuming.

  4. Bugs in data-science related projects tend to be harder to localize than those in other categories of projects.

  5. Combining fault localization techniques boosts their effectiveness with only a modest hit on efficiency.

  6. The main findings about relative effectiveness still hold at all granularity levels.

  7. Most of Zou et al. (2021)’s findings about fault localization in Java carry over to Python.

A practical challenge in carrying out a large-scale fault localization study of Python projects was that, at the time of writing, there were no open-source tools supporting a variety of fault localization techniques for this programming language. Thus, to perform this study, we implemented FauxPy: a fault-localization tool for Python that supports seven fault localization techniques in four families, is highly configurable, and works with the most common Python unit testing frameworks (such as Pytest and Unittest). The present paper is not a paper about FauxPy, which we plan to present in detail in a separate publication. Nevertheless, we briefly discuss the key features of FauxPy, and make the tool available as part of this paper’s replication package—which also includes all the detailed experimental artifacts and data that support further independent analysis and replicability.

The rest of the paper is organized as follows. Section 2 presents the fault localization techniques that fall within the scope of the empirical study, and outlines FauxPy’s features. Section 3 summarizes the most relevant related work in fault localization, demonstrating how Python is underrepresented in this area. Section 4 presents in detail the paper’s research questions, and the experimental methodology that we followed to answer them. Section 5 details the experimental results for each investigated research question, and presents any limitations and threats to the validity of the findings. Section 6 concludes with a high-level discussion of the main results, and of possible avenues for future work.

Replication Package

For reproducibility, all experimental artifacts of this paper’s empirical study, and the implementation of the FauxPy tool, are available:

https://doi.org/10.6084/m9.figshare.23254688

2 Fault Localization and FauxPy

Fault localization techniques (Zeller 2009; Wong et al. 2016) relate program failures (such as crashes or assertion violations) to faulty locations in the program’s source code that are responsible for the failures. Concretely, a fault localization technique L assigns a suspiciousness score \(L_T(e)\) to any program entity e—usually, a location, function, or module—given test inputs T that trigger a failure in the program. The suspiciousness score \(L_T(e)\) should be higher the more likely e is the location of a fault that is ultimately responsible for the failure. Thus, a list of all program entities \(e_1, e_2, \ldots \) ordered by decreasing suspiciousness score \(L_T(e_1) \ge L_T(e_2) \ge \ldots \) is fault localization technique L’s overall output.

Let \(T = P \cup F\) be a set of tests partitioned into passing P and failing F, such that \(F \ne \varnothing \)—there is at least one failing test—and all failing tests originate in the same fault. Tests T and a program p are thus the target of a single fault localization run. Then, fault localization techniques differ in what kind of information they extract from T and p to compute suspiciousness scores. A fault localization family is a group of techniques that combine the same kind of information according to different formulas. Sections 2.1–2.4 describe four common FL families, which comprise a total of seven FL techniques. As Section 2.5 further explains, a FL technique’s granularity denotes the kind of program entities it analyzes for suspiciousness—from individual program locations to functions or files/modules. Some FL techniques are only defined for a certain granularity level, whereas others can be applied to different granularities.

While FL techniques are usually applicable to any programming language, we could not find any comprehensive implementation of the most common fault localization techniques for Python at the time of writing. Therefore, we implemented FauxPy—an automated fault localization tool for Python implementing several widely used techniques—and used it to perform the empirical study described in the rest of the paper. Section 2.6 outlines FauxPy’s main features and some details of its implementation.

2.1 Spectrum-Based Fault Localization

Techniques in the spectrum-based fault localization (SBFL) family compute suspiciousness scores based on a program’s spectra (Reps et al. 1997)—in other words, its concrete execution traces. The key heuristics of SBFL techniques is that a program entity’s suspiciousness is higher the more often the entity is covered (reached) by failing tests and the less often it is covered by passing tests. The various techniques in the SBFL family differ in what formula they use to assign suspiciousness scores based on an entity’s coverage in passing and failing tests.

Fig. 1: SBFL formulas to compute the suspiciousness score of an entity e given tests \(T = P \cup F\) partitioned into passing P and failing F. All formulas compute a score that is higher the more failing tests \(F^+(e)\) cover e, and lower the more passing tests \(P^+(e)\) cover e—capturing the basic heuristics of SBFL

Given tests \(T = P \cup F\) as above, and a program entity e: i) \(P^+(e)\) is the number of passing tests that cover e; ii) \(P^-(e)\) is the number of passing tests that do not cover e; iii) \(F^+(e)\) is the number of failing tests that cover e; iv) and \(F^-(e)\) is the number of failing tests that do not cover e. Figure 1 shows how Tarantula (Jones and Harrold 2005), Ochiai (Abreu et al. 2007), and DStar (Wong et al. 2014)—three widely used SBFL techniques (Pearson et al. 2017)—compute suspiciousness scores given this coverage information. DStar’s formula (3), in particular, takes the second power of the numerator, as recommended by other empirical studies (Zou et al. 2021; Wong et al. 2014).Footnote 3
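
As a concrete illustration, the following is a minimal Python sketch of the three formulas, computed directly from the four coverage counts defined above. It is not FauxPy’s implementation, and the handling of zero denominators is our own convention.

    from math import sqrt

    def tarantula(p_cov, p_not, f_cov, f_not):
        # Ratio of failing-coverage ratio over failing plus passing coverage ratios.
        fail_ratio = f_cov / (f_cov + f_not) if f_cov + f_not else 0.0
        pass_ratio = p_cov / (p_cov + p_not) if p_cov + p_not else 0.0
        return fail_ratio / (fail_ratio + pass_ratio) if fail_ratio + pass_ratio else 0.0

    def ochiai(p_cov, f_cov, f_not):
        denom = sqrt((f_cov + f_not) * (f_cov + p_cov))
        return f_cov / denom if denom else 0.0

    def dstar(p_cov, f_cov, f_not, star=2):
        # DStar with exponent 2, as in our experiments.
        denom = p_cov + f_not
        return f_cov ** star / denom if denom else float("inf")

    # Entity covered by 3 of 4 failing tests and 1 of 10 passing tests:
    print(tarantula(1, 9, 3, 1), ochiai(1, 3, 1), dstar(1, 3, 1))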

2.2 Mutation-Based Fault Localization

Techniques in the mutation-based fault localization (MBFL) family compute suspiciousness scores based on mutation analysis (Jia and Harman 2011), which generates many mutants of a program p by applying random transformations to it (for example, change a comparison operator < to \(\le \) in an expression). A mutant m of p is thus a variant of p whose behavior differs from p’s at, or after, the location where m differs from p. The key idea of mutation analysis is to collect information about p’s runtime behavior based on how it differs from its mutants’. Accordingly, when a test t behaves differently on p than on m (for example, p passes t but m fails it), we say that t kills m.

Fig. 2: MBFL formulas to compute the suspiciousness score of a mutant m given tests \(T = P \cup F\) partitioned into passing P and failing F. All formulas compute a score that is higher the more failing tests \(F^k(m)\) kill m, and lower the more passing tests \(P^k(m)\) kill m—capturing the basic heuristics of mutation analysis. On the right, MBFL formulas to compute the suspiciousness score of a program entity e by aggregating the suspiciousness score of all mutants \(m \in M\) that modified e in the original program

To perform fault localization on a program p, MBFL techniques first generate a large number of mutants \(M = \{ m_1, m_2, \ldots \}\) of p by systematically applying each mutation operator to each statement of p that is executed by any failing test in F. Then, given tests \(T = P \cup F\) as above, and a mutant \(m \in M\): i) \(P^k(m)\) is the number of tests that p passes but m fails (that is, the tests in P that kill m); ii) \(F^k(m)\) is the number of tests that p fails but m passes (that is, the tests in F that kill m); iii) and \(F^k_{\sim }(m)\) is the number of tests that p fails and that behave differently on m, either because they pass on m or because they still fail but lead to a different stack trace (this is a weaker notion of tests that kill m (Papadakis and Le Traon 2015)). Figure 2 shows how Metallaxis (Papadakis and Le Traon 2015) and Muse (Moon et al. 2014)—two widely used MBFL techniques—compute suspiciousness scores of each mutant in M.

Metallaxis’s formula (4) is formally equivalent to Ochiai’s—except that it is computed for each mutant and measures killing tests instead of covering tests. In Muse’s formula (5), \(\sum _{n \in M} F^k(n)\) is the total number of failing tests in F that kill any mutant in M, and \(\sum _{n \in M} P^k(n)\) is the total number of passing tests in P that kill any mutant in M (these are called \( f2p \) and \( p2f \) in Muse’s paper (Moon et al. 2014)).

Finally, MBFL computes a suspiciousness score for a program entity e by aggregating the suspiciousness scores of all mutants that modified e in the original program p; when this is the case, we say that a mutant m mutates e. The right-hand side of Fig. 2 shows Metallaxis’s and Muse’s suspiciousness formulas for entities: Metallaxis (4) takes the largest (maximum) mutant score, whereas Muse (5) takes the average (mean) of the mutant scores.
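
The following sketch (our own simplification, assuming the kill counts of each mutant have already been collected, and using the weaker kill notion \(F^k_{\sim }\) for Metallaxis as described above) shows the two mutant-level scores and the two aggregation strategies:

    from math import sqrt

    def metallaxis_mutant(f_kill, p_kill, num_failing):
        # Ochiai-like score over killing tests: f_kill / sqrt(|F| * (f_kill + p_kill)).
        denom = sqrt(num_failing * (f_kill + p_kill))
        return f_kill / denom if denom else 0.0

    def muse_mutant(f_kill, p_kill, total_f_kills, total_p_kills):
        # Normalized fail-to-pass kills minus normalized pass-to-fail kills.
        f_term = f_kill / total_f_kills if total_f_kills else 0.0
        p_term = p_kill / total_p_kills if total_p_kills else 0.0
        return f_term - p_term

    def entity_score(mutant_scores, aggregate):
        # Aggregate the scores of all mutants that mutate the entity.
        return aggregate(mutant_scores) if mutant_scores else 0.0

    scores = [0.3, 0.8, 0.1]
    print(entity_score(scores, max))                        # Metallaxis: maximum
    print(entity_score(scores, lambda s: sum(s) / len(s)))  # Muse: mean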

2.3 Predicate Switching

The predicate switching (PS) (Zhang et al. 2006) fault localization technique identifies critical predicates: branching conditions (such as those of if and while statements) that are related to a program’s failure. PS’s key idea is that if forcibly changing a predicate’s value turns a failing test into a passing one, the predicate’s location is a suspicious program entity.

For each failing test \(t \in F\), PS starts from t’s execution trace (the sequence of all statements executed by t), and finds t’s subsequence \(b^t_1\,b^t_2\,\ldots \) of branching statements. Then, by instrumenting the program p under analysis, it generates, for each branching statement \(b^t_k\), a new execution of t where the predicate (branching condition) \(c^t_k\) evaluated by statement \(b^t_k\) is forcibly switched (negated) at runtime (that is, the new execution takes the other branch at \(b^t_k\)). If switching predicate \(c^t_k\) makes the test execution pass, then \(c^t_k\) is a critical predicate. Finally, PS assigns a (positive) suspiciousness score to all critical predicates in all tests F: \(\textsf{PS}_F(c_k^t)\) is higher, the fewer critical predicates are evaluated between \(c_k^t\) and the failure location when executing \(t \in F\) (Zou et al. 2021).Footnote 4 For example, the most suspicious program entity e will be the location of the last critical predicate evaluated before any test failure.

PS has some distinctive features compared to other FL techniques. First, it only uses failing tests for its dynamic analysis; any passing tests P are ignored. Second, the only program entities it can report as suspicious are locations of predicates; thus, it usually reports a shorter list of suspicious locations than SBFL and MBFL techniques. Third, while MBFL mutates program code, PS dynamically mutates individual program executions. For example, suppose that a loop while \(\;c\): B executes its body B twice—and hence, the loop condition c is evaluated three times—in a failing test. Then, PS will generate three variants of this test execution: i) one where the loop body never executes (c is false the first time it is evaluated); ii) one where the loop body executes once (c is false the second time it is evaluated); iii) one where the loop body executes three or more times (c is true the third time it is evaluated).
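
The following toy sketch illustrates the core mechanism (it is not FauxPy’s instrumentation; Switcher, buggy_abs, and failing_test are invented for illustration): a manually instrumented predicate is negated at its k-th runtime evaluation, and it is reported as critical if the failing test then passes.

    class Switcher:
        # Negates the switch_at-th evaluation of the instrumented predicate.
        def __init__(self, switch_at):
            self.switch_at = switch_at
            self.count = 0

        def evaluate(self, value):
            self.count += 1
            return (not value) if self.count == self.switch_at else value

    def buggy_abs(x, switcher=None):
        cond = x > 0                          # faulty predicate: should be x < 0
        if switcher is not None:
            cond = switcher.evaluate(cond)
        return -x if cond else x

    def failing_test(switcher=None):
        return buggy_abs(-3, switcher) == 3   # fails on the original program

    # Switch each runtime evaluation of the predicate in turn:
    for k in (1, 2):
        if failing_test(Switcher(k)):
            print(f"evaluation {k} of the predicate is critical")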

2.4 Stack Trace Fault Localization

When a program execution fails with a crash (for example, an uncaught exception), the language runtime usually prints its stack trace (the chain of methods active when the crash occurred) as debugging information to the user. In fact, it is known that stack trace information helps developers debug failing programs (Bettenburg et al. 2008); and a bug is more likely to be fixed if it is close to the top of a stack trace (Schroter et al. 2010). Based on these empirical findings, Zou et al. (2021) proposed the stack trace fault localization technique (ST), which uses the simple heuristics of assigning suspiciousness based on how close a program entity is to the top of a stack trace.

Concretely, given a failing test \(t \in F\), its stack trace is a sequence \(f_1\,f_2\,\ldots \) of the stack frames of all functions that were executing when t terminated with a failure, listed in reverse order of execution; thus, \(f_1\) is the most recently called function, which was directly called by \(f_2\), and so on. ST assigns a (positive) suspiciousness score to any program entity e that belongs to any function \(f_k\) in t’s stack trace: \(\textsf{ST}_t(e) = 1/k\), so that e’s suspiciousness is higher, the closer to the failure e’s function was called.Footnote 5 In particular, the most suspicious program entities will be all those in the function \(f_1\) called in the top stack frame. Then, the overall suspiciousness score of e is the maximum in all failing tests F: \(\textsf{ST}_F(e) = \max _{t \in F} \textsf{ST}_t(e)\).
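
A minimal sketch of this scoring, assuming each failing test’s stack trace is available as a list of function names with the most recently called function first (the traces in the example are invented):

    def st_scores(failing_traces):
        # Suspiciousness of a function: maximum over failing tests of 1/k,
        # where k is the function's 1-based position in the stack trace.
        scores = {}
        for trace in failing_traces:
            for k, func in enumerate(trace, start=1):
                scores[func] = max(scores.get(func, 0.0), 1.0 / k)
        return scores

    traces = [["parse", "load", "main"], ["validate", "main"]]
    print(st_scores(traces))   # parse: 1.0, validate: 1.0, load: 0.5, main: 0.5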

2.5 Granularities

Fault localization granularity refers to the kinds of program entity that a FL technique ranks. The most widely studied granularity is statement-level, where each statement in a program may receive a different suspiciousness score (Pearson et al. 2017; Wong et al. 2014). However, coarser granularities have also been considered, such as function-level (also called method-level) (B. Le et al. 2016; Xuan and Monperrus 2014) and module-level (also called file-level) (Saha et al. 2013; Zhou et al. 2012).

In practice, implementations of FL techniques that support different levels of granularity focus on the finest granularity (usually, statement-level granularity), whose information they use to perform FL at coarser granularities. Namely, the suspiciousness of a function is the maximum suspiciousness of any statements in its definition; and the suspiciousness of a module is the maximum suspiciousness of any functions belonging to it.Footnote 6
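
A sketch of this lifting, with invented statement-level scores and containment maps:

    def lift(scores, containing):
        # Score of each coarser unit: maximum score of the entities it contains.
        lifted = {}
        for entity, score in scores.items():
            unit = containing[entity]
            lifted[unit] = max(lifted.get(unit, float("-inf")), score)
        return lifted

    statement_scores = {("mod.py", 3): 0.9, ("mod.py", 8): 0.4, ("util.py", 2): 0.7}
    statement_to_function = {("mod.py", 3): "mod.f", ("mod.py", 8): "mod.f", ("util.py", 2): "util.g"}
    function_to_module = {"mod.f": "mod.py", "util.g": "util.py"}

    function_scores = lift(statement_scores, statement_to_function)  # {'mod.f': 0.9, 'util.g': 0.7}
    module_scores = lift(function_scores, function_to_module)        # {'mod.py': 0.9, 'util.py': 0.7}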

2.6 FauxPy: Features and Implementation

Despite Python’s popularity as a programming language, we could not find off-the-shelf implementations of fault localization techniques for it at the time of writing (Sarhan and Beszédes 2022). The only exception is CharmFL (Idrees Sarhan et al. 2021)—a plugin for the PyCharm IDE—which only implements SBFL techniques. Therefore, to conduct an extensive empirical study of FL in Python, we implemented FauxPy: a fault localization tool for Python programs.

FauxPy supports all seven FL techniques described in Sections 2.1–2.4; it can localize faults at the level of statements, functions, or modules (Section 2.5). To make FauxPy a flexible and extensible tool, easy to use with a variety of other commonly used Python development tools, we implemented it as a stand-alone command-line tool that works with tests in the formats supported by Pytest, Unittest, and Hypothesis (MacIver et al. 2019)—three popular Python testing frameworks.

While running, FauxPy stores intermediate analysis data in an SQLite database; upon completing a fault localization run, it returns to the user a human-readable summary—including suspiciousness scores and ranking of program entities. The database improves performance (for example, by caching intermediate results), and it also facilitates incremental analyses—for example, where we provide different batches of tests in different runs.

FauxPy’s implementation uses Coverage.py (Batchelder 2023)—a popular code-coverage measurement library—to collect the execution traces needed for SBFL and MBFL. It also uses the state-of-the-art mutation-testing framework Cosmic Ray (2019) to generate mutants for MBFL; since Cosmic Ray is easily configurable to use some or all of its mutation operators—or even to add new user-defined mutation operators—FauxPy’s MBFL implementation is also fully configurable. To implement PS in FauxPy, we developed an instrumentation library that can selectively change the runtime value of predicates in different runs as required by the PS technique. The implementation of FauxPy is available as open-source (see this paper’s replication package).

3 Related Work

Fault localization has been an intensely researched topic for over two decades, whose popularity does not seem to wane (Wong et al. 2016). This section summarizes a selection of studies that are directly relevant for the paper; Wong’s recent survey (Wong et al. 2016) provides a broader summary for interested readers.

Spectrum-based Fault Localization

The Tarantula SBFL technique (Jones and Harrold 2005) was one of the earliest and most influential FL techniques, also thanks to its empirical evaluation showing it is more effective than other competing techniques (Renieres and Reiss 2003; Cleve and Zeller 2005). The Ochiai SBFL technique (Abreu et al. 2007) improved over Tarantula, and it is often still considered the “standard” SBFL technique.

These earlier empirical studies (Jones and Harrold 2005; Abreu et al. 2007), as well as other contemporary and later studies of FL (Papadakis and Le Traon 2015), used the Siemens suite (Hutchins et al. 1994): a set of seven small C programs with seeded bugs. Since then, the scale and realism of FL empirical studies have significantly improved, targeting real-world bugs affecting projects of realistic size. For example, Ochiai’s effectiveness was confirmed (Le et al. 2013) on a collection of more realistic C and Java programs (Do et al. 2005). When Wong et al. (2014) proposed DStar, a new SBFL technique, they demonstrated its capabilities in a sweeping comparison involving 38 other SBFL techniques (including the “classic” Tarantula and Ochiai). In contrast, numerous empirical results about fault localization in Java based on experiments with artificial faults were found not to carry over to experiments with real-world faults (Pearson et al. 2017) using the Defects4J curated collection (Just et al. 2014).

Mutation-based Fault Localization

With the introduction of novel fault localization families—most notably, MBFL—empirical comparison of techniques belonging to different families became more common (Moon et al. 2014; Papadakis and Le Traon 2015; Pearson et al. 2017; Zou et al. 2021). The Muse MBFL technique (Moon et al. 2014) was introduced to overcome a specific limitation of SBFL techniques: the so-called “tie set problem”. This occurs when SBFL assigns the same suspiciousness score to different program entities, simply because they belong to the same simple control-flow block (see Section 2.1 for details on how SBFL works). Metallaxis-FL (Papadakis and Le Traon 2015) (which we simply call “Metallaxis” in this paper) is another take on MBFL that can improve over SBFL techniques.

The comparison between MBFL and SBFL is especially delicate given how MBFL works. As demonstrated by Pearson et al. (2017), MBFL’s effectiveness crucially depends on whether it is applied to bugs that are “similar” to those introduced by its mutation operators. This explains why the MBFL studies targeting artificially seeded faults (Moon et al. 2014; Papadakis and Le Traon 2015) found MBFL to outperform SBFL; whereas studies targeting real-world faults (Pearson et al. 2017; Zou et al. 2021) found the opposite to be the case—a result also confirmed by the present paper in Section 5.1.

Mutation Testing

MBFL techniques rely on mutation testing to generate mutants of a faulty program that may help locate the fault. Therefore, the selection of mutation operators that are used for mutation testing impacts the effectiveness of MBFL techniques. Research in mutation testing has grown considerably in the last decade, developing a large variety of mutation operators tailored to specific programming languages, applications, and faults (Papadakis et al. 2019). Despite these recent developments, the fundamental set of mutation operators introduced in Offutt et al.’s seminal work (Offutt et al. 1996) remains the basis of essentially every application of mutation testing. These fundamental operators generate mutants by modifying or removing arithmetic, logical, and relational operators, as well as constants and variables in a program, and hence are widely applicable and domain-agnostic. Notably, the Cosmic Ray (2019) Python mutation testing framework (used in our implementation of FauxPy), the two other popular Python mutation testing frameworks MutPy (Derezińska and Hałas 2014) and mutmut,Footnote 7 as well as the popular Java mutation testing frameworks Pitest,Footnote 8 MuJava (Ma et al. 2005) and Major (Just 2014) (the latter used in Zou et al.’s MBFL experiments (Zou et al. 2021)) all offer Offutt et al.’s fundamental operators. This helps make experiments with mutation testing techniques meaningfully comparable.

Empirical Comparisons

This paper’s study design is based on Zou et al.’s empirical comparison of fault localization on Java programs (Zou et al. 2021). We chose their study because it is fairly recent (it was published in 2021), it is comprehensive (it targets 11 fault localization techniques in seven families, as well as combinations of some of these techniques), and it targets realistic programs and faults (357 bugs in five projects from the Defects4J curated collection).

Ours is a differentiated conceptual replication (Juzgado and Gómez 2010) of Zou et al.’s study (Zou et al. 2021). We target a comparable number of subjects (135 BugsInPy (Widyasari et al. 2020) bugs vs. 357 Defects4J (Just et al. 2014) bugs) from a wide selection of projects (13 real-world Python projects vs. five real-world Java projects). We study Zou et al. (2021)’s four main fault localization families SBFL, MBFL, PS, and ST, but we exclude three other families that featured in their study: DS (dynamic slicing (Hammacher 2008)), IRBFL (Information retrieval-based fault localization (Zhou et al. 2012)), and HBFL (history-based fault localization (Rahman et al. 2011)). IRBFL and HBFL were shown to be scarcely effective by Zou et al. (2021), and rely on different kinds of artifacts that may not always be available when dynamically analyzing a program as done by the other “mainstream” fault localization techniques. Namely, IRBFL analyzes bug reports, which may not be available for all bugs; HBFL mines commit histories of programs. In contrast, our study only includes techniques that solely rely on tests to perform fault localization; this helps make the comparison between techniques consistent. Finally, we excluded DS for practical reasons: implementing it requires accurate data- and control-dependency static analyses (Zeller 2009). These are available in languages like Java through widely used frameworks like Soot (Vallée-Rai et al. 1999; Lam et al. 2011); in contrast, Python currently offers few mature static analysis tools (e.g., Scalpel (Li et al. 2022)), none with the features required to implement DS. Dynamic slicing has been implemented for Python in the past (Chen et al. 2014), but unfortunately no implementation is publicly available; and building one from scratch is outside the present paper’s scope.

Python Fault Localization

Despite Python’s popularity as a programming language, the vast majority of fault localization empirical studies target other languages—mostly C, C++, and Java. To our knowledge, CharmFL (Szatmári et al. 2022; Idrees Sarhan et al. 2021) is the only available implementation of fault localization techniques for Python; the tool is limited to SBFL techniques. We could not find any realistic-size empirical study of fault localization using Python programs comparing techniques of different families. This gap in both the availability of tools (Sarhan and Beszédes 2022) and the empirical knowledge about fault localization in Python motivated the present work.

Note that numerous recent empirical studies looked into fault localization for deep-learning models implemented in Python (Eniser et al. 2019; Guo et al. 2020; Zhang et al. 2020, 2021; Schoop et al. 2021; Wardat et al. 2021). This is a very different problem, using very different techniques, than “classic” program-based fault localization, which is the topic of our paper.

Deep learning-based Fault Localization

Deep learning models have recently been applied to the software fault localization problem. The key idea of techniques such as DeepFL (Li et al. 2019), GRACE (Lou et al. 2021), and DEEPRL4FL (Li et al. 2021) is to train a deep learning model to identify suspicious locations, giving it as input coverage information, as well as other encoded information about the source code of the faulty programs (such as the data and control-flow dependencies). While these approaches are promising, we could not include them in our empirical study since they do not have the same level of maturity as the other “classic” FL techniques we considered. First, DeepFL and GRACE only work at function-level granularity, whereas the bulk of FL research targets statement-level granularity. Second, there are no reference implementations of techniques such as DEEPRL4FL that we can use for our experiments.Footnote 9 Third, the performance of a deep learning-based technique usually depends on the training set. Fourth, training a deep learning model is usually a time consuming process; how to account for this overhead when comparing efficiency is tricky.

Nevertheless, our empirical study does feature one FL technique that is based on machine learning: CombineFL, which is Zou et al.’s application of learning to rank to fault localization (Zou et al. 2021). The same paper also discusses how CombineFL outperforms other state-of-the-art machine learning-based fault localization techniques such as MULTRIC (Xuan and Monperrus 2014), Savant (B. Le et al. 2016), TraPT (Li and Zhang 2017), and FLUCCS (Sohn and Yoo 2017). Therefore, CombineFL is a valid representative of the capabilities of pre-deep learning machine learning FL techniques.

Python vs. Java SBFL Comparison

To our knowledge, Widyasari et al.’s recent empirical study of spectrum-based fault localization (Widyasari et al. 2022) is the only currently available large-scale study targeting real-world Python projects. Like our work, they use the bugs in the BugsInPy curated collection as experimental subjects (Widyasari et al. 2020); and they compare their results to those obtained by others for Java (Pearson et al. 2017). Besides these high-level similarities, the scopes of our study and Widyasari et al.’s are fundamentally different: i) We are especially interested in comparing fault localization techniques in different families; they consider exclusively five spectrum-based techniques, and drill down into the relative performance of these techniques. ii) Accordingly, we consider orthogonal categorization of bugs: we classify bugs (see Section 4.3) according to characteristics that match the capabilities of different fault-localization families (e.g., stack-trace fault localization works for bugs that result in a crash); they classify bugs according to syntactic characteristics (e.g., multi-line vs. single-line patch). iii) Most important, even though both our paper and Widyasari et al.’s compare Python to Java, the framing of our comparisons is quite different: in Section 5.6, we compare our findings about fault localization in Python to Zou et al. (2021)’s findings about fault localization in Java; for example, we confirm that SBFL techniques are generally more effective than MBFL techniques in Python, as they were found to be in Java. In contrast, Widyasari et al. directly compare various SBFL effectiveness metrics they collected on Python programs against the same metrics (Pearson et al. 2017) collected on Java programs; for example, Widyasari et al. report that the percentage of bugs in BugsInPy that their implementation of the Ochiai SBFL technique correctly localized within the top-5 positions is considerably lower than the percentage of bugs in Defects4J that Pearson et al.’s implementation of the Ochiai SBFL technique correctly localized within the top-5.

It is also important to note that there are several technical differences between our methodology and Widyasari et al.’s. First, we handle ties between suspiciousness scores by computing the \(E_{\textrm{inspect}}\) rank (described in Section 4.5), whereas they use the average rank (as well as other effectiveness metrics). Second, even though we also take our subjects from BugsInPy, we carefully selected a subset of bugs that are fully analyzable on our infrastructure with all fault localization techniques we consider (Section 4.1, Section 4.7), whereas they use all available BugsInPy bugs. The selection of subjects is likely to impact the value of some metrics more than others (see Section 4.5); for example, the exam score is undefined for bugs that a fault localization technique cannot localize, whereas the top-k counts are lower the more faults cannot be localized. These and numerous other differences make our results and Widyasari et al.’s incomparable and mostly complementary. A replication of their comparison following our methodology is an interesting direction for future work, but clearly outside the present paper’s scope. In Section 6.1 we present some additional data, and outline a few directions for future work that are directly inspired by Widyasari et al.’s study (Widyasari et al. 2022).

4 Experimental Design

Our experiments assess and compare the effectiveness and efficiency of the seven FL techniques described in Section 2, as well as of their combinations, on real-world Python programs and faults. To this end, we target the following research questions:

RQ1: How effective are the fault localization techniques? RQ1 compares fault localization techniques according to how accurately they identify program entities that are responsible for a fault.

RQ2: How efficient are the fault localization techniques? RQ2 compares fault localization techniques according to their running time.

RQ3: Do fault localization techniques behave differently on different faults? RQ3 investigates whether a fault localization technique’s effectiveness and efficiency depend on which kinds of faults and programs it analyzes.

RQ4: Does combining fault localization techniques improve their effectiveness? RQ4 studies whether combining the information of different fault localization techniques for the same faults improves effectiveness compared to applying each technique in isolation.

RQ5: How does program entity granularity impact fault localization effectiveness? RQ5 analyzes the relation between effectiveness and granularity: does the relative effectiveness of fault localization techniques change as they target coarser-grained program entities?

RQ6: Are fault localization techniques as effective on Python programs as they are on Java programs? RQ6 compares our overall results to Zou et al. (2021)’s, exploring similarities and differences between Java and Python programs.

Table 1 Overview of projects in BugsInPy
Table 2 Selected BugsInPy bugs used in the paper’s experiments

4.1 Subjects

To have a representative collection of realistic Python bugs,Footnote 10 we used BugsInPy (Widyasari et al. 2020), a curated dataset of real bugs collected from real-world Python projects, with all the information needed to reproduce the bugs in controlled experiments. Table 1 overviews BugsInPy’s 501 bugs from 17 projects.

Project Category

The category columns in Tables 1 and 2 partition all BugsInPy projects into four non-overlapping categories:

Command line (cl) projects consist of tools mainly used through their command line interface.

Development (dev) projects offer libraries and utilities useful to software developers.

Data science (ds) projects consist of machine learning and numerical computation frameworks.

Web (web) projects offer libraries and utilities useful for web development.

We classified the projects according to their description in their respective repositories, as well as how they are presented in BugsInPy. As in any classification, the boundaries between categories may be somewhat fuzzy, but the main focus of most projects is quite obvious (such as ds for keras and pandas, or cl for youtube-dl).

Unique Bugs

Each bug \(b = \langle p_b^-, p_b^+, F_b, P_b\rangle \) in BugsInPy consists of: i) a faulty version \(p_b^-\) of the project, such that tests in \(F_b\) all fail on it (all due to the same root cause); ii) a fixed version \(p_b^+\) of the project, such that all tests in \(F_b \cup P_b\) pass on it; iii) a collection of failing \(F_b\) and passing \(P_b\) tests, such that tests in \(P_b\) pass on both the faulty \(p_b^-\) and fixed \(p_b^+\) versions of the project, whereas tests in \(F_b\) fail on the faulty \(p_b^-\) version and pass on the fixed \(p_b^+\) version of the project.

Bug Selection

Despite BugsInPy’s careful curation, several of its bugs cannot be reproduced because their dependencies are missing or no longer available; this is a well-known problem that plagues reproducibility of experiments involving Python programs (Mukherjee et al. 2021). In order to identify which BugsInPy bugs were reproducible at the time of our experiments on our infrastructure, we took the following steps for each bug b:

  i) Using BugsInPy’s scripts, we generated and executed the faulty \(p_b^-\) version and checked that tests in \(F_b\) fail whereas tests in \(P_b\) pass on it; and we generated and executed the fixed \(p_b^+\) version and checked that all tests in \(F_b \cup P_b\) pass on it. Out of all of BugsInPy’s bugs, 120 failed this step; we did not include them in our experiments.

  ii) Python projects often have two sets of dependencies (requirements): one for users and one for developers; both are needed to run fault localization experiments, which require instrumenting the project code. Another 39 bugs in BugsInPy miss some development dependencies; we did not include them in our experiments.

  iii) Two bugs resulted in an empty ground truth (Section 4.2): essentially, there is no way of localizing the fault in \(p_b^-\); we did not include these bugs in our experiments.

This resulted in \(501 - 120 - 39 - 2 =\) 340 bugs in 13 projects (all but ansible, matplotlib, PySnooper, and scrapy) that we could reproduce in our experiments.

However, this is still an impractically large number: just reproducing each of these bugs in BugsInPy takes nearly a full week of running time, and each FL experiment may require rerunning the same tests several times (hundreds of times in the case of MBFL). Thus, we first discarded 27 bugs that each take more than 48 hours to reproduce. We estimate that including these 27 bugs in the experiments would have taken over 14 CPU-months just for the MBFL experiments—not counting other FL techniques, nor the time for setup and dealing with unexpected failures.

Running all the fault localization experiments for each of the remaining \(313 = 340 - 27\) bugs takes approximately eleven CPU-hours, for a total of nearly five CPU-months. We selected 135 bugs out of the 313 using stratified random sampling with the four project categories as the “strata”, picking: 43 bugs in category cl, 30 bugs in category dev, 42 bugs in category ds, and 20 bugs in category web. This gives us a still sizable, balanced, and representativeFootnote 11 sample of all bugs in BugsInPy, which we could exhaustively analyze in around two CPU-months worth of experiments. In all, we used this selection of 135 bugs as our empirical study’s subjects. Table 2 gives some details about the selected projects and their bugs.
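
For concreteness, the sampling step amounts to something like the following sketch, where reproducible_bugs is a hypothetical mapping from project category to the 313 analyzable bugs; the quotas are those listed above, and the seed is arbitrary.

    import random

    QUOTAS = {"cl": 43, "dev": 30, "ds": 42, "web": 20}

    def stratified_sample(reproducible_bugs, quotas, seed=0):
        # Pick quotas[c] bugs uniformly at random from each category c.
        rng = random.Random(seed)
        sample = []
        for category, quota in quotas.items():
            sample.extend(rng.sample(reproducible_bugs[category], quota))
        return sample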

As a side comment, note that our experiments with BugsInPy were generally more time consuming than Zou et al.’s experiments with Defects4J. For example, the average per-bug running time of MBFL in our experiments (15774 seconds in Table 6) was 3.3 times larger than in Zou et al.’s (4800 seconds in (Zou et al. 2021, Table 9)). Even more strikingly, running all fault localization experiments on the 357 Defects4J bugs took less than one CPU-month;Footnote 12 in contrast, running MBFL on just 27 “time consuming” bugs in BugsInPy takes over 14 CPU-months. This difference may be partly due to the different characteristics of projects in Defects4J vs. BugsInPy, and partly to the dynamic nature of Python (which is run by an interpreter).

Fig. 3: An example of program edit, and the corresponding ground truth faulty locations

4.2 Faulty Locations: Ground Truth

A fault localization technique’s effectiveness measures how accurately the technique’s list of suspicious entities matches the actual fault locations in a program—fault localization’s ground truth. It is customary to use programmer-written patches as ground truth (Zou et al. 2021; Pearson et al. 2017): the program locations modified by the patches that fix a certain bug correspond to the bug’s actual fault locations.

Concretely, here is how to determine the ground truth of a bug \(b = \langle p_b^-, p_b^+, F_b, P_b\rangle \) in BugsInPy. The programmer-written fix \(p_b^+\) consists of a series of edits to the faulty program \(p_b^-\). Each edit can be of three kinds: i) add, which inserts into \(p_b^+\) a new program location; ii) remove, which deletes a program location in \(p_b^-\); iii) modify, which takes a program location in \(p_b^-\) and changes parts of it, without changing its location, in \(p_b^+\). Take, for instance, the program in Fig. 3b, which modifies the program in Fig. 3a; the edited program includes two adds (lines 22, 31), one remove (line 35), and one modify (line 28).

Bug b’s ground truth \(\mathcal {F}(b)\) is a set of locations in \(p_b^-\) that are affected by the edits, determined as follows. First of all, ignore any blank or comment lines, since these do not affect a program’s behavior and hence cannot be responsible for a fault. Then, finding the ground truth locations corresponding to removes and modifies is straightforward: a location \(\ell \) that is removed or modified in \(p_b^+\) exists by definition also in \(p_b^-\), and hence it is part of the ground truth. In Fig. 3, line 10 is modified and line 17 is removed by the edit that transforms Fig. 3a into Fig. 3b; thus 10 and 17 are part of the example’s ground truth.

Finding the ground truth locations corresponding to adds is more involved (Sarhan and Beszédes 2022), because a location \(\ell \) that is added to \(p_b^+\) does not exist in \(p_b^-\): b is a fault of omission (Pearson et al. 2017).Footnote 13 A common solution (Zou et al. 2021; Pearson et al. 2017) is to take as ground truth the location in \(p_b^-\) that immediately follows \(\ell \). In Fig. 3, line 6 corresponds to the first non-blank line that follows the assignment statement that is added at line 22 in Fig. 3b; thus 6 is part of the example’s ground truth. However, an add at \(\ell \) is actually a modification between two other locations; therefore, the location that immediately precedes \(\ell \) should also be part of the ground truth, since it identifies the same insertion location. In Fig. 3, line 1 precedes the assignment statement that is added at line 22 in Fig. 3b; thus 1 is also part of the example’s ground truth.

A location’s scope poses a final complication to determine the ground truth of adds. Consider line 31, added in Fig. 3b at the very end of function foo’s body. The (non-blank, non-comment) location that follows it in Fig. 3a is line 16; however, line 16 marks the beginning of another function bar’s definition. Function bar cannot be the location of a fault in foo, since the two functions are independent—in fact, the fact that bar’s declaration follows foo’s is immaterial. Therefore, we only include a location in the ground truth if it is within the same scope as the location \(\ell \) that has been added. If \(\ell \) is part of a function body (including methods), its scope is the function declaration; if \(\ell \) is part of a class outside any function (e.g., an attribute), its scope is the class declaration; and otherwise \(\ell \)’s scope is the module it belongs to. In Fig. 3, both lines 1 and 6 are within the same module as the added statement at line 22 in Fig. 3b. In contrast, line 16 is within a different scope than the added statement at line 31 in Fig. 3b. Therefore, lines 1, 6, and 12 (the in-scope location that precedes the add at line 31) are part of the ground truth, but not line 16.
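
The following simplified sketch summarizes the procedure. It assumes edits are already mapped to line numbers of the faulty version \(p_b^-\) (for an add, the insertion point), and it relies on two hypothetical helpers, is_code_line (blank/comment filtering) and same_scope (scope check), which hide the nontrivial parts of the analysis.

    def ground_truth(edits, faulty_lines, is_code_line, same_scope):
        # edits: list of (kind, line) with kind in {'modify', 'remove', 'add'};
        # for 'add', line is the insertion point in the faulty version.
        gt = set()
        code = sorted(l for l in faulty_lines if is_code_line(l))
        for kind, line in edits:
            if kind in ("modify", "remove"):
                gt.add(line)                    # the location exists in p_b^-
            else:  # 'add': surrounding in-scope locations of p_b^-
                preceding = max((l for l in code if l < line), default=None)
                following = min((l for l in code if l >= line), default=None)
                for neighbor in (preceding, following):
                    if neighbor is not None and same_scope(neighbor, line):
                        gt.add(neighbor)
        return gt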

Our definition of ground truth refines that used in related work (Zou et al. 2021; Pearson et al. 2017) by including the location that precedes an add, and by considering only locations within scope. We found that this definition better captures the programmer’s intent and their corrective impact on a program’s behavior.

How to best characterize bugs of omissions (fixed by an add) in fault localization remains an open issue (Sarhan and Beszédes 2022). Pearson et al.’s study (Pearson et al. 2017) proposed the first viable solution: including the location following an add. Zou et al. (2021) followed the same approach, and hence we also include the location following an add in our ground truth computation. We also noticed that, by also including the location preceding an add, and by taking scope into account, our ground truth computation becomes more comprehensive; in particular, it also works for statements added at the very end of a file—a location that has no following lines. While our approach is usually more precise, it is not necessarily the preferable alternative in all cases. Consider again, for instance, the add at line 31 in Fig. 3; if we ignored the scope (and the preceding statement), only line 16 would be included in its ground truth. If this fault localization information were consumed by a developer, it could still be useful and actionable even if it reports a line outside the scope of the actual add location: the developer would use the location as a starting point for their inspection of the nearby code; and they may prefer a smaller, if slightly imprecise, ground truth to a larger, redundant one. However, this paper’s focus is strictly evaluating the effectiveness of FL techniques as rigorously as possible—for which our stricter ground truth computation is more appropriate.

4.3 Classification of Faults

Bug Kind

The information used by each fault localization technique naturally captures the behavior of different kinds of faults. Stack trace fault localization analyzes the call stack after a program terminates with a crash; predicate switching targets branching conditions as program entities to perform fault localization; and MBFL crucially relies on the analysis of mutants to track suspicious locations.

Correspondingly, we classify a bug \(b = \langle p_b^-, p_b^+, F_b, P_b \rangle \) as:

Crashing bug if any failing test in \(F_b\) terminates abruptly with an unexpected uncaught exception.

Predicate bug if any faulty entity in the ground truth \(\mathcal {F}(b)\) includes a branching predicate (such as an if or while condition).

Mutable bug if any of the mutants generated by MBFL’s mutation operators mutates any locations in the ground truth \(\mathcal {F}(b)\). Precisely, a bug b’s mutability is the percentage of all mutants of \(p_b^-\) that mutate locations in \(\mathcal {F}(b)\); and b is mutable if its mutability is greater than zero.

The notion of crashing and predicate bugs is from Zou et al. (2021).

Fig. 4: Classification of the 135 BugsInPy bugs used in our experiments into three categories

We introduced the notion of mutable bug to try to capture scenarios where MBFL techniques have a fighting chance to correctly localize bugs. Since MBFL uses mutant analysis for fault localization, its capabilities depend on the mutation operators that are used to generate the mutants. Therefore, the notion of mutable bugs is somewhat dependent on the applied mutation operators.Footnote 14 Our implementation of FauxPy uses the standard operators offered by the popular Python mutation testing framework Cosmic Ray (2019). As we discussed in Section 3, Cosmic Ray features a set of mutation operators that are largely similar to several other general-purpose mutation testing frameworks—all based on Offutt et al.’s well-known work (Offutt et al. 1996). These strong similarities between the mutation operators offered by most widely used mutation testing frameworks suggest that our definition of “mutable bug” is not strongly dependent on the specific mutation testing framework that is used. Correspondingly, bugs that we classify as “mutable” are likely to remain amenable to localization with MBFL provided one uses (at least) this standard set of core mutation operators. Conversely, we expect that devising new, specialized mutation operators may extend the number of bugs that we can classify as “mutable”, and hence that are more likely to be amenable to localization with MBFL techniques.
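
As a sketch (assuming each generated mutant records the line it mutates; this is not FauxPy’s code), the mutability measure boils down to:

    def mutability(mutants, ground_truth):
        # Percentage of mutants whose mutated location is in the ground truth.
        if not mutants:
            return 0.0
        hits = sum(1 for m in mutants if m["mutated_line"] in ground_truth)
        return 100.0 * hits / len(mutants)

    def is_mutable(mutants, ground_truth):
        return mutability(mutants, ground_truth) > 0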

Figure 4 shows the kinds of the 135 BugsInPy bugs we used in the experiments: 49 crashing bugs, 52 predicate bugs, 74 mutable bugs, and 34 bugs that do not belong to any of these categories.

Project Category

Another, orthogonal classification of bugs is according to the project category they belong to. We classify a bug b as a cl, dev, ds, or web bug according to the category of project (Table 2) b belongs to.

Table 3 An example of calculating the \(E_{\textrm{inspect}}\) metric \(\mathcal {I}_{b}(\ell , \langle \ell _1, s_1 \rangle \, \ldots \, \langle \ell _{n}, s_{n} \rangle )\) for a list of 10 suspicious locations \(\ell _1, \ldots , \ell _{10}\) ordered by their decreasing suspiciousness scores \(s_1, \ldots , s_{10}\)

4.4 Ranking Program Entities

Running a fault localization technique L on a bug b returns a list of program entities \(\ell _1, \ell _2, \ldots \), sorted by their decreasing suspiciousness scores \(s_1 \ge s_2 \ge \ldots \). The programmer (or, more realistically, a tool (Parnin and Orso 2011b; Gazzola et al. 2019)) will go through the entities in this order until a faulty entity (that is, an \(\ell \in \mathcal {F}(b)\) that matches b’s ground truth) is found. In this idealized process, the earlier a faulty entity appears in the list, the less time the programmer will spend going through the list, and the more effective fault localization technique L is on bug b. Thus, a program entity’s rank in the sorted list of suspicious entities is a key measure of fault localization effectiveness.

Computing a program entity \(\ell \)’s rank is trivial if there are no ties between scores. For example, consider Table 3’s first two program entities \(\ell _1\) and \(\ell _2\), with suspiciousness scores \(s_1 = 10\) and \(s_2 = 7\). Obviously, \(\ell _1\)’s rank is 1 and \(\ell _2\)’s is 2; since \(\ell _2\) is faulty (\(\ell _2 \in \mathcal {F}(b)\)), its rank is also a measure of how many entities will need to be inspected in the aforementioned debugging process.

When several program entities have the same suspiciousness score, their relative order in a ranking is immaterial (Debroy et al. 2010). Thus, it is common practice to give all of them the same average rank (Sarhan and Beszédes 2022; Steimann et al. 2013), capturing an average-case number of program entities inspected while going through the fault localization output list. For example, consider Table 3’s first five program entities \(\ell _1, \ldots , \ell _5\); \(\ell _3\), \(\ell _4\), and \(\ell _5\) all have the same suspiciousness score \(s = 4\). Thus, they all have the same average rank \(4 = (3 + 4 + 5) / 3\), which is a proxy of how many entities would need to be inspected if \(\ell _4\) were faulty but \(\ell _2\) were not.

Capturing the “average number of inspected entities” is trickier still if more than one entity is faulty among a group of tied entities. Consider now all of Table 3’s ten program entities; entities \(\ell _8\), \(\ell _9\), and \(\ell _{10}\) all have the same suspiciousness score \(s = 2\); \(\ell _8\) and \(\ell _9\) are faulty, whereas \(\ell _{10}\) is not. Their average rank \(9 = (8 + 9 + 10) / 3\) overestimates the number of entities to be inspected (assuming now that these are the only faulty entities in the output), since two entities out of three are faulty, and hence it is more likely that a faulty entity will appear before rank 9.

To properly account for such scenarios,  Zou et al. (2021) introduced the \(E_{\textrm{inspect}}\) metric, which ranks a program entity \(\ell \) within a list \(\langle \ell _1, s_1 \rangle \, \ldots \, \langle \ell _{n}, s_{n} \rangle \) of program entities \(\ell _1, \ldots , \ell _n\) with suspiciousness scores \(s_1 \ge \ldots \ge s_n\) as:

$$\begin{aligned} \mathcal {I}_{b}(\ell , \langle \ell _1, s_1 \rangle \, \ldots \, \langle \ell _{n}, s_{n} \rangle ) \quad =\quad \textsf{start}(\ell ) + \sum _{k = 1}^{\textsf{ties}(\ell ) - \textsf{faulty}(\ell )} k \,\frac{\left( {\begin{array}{c}\textsf{ties}(\ell ) - k - 1\\ \textsf{faulty}(\ell ) - 1\end{array}}\right) }{\left( {\begin{array}{c}\textsf{ties}(\ell )\\ \textsf{faulty}(\ell )\end{array}}\right) } \end{aligned}$$
(6)

In (6), \(\textsf{start}(\ell )\) is the position k of the first entity among those with the same score as \(\ell \)’s; \(\textsf{ties}(\ell )\) is the number of entities (including \(\ell \) itself) whose score is the same as \(\ell \)’s; and \(\textsf{faulty}(\ell )\) is the number of entities (including \(\ell \) itself) that tie \(\ell \)’s score and are faulty (that is \(\ell \in \mathcal {F}(b)\)). Intuitively, the \(E_{\textrm{inspect}}\) rank \(\mathcal {I}_{b}(\ell , \langle \ell _1, s_1 \rangle \, \ldots \, \langle \ell _{n}, s_{n} \rangle )\) is thus an average of all possible ranks where tied and faulty entities are shuffled randomly. When there are no ties, or only one entity among a group of ties is faulty, (6) coincides with the average rank.
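
A direct Python sketch of formula (6), assuming the output list is already sorted by decreasing suspiciousness and the queried entity is faulty; in the usage example, the scores of \(\ell _6\) and \(\ell _7\) are placeholder values, while the rest follows Table 3.

    from math import comb

    def e_inspect_rank(target, entities, scores, faulty_set):
        # E_inspect rank of `target` within the ranked (entity, score) list.
        score = scores[entities.index(target)]
        tied = [i for i, s in enumerate(scores) if s == score]
        start = tied[0] + 1                     # 1-based position of the first tie
        ties = len(tied)
        faulty = sum(1 for i in tied if entities[i] in faulty_set)
        return start + sum(
            k * comb(ties - k - 1, faulty - 1) / comb(ties, faulty)
            for k in range(1, ties - faulty + 1)
        )

    # l8, l9, l10 share score 2, and l8, l9 are faulty:
    scores = [10, 7, 4, 4, 4, 3, 3, 2, 2, 2]
    entities = [f"l{i}" for i in range(1, 11)]
    print(e_inspect_rank("l8", entities, scores, {"l2", "l8", "l9"}))  # 8.33...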

Henceforth, we refer to a location’s \(E_{\textrm{inspect}}\) rank \(\mathcal {I}_{b}(\ell , \langle \ell _1, s_1 \rangle \, \ldots \, \langle \ell _{n}, s_{n} \rangle )\) as simply its rank.

Fig. 5: Definitions of common FL effectiveness metrics. The top row shows two variants \(\mathcal {I}\), \({\widetilde{\mathcal {I}}}\) of the \(E_{\textrm{inspect}}\) metric, and the exam score \(\mathcal {E}\), for a generic bug b and fault localization technique L. The bottom row shows cumulative metrics for a set B of bugs: the “at n” metric \(L@_Bn\), and the average \({\widetilde{\mathcal {I}}}\) and \(\mathcal {E}\) metrics

Better vs. Worse Ranks

A clarification about terminology: a high rank is a rank that is close to the top-1 rank (the first rank), whereas a low rank is a rank that is further away from the top-1 rank. Correspondingly, a high rank corresponds to a small numerical ordinal value; and a low rank corresponds to a large numerical ordinal value. Consistently with this standard usage, the rest of the paper refers to “better” ranks to mean “higher” ranks (corresponding to smaller ordinals); and “worse” ranks to mean “lower” ranks (corresponding to larger ordinals).

4.5 Fault Localization Effectiveness Metrics

\(E_{\textrm{inspect}}\) Effectiveness

Building on the notion of rank—defined in Section 4.4—we measure the effectiveness of a fault localization technique L on a bug b as the rank of the first faulty program entity in the list \(L(b) = \langle \ell _1, s_1 \rangle \, \ldots \, \langle \ell _{n}, s_{n} \rangle \) of entities and suspiciousness scores returned by L running on b—defined as \(\mathcal {I}_{b}(L)\) in (7). \(\mathcal {I}_{b}(L)\) is L’s \(E_{\textrm{inspect}}\) rank on bug b, which estimates the number of entities in L’s output one has to inspect to correctly localize b.

Generalized \(E_{\textrm{inspect}}\) Effectiveness

What happens if a FL technique L cannot localize a bug b—that is, b’s faulty entities \(\mathcal {F}(b)\) do not appear at all in L’s output? According to (6) and (7), \(\mathcal {I}_{b}(L)\) is undefined in these cases. This is not ideal, as it fails to measure the effort wasted going through the location list when using L to localize b—the original intuition behind all rank metrics. Thus, we introduce a generalization of L’s \(E_{\textrm{inspect}}\) rank on bug b as follows. Given the list \(L(b) = \langle \ell _1, s_1 \rangle \, \ldots \, \langle \ell _{n}, s_{n} \rangle \) of entities and suspiciousness scores returned by L running on b, let \(L^\infty (b) = \langle \ell _1, s_1 \rangle \, \ldots \, \langle \ell _{n}, s_{n} \rangle \, \langle \ell _{n+1}, s_0 \rangle \langle \ell _{n+2}, s_0 \rangle \ldots \) be L(b) followed by all other entities \(\ell _{n+1}, \ell _{n+2}, \ldots \) in program \(p_b^-\) that are not returned by L, each given a suspiciousness \(s_0 < s_n\) lower than any suspiciousness score assigned by L.

With this definition, \(\mathcal {I}_{b}(L) = {\widetilde{\mathcal {I}_{b}}}(L)\) whenever L can localize b—that is, some entity from \(\mathcal {F}(b)\) appears in L’s output list. If some technique \(L_1\) can localize b whereas another technique \(L_2\) cannot, \({\widetilde{\mathcal {I}_{b}}}(L_2) > {\widetilde{\mathcal {I}_{b}}}(L_1)\), thus reflecting that \(L_2\) is worse than \(L_1\) on b. Finally, if neither \(L_1\) nor \(L_2\) can localize b, \({\widetilde{\mathcal {I}_{b}}}(L_2) > {\widetilde{\mathcal {I}_{b}}}(L_1)\) if \(L_2\) returns a longer list than \(L_1\): all else being equal, a technique that returns a shorter list is “better” than one that returns a longer list since it requires less of the user’s time to inspect the output list. Accordingly, \({\widetilde{\mathcal {I}_{b}}}(L)\) denotes L’s generalized \(E_{\textrm{inspect}}\) rank on bug b—defined as in Fig. 5.
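
Concretely, the construction amounts to the following sketch (reusing e_inspect_rank from the sketch in Section 4.4; all_entities stands for the program entities of \(p_b^-\)):

    def generalized_rank(output, all_entities, faulty_set):
        # output: (entity, score) pairs sorted by decreasing score.
        entities = [e for e, _ in output]
        scores = [s for _, s in output]
        s0 = (min(scores) if scores else 0.0) - 1.0     # below any reported score
        returned = set(entities)
        missing = [e for e in all_entities if e not in returned]
        entities += missing
        scores += [s0] * len(missing)
        # The ground truth is non-empty, so some entity is always faulty:
        first_faulty = next(e for e in entities if e in faulty_set)
        return e_inspect_rank(first_faulty, entities, scores, faulty_set)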

Exam Score Effectiveness

Another commonly used effectiveness metric is the exam score \(\mathcal {E}_b(L)\) (Wong et al. 2008), which is FL technique L’s \(E_{\textrm{inspect}}\) rank on bug b divided by the number \(|p_b^-|\) of program entities of the analyzed buggy program \(p_b^-\)—as in (7). Just like \(\mathcal {I}_{b}(L)\), \(\mathcal {E}_b(L)\) is undefined if L cannot localize b.

Effectiveness of a Technique

To assess the overall effectiveness of a FL technique over a set B of bugs, we aggregate the previously introduced metrics in different ways—as in (8). The \(L@_Bn\) metric counts the number of bugs in B that L could localize within the top-n positions (according to their \(E_{\textrm{inspect}}\) rank); \(n=1, 3, 5, 10\) are common choices for n, reflecting a “feasible” number of entities to inspect. Then, the \(L@_Bn\% = 100 \cdot L@_Bn / |B|\) metric is simply \(L@_Bn\) expressed as a percentage of the number |B| of bugs in B. \({\widetilde{\mathcal {I}_{B}}}(L)\) is L’s average generalized \(E_{\textrm{inspect}}\) rank over the bugs in B. Finally, \(\mathcal {E}_B(L)\) is L’s average exam score over the bugs in B (thus ignoring bugs that L cannot localize).
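To make these definitions concrete, the following minimal Python sketch (an illustration, not FauxPy’s implementation) computes the per-bug \(E_{\textrm{inspect}}\) rank, the exam score, and the \(L@_Bn\) count from a ranked location list; it ignores tied suspiciousness scores and the padding used by the generalized rank, which Section 4.4 and Fig. 5 define precisely.

```python
# Illustrative sketch of the per-bug and aggregate effectiveness metrics.

def e_inspect(ranked_locations, faulty_locations):
    """1-based rank of the first faulty entity in L(b), or None if b is not localized."""
    for rank, loc in enumerate(ranked_locations, start=1):
        if loc in faulty_locations:
            return rank
    return None

def exam_score(ranked_locations, faulty_locations, num_program_entities):
    """E_inspect rank divided by the number of entities |p_b^-|; None if not localized."""
    rank = e_inspect(ranked_locations, faulty_locations)
    return None if rank is None else rank / num_program_entities

def at_n(e_inspect_ranks, n):
    """L@n: number of bugs localized within the top-n positions."""
    return sum(1 for rank in e_inspect_ranks if rank is not None and rank <= n)

# Example with made-up data: a bug whose only faulty line is "foo.py:42".
ranked = ["foo.py:10", "foo.py:42", "bar.py:7"]
assert e_inspect(ranked, {"foo.py:42"}) == 2
assert abs(exam_score(ranked, {"foo.py:42"}, 100) - 0.02) < 1e-9
assert at_n([2, None, 1, 15], n=3) == 2
```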

Location List Length

The \(|L_{b}|\) metric is simply the number of suspicious locations output by FL technique L when run on bug b; and \(|L_{B}|\) is the average of \(|L_{b}|\) over all bugs in B. The location list length metric is not, strictly speaking, a measure of effectiveness; rather, it complements the information provided by the other measures, as it indicates how much output a technique presents to the user. All else being equal, a shorter location list is preferable—provided it is not empty. In practice, we compare the location list length to the other effectiveness metrics, in order to better understand the trade-offs offered by each FL technique.

Different FL families use different kinds of information to compute suspiciousness scores; this is also reflected by the entities that may appear in their output location list. SBFL techniques include all locations executed by any tests \(T_b\) (passing or failing) even if their suspiciousness is zero; conversely, they omit all locations that are not executed by the tests. MBFL techniques include all locations executed by any failing tests \(F_b\), since these locations are the targets of the mutation operators. PS includes all locations of predicates (branching conditions) that are executed by any failing tests \(F_b\) and that are critical (as defined in Section 2.3). ST includes all locations of all functions that appear in the stack trace of any crashing test in \(F_b\).

Effectiveness Metrics: limitations

Despite being commonly used in fault localization research, the effectiveness metrics presented in this section rely on assumptions that may not realistically capture the debugging work of developers. First, they assume that a developer can understand the characteristics of a bug and devise a suitable fix by examining just one buggy entity; in contrast, debugging often involves disparate activities, such as analyzing control and data dependencies and inspecting program states with different inputs (Parnin and Orso 2011a). Second, debugging is often not a linear sequence of activities (Ko and Myers 2008) as simple as going through the ranked list of entities produced by fault localization techniques. Despite these limitations, we still rely on this section’s effectiveness metrics: on the one hand, they are used in practically all related work on fault localization (in particular, Zhou et al. 2012); thus, they make our results comparable to others. On the other hand, there are no viable, easy-to-measure alternative metrics that are also fully realistic; devising such metrics is outside this paper’s scope and belongs to future work.

4.6 Comparison: Statistical Models

To quantitatively compare the capabilities of different fault localization techniques, we consider several standard statistics.

Pairwise Comparisons

Let \(M_b(L)\) be any metric M measuring the capabilities of fault-localization technique L on bug b; M can be any of Section 4.5’s effectiveness metrics, or L’s wall-clock running time \(T_b(L)\) on bug b as a performance metric. Similarly, for a fault-localization family F, \(M_b(F)\) denotes the average value \(\sum _{k \in F} M_b(k) / |F|\) of \(M_b\) over all techniques in family F. Given a set \(B = \{ b_1, \ldots , b_n \}\) of bugs, we compare the two vectors \(M_B(F_1) = \langle M_{b_1}(F_1)\,\ldots \,M_{b_n}(F_1) \rangle \) and \(M_B(F_2) = \langle M_{b_1}(F_2)\,\ldots \,M_{b_n}(F_2) \rangle \) using three statistics (a sketch of their computation follows the list):

Correlation \(\tau \) between \(M_B(F_1)\) and \(M_B(F_2)\) computed using Kendall’s \(\tau \) statistics. The absolute value \(|\tau |\) of the correlation \(\tau \) measures how closely changes in the value of metric M for \(F_1\) over different bugs are associated to changes for \(F_2\) over the same bugs: if \(0 \le |\tau | \le 0.3\) the correlation is negligible; if \(0.3 < |\tau | \le 0.5\) the correlation is weak; if \(0.5 < |\tau | \le 0.7\) the correlation is medium; and if \(0.7 < |\tau | \le 1\) the correlation is strong.

P-value p of a paired Wilcoxon signed-rank test—a nonparametric statistical test comparing \(M_B(F_1)\) and \(M_B(F_2)\). A small value of p is commonly taken as evidence against the “null hypothesis” that the distributions underlying \(M_B(F_1)\) and \(M_B(F_2)\) have the same median:Footnote 15 usually, \(p \le 0.05\), \(p \le 0.01\), and \(p \le 0.001\) are three conventional thresholds of increasing strength.

Cliff’s \(\delta \) effect size—a nonparametric measure of how often the values in \(M_B(F_1)\) are larger than those in \(M_B(F_2)\). The absolute value \(|\delta |\) of the effect size \(\delta \) measures how much the values of metric M differ, on the same bugs, between \(F_1\) and \(F_2\) (Romano et al. 2006): if \(0 \le |\delta | < 0.147\) the differences are negligible; if \(0.147 \le |\delta | < 0.33\) the differences are small; if \(0.33 \le |\delta | < 0.474\) the differences are medium; and if \(0.474 \le |\delta | \le 1\) the differences are large.
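The sketch below illustrates how these three statistics could be computed for two paired metric vectors; it relies on SciPy for Kendall’s \(\tau \) and the Wilcoxon test, and computes Cliff’s \(\delta \) directly from its definition (the paper does not prescribe a particular implementation, so this is only one possible realization).

```python
# Sketch of the three pairwise statistics on paired metric vectors M_B(F1), M_B(F2).
from scipy.stats import kendalltau, wilcoxon

def cliffs_delta(xs, ys):
    """Fraction of pairs with x > y minus fraction of pairs with x < y."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

def pairwise_comparison(m_f1, m_f2):
    tau, _ = kendalltau(m_f1, m_f2)     # correlation
    _, p = wilcoxon(m_f1, m_f2)         # paired signed-rank test (needs nonzero differences)
    delta = cliffs_delta(m_f1, m_f2)    # effect size
    return tau, p, delta
```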

Regression Models

To disentangle the individual impact of several different factors (fault localization family, project category, and bug kind) on the capabilities of fault localization, we introduce two varying-effects regression models with normal likelihood and logarithmic link function.

$$\begin{aligned} \left[ \begin{array}{l} {E}_b \\ {T}_b \end{array} \right] \sim \textsf {MVNormal} \left( \! \left[ \begin{array}{l} {e}_b \\ {t}_b \end{array} \right] \!, S\! \right) \qquad \log ({e}_b)&=\ \alpha + \alpha _{{family}[b]} + \alpha _{{category}[b]} \nonumber \\ \log ({t}_b)&= \beta + \beta _{{family}[b]} + \beta _{{category}[b]} \end{aligned}$$
(9)
$$\begin{aligned} {E}_b \sim \textsf {Normal}\,({e}_b,\ \sigma ) \qquad \log ({e}_b)&= \left( \begin{array}{cl} \alpha &{} +\ \alpha _{{family}[b]} + \alpha _{{category}[b]} \\ &{} +\ c_{{family}[b]} \,{crashing}_b \\ &{} +\ p_{{family}[b]} \,{predicate}_b \\ &{} +\ m_{{family}[b]} \,\log (1 + {mutability}_b) \end{array} \right) \end{aligned}$$
(10)

Model (9) is multivariate, as it simultaneously captures effectiveness and runtime cost of fault localization. For each fault localization experiment on a bug b, (9) expresses the vector \([ E_b, T_b ]\) of standardizedFootnote 16 \(E_{\textrm{inspect}}\) metric \(E_b\) and running time \(T_b\) as drawn from a multivariate normal distribution whose means \(e_b\) and \(t_b\) are log-linear functions of various predictors. Namely, \(\log (e_b)\) is the sum of a base intercept \(\alpha \); a family-specific intercept \(\alpha _{{family}[b]}\), for each fault-localization family SBFL, MBFL, PS, and ST; and a category-specific intercept \(\alpha _{{category}[b]}\), for each project category cl, dev, ds, and web. The other model component \(\log (t_b)\) follows the same log-linear relation.

Model (10) is univariate, since it only captures the relation between bug kinds and effectiveness. For each fault localization experiment on a bug b, (10) expresses the standardized \(E_{\textrm{inspect}}\) metric \(E_b\) as drawn from a normal distribution whose mean \(e_b\) is a log-linear function of a base intercept \(\alpha \); a family-specific intercept \(\alpha _{{family}[b]}\); a category-specific intercept \(\alpha _{{category}[b]}\); a varying intercept \(c_{{family}[b]} {crashing}_b\), for the interactions between each family and crashing bugs; a varying intercept \(p_{{family}[b]} {predicate}_b\), for the interactions between each family and predicate bugs; and a varying slope \(m_{{family}[b]} \log (1 + {mutability}_b)\), for the interactions between each family and bugs with different mutability.Footnote 17 Variables crashing and predicate are indicator variables, which are equal to 1 respectively for crashing or predicate-related bugs, and 0 otherwise; variable mutability is instead the mutability percentage defined in Section 4.3.

Completing regression models (9) and (10) with suitable priors and fitting them on our experimental dataFootnote 18 gives a (sampled) distribution of values for the coefficients \(\alpha \)’s, c, p, m, and \(\beta \)’s, which we can analyze to infer the effects of the various predictors on the outcome. For example, if the 95% probability interval of \(\alpha _{F}\)’s distribution lies entirely below zero, it suggests that FL family F is consistently associated with below-average values of the \(E_{\textrm{inspect}}\) metric \(\mathcal {I}\); in other words, F tends to be more effective than techniques in other families. As another example, if the 95% probability interval of \(\beta _{C}\)’s distribution includes zero, it suggests that bugs in projects of category C are not consistently associated with different-than-average running times; in other words, bugs in these projects do not seem either faster or slower to analyze than those in other projects.
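As an illustration only, the following sketch shows how the univariate model (10) could be encoded in PyMC; the choice of PyMC, the priors, and all variable names are assumptions made here for exposition, not a description of the paper’s actual fitting procedure. E, family, category, crashing, predicate, and mutability stand for NumPy arrays with one entry per experiment (family and category as integer codes).

```python
# Hypothetical encoding of model (10); priors and tooling are illustrative assumptions.
import numpy as np
import pymc as pm

def fit_model_10(E, family, category, crashing, predicate, mutability):
    n_fam, n_cat = family.max() + 1, category.max() + 1
    with pm.Model():
        alpha = pm.Normal("alpha", 0.0, 1.0)                        # base intercept
        a_fam = pm.Normal("alpha_family", 0.0, 1.0, shape=n_fam)    # per-family intercepts
        a_cat = pm.Normal("alpha_category", 0.0, 1.0, shape=n_cat)  # per-category intercepts
        c_fam = pm.Normal("c_family", 0.0, 1.0, shape=n_fam)        # crashing interactions
        p_fam = pm.Normal("p_family", 0.0, 1.0, shape=n_fam)        # predicate interactions
        m_fam = pm.Normal("m_family", 0.0, 1.0, shape=n_fam)        # mutability slopes
        log_e = (alpha + a_fam[family] + a_cat[category]
                 + c_fam[family] * crashing
                 + p_fam[family] * predicate
                 + m_fam[family] * np.log1p(mutability))
        sigma = pm.Exponential("sigma", 1.0)
        pm.Normal("E_obs", mu=pm.math.exp(log_e), sigma=sigma, observed=E)
        return pm.sample()  # posterior samples of all coefficients
```

The 95% probability intervals discussed above can then be read off the posterior samples of each coefficient.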

4.7 Experimental Methodology

To answer Section 4’s research questions, we ran FauxPy using each of the 7 fault localization techniques described in Section 2 on all 135 selected bugs (described in Section 4.1) from BugsInPy v. b4bfe91, for a total of \(945 = 7 \times 135 \) FL experiments. Henceforth, the term “standalone techniques” refers to the 7 classic FL techniques described in Section 2; whereas “combined techniques” refers to the four techniques introduced for RQ4.

Test Selection

The test suites of projects such as keras (included in BugsInPy) are very large and can take more than 24 hours to run even once. Without a suitable test selection strategy, large-scale FL experiments would be prohibitively time consuming (especially for MBFL techniques, which rerun the same test suite hundreds of times). Therefore, we applied a simple test selection strategy to only include tests that directly target the parts of a program that contribute to the failures.Footnote 19

As we mentioned in Section 4.1, each bug b in BugsInPy comes with a selection of failing tests \(F_b\) and passing tests \(P_b\). The failing tests are usually just a few, and specifically trigger bug b. The passing tests, in contrast, are much more numerous, as they usually include all non-failing tests available in the project. In order to cull the number of passing tests to only those that expressly target the failing code, we applied a simple dependency analysis: for each BugsInPy bug b used in our experiments, we built the module-level call graph G(b) for the whole of b’s project;Footnote 20 each node in G(b) is a module of the project (including its tests), and each edge \(x_m \rightarrow y_m\) means that module \(x_m\) directly uses some entities defined in module \(y_m\). Consider any test module \(t_m\) of b’s project; we run the tests in \(t_m\) in our experiments if and only if: i) \(t_m\) includes at least one of the failing tests in \(F_b\); or ii) G(b) includes an edge \(t_m \rightarrow f_m\), where \(f_m\) is a module that includes at least one of b’s faulty locations \(\mathcal {F}(b)\) (see Section 4.2). In other words: we include all failing tests for b, as well as the passing tests that directly exercise the parts of the project that are faulty. This simple heuristic substantially reduced the number of tests that we had to run for the largest projects, without meaningfully affecting the fault localization’s scope.

Our test selection strategy does not include test modules that indirectly involve faulty locations (unless they include some failing tests): if the tests in a module \(t_m\) only directly call an application module \(x_m\), some parts of \(x_m\) in turn call another application module \(y_m\) (i.e., \(t_m \rightarrow x_m \rightarrow y_m\) in the module-level call graph), \(x_m\) does not include any faulty locations, and \(y_m\) does include some faulty locations, then we do not include the tests in \(t_m\) in our test suite; instead, we include other test modules \(u_m\) that directly call \(y_m\) (i.e., \(u_m \rightarrow y_m\)).
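The selection rule itself is simple; the sketch below restates it in Python over illustrative data structures (the dictionary-based call graph and the module names are assumptions made for the example, not FauxPy’s internal representation).

```python
# Sketch of the test selection rule: a test module t_m is selected iff it contains
# a failing test, or the module-level call graph has a direct edge from t_m to a
# module containing at least one faulty location.

def select_test_modules(test_modules, failing_test_modules, call_graph, faulty_modules):
    """call_graph maps each module to the set of modules it directly uses."""
    selected = set()
    for t_m in test_modules:
        if t_m in failing_test_modules:
            selected.add(t_m)                                 # condition i)
        elif call_graph.get(t_m, set()) & set(faulty_modules):
            selected.add(t_m)                                 # condition ii): direct edge t_m -> f_m
    return selected
```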

To demonstrate that our more aggressive test selection strategy does not exclude any relevant tests, and is unlikely to affect the quantitative fault localization results, we first computed, for each bug b used in our experiments: i) the set \(S^0_b\) of tests selected using the strategy described above; and ii) the set \(S^+_b \supseteq S^0_b\) of tests selected by also including indirect dependencies (i.e., by taking the transitive closure of the module-level use relation). For 48% of the 135 bugs used in our experiments, \(S^+_b = S^0_b\), that is, both test selection strategies select the same tests. However, there remains a long tail of bugs for which including indirect dependencies leads to many more tests being selected; for example, for 40 bugs in 7 projects, considering indirect dependencies leads to selecting more than 7 additional tests—which would significantly increase the experiments’ running time. Thus, we randomly selected one bug for each project among those 40 bugs for which indirect dependencies would lead to including more than 50 additional tests. For each bug b in this sample, we performed an additional run of our fault localization experiments with SBFL and MBFL techniquesFootnote 21 using all tests in \(S^+_b\), for a total of 35 new experiments. We found that none of the key fault localization effectiveness metrics significantly changed compared to the same experiments using only tests in \(S^0_b\).Footnote 22 This confirms that our test selection strategy does not alter the general effectiveness of fault localization, and hence we adopted it for the rest of the paper’s experiments.

Table 4 shows statistics about the fraction of tests that we selected for our experiments according to the test selection strategy. Those data indicate that test selection has a disproportionate impact on projects that have very large test suites, such as those in the ds category. In these projects, it is often the case that the vast majority of tests are irrelevant for the portion of the project where a failure occurred; therefore, excluding these tests from our experiments is instrumental in drastically bringing down execution times without sacrificing experimental accuracy.

Experimental Setup

Each experiment ran on a node of USI’s HPC cluster,Footnote 23 equipped with a 20-core Intel Xeon E5-2650 processor and 64 GB of DDR4 RAM, accessing a shared 15 TB RAID 10 SAS3 drive, and running CentOS 8.2.2004.x86_64. We provisioned three CPython Virtualenvs with Python v. 3.6, 3.7, and 3.8; our scripts chose a version according to the requirements of each BugsInPy subject. The experiments took more than two CPU-months to complete—not counting the additional time to set up the infrastructure, fix the execution scripts, and repeat any experiments that failed due to incorrect configuration.

This paper’s detailed replication package includes all scripts used to run these experiments, as well as all raw data that we collected by running them. The rest of this section details how we analyzed and summarized the data to answer the various research questions.Footnote 24

Table 4 Tests used in the fault localization experiments with the bugs of Table 2

4.7.1 RQ1. Effectiveness

To answer RQ1 (fault localization effectiveness), we report the \(L@_B1\%\), \(L@_B3\%\), \(L@_B5\%\), and \(L@_B10\%\) percentages, the average generalized \(E_{\textrm{inspect}}\) rank \({\widetilde{\mathcal {I}_{B}}}(L)\), the average exam score \(\mathcal {E}_B(L)\), and the average location list length \(|L_{B}|\) for each technique L among Section 2’s seven standalone fault localization techniques, as well as the same metrics averaged over each of the four fault localization families. These metrics measure the effectiveness of fault localization from different angles. We report these measures for all 135 BugsInPy bugs B selected for our experiments.

To qualitatively summarize the effectiveness comparison between two FL techniques A and B, we consider their counts \(A@1\% \le A@3\% \le A@5\% \le A@10\%\) and \(B@1\% \le B@3\% \le B@5\% \le B@10\%\) and compare them pairwise: \(A@k\%\) vs. \(B@k\%\), for each k among 1, 3, 5, 10 (a sketch of this classification follows the definitions below). We say that:

\(A\! \gg B\)::

A is much more effective than B”, if \(A@k\% > B@k\%\) for all ks, and \(A@k\% - B@k\% \ge 10\) for at least three ks out of four;

\(A > B\)::

A is more effective than B”, if \(A@k\% > B@k\%\) for all ks, and \(A@k\% - B@k\% \ge 5\) for at least one k out of four;

\(A \ge B\)::

A tends to be more effective than B”, if \(A@k\% \ge B@k\%\) for all ks, and \(A@k\% > B@k\%\) for at least three ks out of four;

\(A \simeq B\)::

A is about as effective as B”, if none of \(A \gg B\), \(A > B\), \(A \ge B\), \(B \gg A\), \(B > A\), and \(B \ge A\) holds.
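The sketch below restates this classification in Python, taking the four \(@k\%\) values of A and B as four-element lists; the function names and the string codes for the relations are illustrative choices made here.

```python
# Sketch of the qualitative effectiveness comparison between techniques A and B.

def dominance(a, b):
    """Return '>>', '>', or '>=' if A dominates B per the definitions above, else None."""
    ks = range(4)
    if all(a[k] > b[k] for k in ks) and sum(a[k] - b[k] >= 10 for k in ks) >= 3:
        return ">>"   # A is much more effective than B
    if all(a[k] > b[k] for k in ks) and any(a[k] - b[k] >= 5 for k in ks):
        return ">"    # A is more effective than B
    if all(a[k] >= b[k] for k in ks) and sum(a[k] > b[k] for k in ks) >= 3:
        return ">="   # A tends to be more effective than B
    return None

def compare_effectiveness(a_at, b_at):
    forward, backward = dominance(a_at, b_at), dominance(b_at, a_at)
    if forward:
        return forward
    if backward:
        return {">>": "<<", ">": "<", ">=": "<="}[backward]
    return "~"        # A is about as effective as B

# Example with made-up @1%, @3%, @5%, @10% values:
assert compare_effectiveness([12, 25, 33, 44], [2, 10, 15, 30]) == ">>"
```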

To visually compare the effectiveness of different FL families, we use scatterplots—one for each pair \(F_1, F_2\) of families. The scatterplot comparing \(F_1\) to \(F_2\) displays one point at coordinates (x, y) for each bug b analyzed in our experiments. Coordinate \(x = {\widetilde{\mathcal {I}_{b}}}(F_1)\) is the average generalized \(E_{\textrm{inspect}}\) rank that techniques in family \(F_1\) achieved on b; similarly, \(y = {\widetilde{\mathcal {I}_{b}}}(F_2)\) is the average generalized \(E_{\textrm{inspect}}\) rank that techniques in family \(F_2\) achieved on b. Thus, points lying below the diagonal line \(x = y\) (such that \(x > y\)) correspond to bugs for which family \(F_2\) performed better than family \(F_1\) (remember that a smaller \(E_{\textrm{inspect}}\) value means more effective fault localization); the opposite holds for points lying above the diagonal line. The location of points in the scatterplot relative to the diagonal gives a clear idea of which family performed better in most cases.
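A minimal matplotlib sketch of one such pairwise scatterplot follows; the per-bug inputs (ranks_f1, ranks_f2, categories) and the plot styling are illustrative assumptions, not the code that produced the paper’s figures.

```python
# Sketch of one pairwise scatterplot comparing families F1 and F2.
import matplotlib.pyplot as plt

def pairwise_scatter(ranks_f1, ranks_f2, categories, name_f1, name_f2):
    fig, ax = plt.subplots()
    for cat in sorted(set(categories)):
        xs = [x for x, c in zip(ranks_f1, categories) if c == cat]
        ys = [y for y, c in zip(ranks_f2, categories) if c == cat]
        ax.scatter(xs, ys, label=cat, s=15)                # one color per project category
    lim = max(max(ranks_f1), max(ranks_f2))
    ax.plot([0, lim], [0, lim], linestyle="--", color="gray")  # diagonal x = y
    ax.set_xlabel(f"E_inspect of {name_f1}")
    ax.set_ylabel(f"E_inspect of {name_f2}")
    ax.legend()
    return fig
```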

To analytically compare the effectiveness of different FL families, we report the estimates and the 95% probability intervals of the coefficients \(\alpha _{F}\) in the fitted regression model (9), for each FL family F. If the interval of values lies entirely below zero, it means that family F’s effectiveness tends to be better than the other families on average; if it lies entirely above zero, it means that family F’s effectiveness tends to be worse than the other families; and if it includes zero, it means that there is no consistent association (with above- or below-average effectiveness).

4.7.2 RQ2. Efficiency

To answer RQ2 (fault localization efficiency), we report the average wall-clock running time \(T_B(L)\), for each technique L among Section 2’s seven standalone fault localization techniques, on bugs in B, as well as the same metric averaged over each of the four fault localization families. This basic metric measures how long the various FL techniques take to perform their analysis. We report these measures for all 135 BugsInPy bugs B selected for our experiments.

To qualitatively summarize the efficiency comparison between two FL techniques A and B, we compare pairwise their average running times T(A) and T(B), and say that:

\(A \gg B\)::

“A is much more efficient than B”, if \(T(B) > 10 \cdot T(A)\);

\(A > B\)::

“A is more efficient than B”, if \(T(B) > 1.1 \cdot T(A)\);

\(A \simeq B\)::

A is about as efficient as B”, if none of \(A \gg B\), \(A > B\), \(B \gg A\), and \(B > A\) holds.

To visually compare the efficiency of different FL families, we use scatterplots—one for each pair \(F_1, F_2\) of families. The scatterplot comparing \(F_1\) to \(F_2\) displays one point at coordinates (x, y) for each bug b analyzed in our experiments. Coordinate \(x = T_b(F_1)\) is the average running time of techniques in family \(F_1\) on b; similarly, \(y = T_b(F_2)\) is the average running time of techniques in family \(F_2\) on b. The interpretation of these scatterplots is the same as for those considered in RQ1.

To analytically compare the efficiency of different FL families, we report the estimates and the 95% probability intervals of the coefficients \(\beta _{F}\) in the fitted regression model (9), for each FL family F. The interpretation of the regression coefficients’ intervals is similar to that for RQ1: \(\beta _F\)’s interval lies entirely above zero when F tends to be slower (less efficient) than other families; it lies entirely below zero when F tends to be faster; and it includes zero when there is no consistent association with above- or below-average efficiency.

4.7.3 RQ3. Kinds of Faults and Projects

To answer RQ3 (fault localization behavior for different kinds of faults and projects), we report the same effectiveness metrics considered in RQ1 (\(F@_X1\%\), \(F@_X3\%\), \(F@_X5\%\), and \(F@_X10\%\) percentages, average generalized \(E_{\textrm{inspect}}\) ranks \({\widetilde{\mathcal {I}_{X}}}(F)\), average exam scores \(\mathcal {E}_X(F)\), and average location list length \(|F_{X}|\)), as well as the same efficiency metrics considered in RQ2 (average wall-clock running time \(T_X(F)\)) for each standalone fault localization family F and separately for i) bugs X of different kinds: crashing bugs, predicate bugs, and mutable bugs (see Fig. 4); ii) bugs X from projects of different category: cl, dev, ds, and web (see Section 4.3).

To visually compare the effectiveness and efficiency of fault localization families on bugs from projects of different category, we color the points in the scatterplots used to answer RQ1 and RQ2 according to the bug’s project category.

To analytically compare the effectiveness of different FL families on bugs of different kinds, we report the estimates and the 95% probability intervals of the coefficients \(c_F\), \(p_F\), and \(m_F\) in the fitted regression model (10), for each FL family F. The interpretation of the regression coefficients’ intervals is similar to those considered for RQ1 and RQ2: \(c_F\), \(p_F\), and \(m_F\) characterize the effectiveness of family F respectively on crashing, predicate, and mutable bugs, relative to the average effectiveness of the same family F on other kinds of bugs.

Finally, to understand whether bugs from projects of certain categories are intrinsically harder or easier to localize, we report the estimates and the 95% probability intervals of the coefficients \(\alpha _C\) and \(\beta _C\) in the fitted regression model (9), for each project category C. The interpretation of these regression coefficients’ intervals is the same as for RQ1 and RQ2; for example, if \(\alpha _C\)’s interval is entirely below zero, it means that bugs of projects in category C are easier to localize (higher effectiveness) than the average of bugs in any project. This sets a baseline useful to interpret the other data that answer RQ3.

4.7.4 RQ4. Combining Techniques

To answer RQ4 (the effectiveness of combining FL techniques), we consider two additional fault localization techniques: CombineFL and AvgFL—both combining the information collected by some of Section 2’s standalone techniques from different families.

CombineFL was introduced by Zou et al. (2021); it uses a learning-to-rank model to learn how to combine lists of ranked locations given by different FL techniques. After fitting the model on labeled training data,Footnote 25 one can use it like any other fault localization technique as follows: i) Run any combination of techniques \(L_1, \ldots , L_n\) on a bug b; ii) Feed the ranked location lists output by each technique into the fitted learning-to-rank model; iii) The model’s output is a list \(\ell _1, \ell _2, \ldots \) of locations, which is taken as the FL output of technique CombineFL. We used Zou et al. (2021)’s replication package to run CombineFL on the Python bugs that we analyzed using FauxPy.

To see whether a simpler combination algorithm can still be effective, we introduced the combined FL technique AvgFL, which works as follows: i) Each basic technique \(L_k\) returns a list \(\langle \ell _1^k, s_1^k \rangle \,\ldots \,\langle \ell _{n_k}^k, s_{n_k}^k \rangle \) of locations with normalizedFootnote 26 suspiciousness scores \(0 \le s_j^k \le 1\); ii) AvgFL assigns to location \(\ell _x\) the weighted average \(\sum _k w_k s_x^k\), where k ranges over all FL techniques supported by FauxPy except Tarantula, and \(w_k\) is an integer weight that depends on the FL family of k: 3 for SBFL, 2 for MBFL, and 1 for PS and ST;Footnote 27 iii) The list of locations ranked by their weighted average suspiciousness is taken as the FL output of technique AvgFL.
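The following sketch restates AvgFL’s scoring step in Python; the data structures are illustrative, and it assumes that a location not reported by a technique contributes a normalized score of 0 for that technique. Dividing by the total weight is a constant factor that does not change the ranking.

```python
# Sketch of AvgFL's weighted scoring: SBFL scores weigh 3, MBFL 2, PS and ST 1.
from collections import defaultdict

FAMILY_WEIGHT = {"SBFL": 3, "MBFL": 2, "PS": 1, "ST": 1}

def avg_fl(outputs):
    """outputs: list of (family, {location: normalized_score}) pairs, one per technique."""
    totals, total_weight = defaultdict(float), 0
    for family, scores in outputs:
        w = FAMILY_WEIGHT[family]
        total_weight += w
        for loc, s in scores.items():
            totals[loc] += w * s          # unreported locations implicitly contribute 0
    # Rank locations by weighted average suspiciousness, highest first.
    return sorted(totals, key=lambda loc: totals[loc] / total_weight, reverse=True)
```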

Finally, we answer RQ4 by reporting the same effectiveness metrics considered in RQ1 (the \(L@_B1\)%, \(L@_B3\)%, \(L@_B5\)%, and \(L@_B10\)% percentages, the average generalized \(E_{\textrm{inspect}}\) rank \({\widetilde{\mathcal {I}_{B}}}(L)\), the average exam score \(\mathcal {E}_B(L)\), and the average location list length \(|L_{B}|\)) for techniques CombineFL and AvgFL. Precisely, we consider two variants A and S of CombineFL and of AvgFL, giving a total of four combined fault localization techniques: variants A (\(\textsf {CombineFL}_A\) and \(\textsf {AvgFL}_A\)) use the output of all FL techniques supported by FauxPy except Tarantula—which was not considered in Zou et al. (2021); variants S (\(\textsf {CombineFL}_S\) and \(\textsf {AvgFL}_S\)) only use the Ochiai, DStar, and ST FL techniques (excluding the more time-consuming MBFL and PS families).

4.7.5 RQ5. Granularity

To answer RQ5 (how fault localization effectiveness changes with granularity), we report the same effectiveness metrics considered in RQ1 (the \(L@_B1\), \(L@_B3\), \(L@_B5\), and \(L@_B10\) counts, the average generalized \(E_{\textrm{inspect}}\) rank \({\widetilde{\mathcal {I}_{B}}}(L)\), the average exam score \(\mathcal {E}_B(L)\), and the average location list length \(|L_{B}|\)) for all seven standalone techniques, and for all four combined techniques, but targeting functions and modules as suspicious entities. Following Zou et al. (2021), for function-level and module-level granularities, we define the suspiciousness score of an entity as the maximum suspiciousness score computed for the statements it contains.
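This coarsening step amounts to a simple maximum-aggregation, sketched below over illustrative inputs (a mapping from statements to scores and a mapping from statements to their enclosing function or module).

```python
# Sketch of coarsening statement-level suspiciousness to function or module level:
# an entity's score is the maximum score of the statements it contains.

def coarsen(statement_scores, statement_to_entity):
    entity_scores = {}
    for stmt, score in statement_scores.items():
        entity = statement_to_entity[stmt]
        entity_scores[entity] = max(score, entity_scores.get(entity, float("-inf")))
    # Rank entities by their aggregated suspiciousness, highest first.
    return sorted(entity_scores, key=entity_scores.get, reverse=True)
```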

4.7.6 RQ6. Comparison to Java

To answer RQ6 (comparison between Python and Java), we quantitatively and qualitatively compare the main findings of Zou et al. (2021)—whose empirical study of fault localization in Java was the basis for our Python replication—against our findings for Python.

For the quantitative comparison of effectiveness, we consider the metrics that are available in both studies: the percentage of all bugs each technique localized within the top-1, top-3, top-5, and top-10 positions of its output (L@1%, L@3%, L@5%, and L@10%); and the average exam score. For Python, we consider all 135 BugsInPy bugs we selected for our experiments; the data for Java is about Zou et al.’s experiments on 357 bugs in Defects4J (Just et al. 2014). We consider all standalone techniques that feature in both studies: Ochiai and DStar (SBFL), Metallaxis and Muse (MBFL), predicate switching (PS), and stack-trace fault localization (ST).

We also consider the combined techniques \(\textsf {CombineFL}_A\) and \(\textsf {CombineFL}_S\). The original idea of the CombineFL technique was introduced by Zou et al.; however, the variants used in their experiments combine all eleven FL techniques they consider, some of which we did not include in our replication (see Section 3 for details). Therefore, we modified (Zou et al. 2021)’s replication package to extract from their Java experimental data the rankings obtained by \(\textsf {CombineFL}_A\) and \(\textsf {CombineFL}_S\) combining the same techniques as in Python (see Section 4.7.4). This way, the quantitative comparison between Python and Java involves exactly the same techniques and combinations thereof.

Since we did not re-run Zou et al.’s experiments on the same machines used for our experiments, we cannot compare efficiency quantitatively. In any case, such a comparison between Java and Python would be outside the scope of our studies, since any difference would likely merely reflect the different performance of Java and Python—largely independent of fault localization efficiency.

For the qualitative comparison between Java and Python, we consider the union of all findings presented in this paper or in Zou et al. (2021); we discard all findings from one paper that are outside the scope of the other paper (for example, Java findings about history-based fault localization, a standalone technique that we did not implement for Python; or Python findings about AvgFL, a combined technique that Zou et al. did not implement for Java); for each within-scope finding, we determine whether it is confirmed (there is evidence corroborating it) or refuted (there is evidence against it) for Python and for Java.

5 Experimental Results

This section summarizes the experimental results that answer the research questions detailed in Section 4.7. All results except for Section 5.5’s refer to experiments with statement-level granularity; results in Sections 5.1–5.3 only consider standalone techniques. To keep the discussion focused, we mostly comment on the \(@n\%\) metrics of effectiveness, whereas we only touch upon the exam score, \(E_{\textrm{inspect}}\), and location list length when they complement other results.

Table 5 Effectiveness of standalone fault localization techniques at the statement-level granularity on all 135 selected bugs B

5.1 RQ1. Effectiveness

Family Effectiveness

Among standalone techniques, the SBFL fault localization family achieves the best effectiveness according to several metrics. Table 5 shows that all SBFL techniques have a better average \(E_{\textrm{inspect}}\) rank \(\widetilde{\mathcal {I}}\), and higher percentages of faulty locations in the top-1, top-3, top-5, and top-10 positions, than techniques in every other family. The advantage over MBFL—the second most-effective family—is consistent and conspicuous. According to the same metrics, the MBFL fault localization family achieves clearly better effectiveness than PS and ST. Then, PS tends to do better than ST, but only according to some metrics: PS has better @1%, @3%, @5%, and location list length, whereas ST has better \(E_{\textrm{inspect}}\) and @10%.


Contrary to these general trends, PS achieves the best (lowest) exam score and location list length of all families; and ST is second-best according to these metrics. As Section 5.3 will discuss in more detail, PS and ST are techniques with a narrower scope than SBFL and MBFL: they can perform very well on a subset of bugs, but they fail spectacularly on several others. They also tend to return shorter lists of suspicious locations, which is also conducive to achieving a better exam score: since the exam score is undefined when a technique fails to localize a bug at all (as explained in Section 4.5), the average exam score of ST and, especially, PS is computed over the small set of bugs on which they work fairly well.

Fig. 6 Pairwise visual comparison of four FL families for effectiveness. Each point in the scatterplot at row labeled R and column labeled C has coordinates (x, y), where x is the generalized \(E_{\textrm{inspect}}\) rank \({\widetilde{\mathcal {I}_{b}}}(C)\) of FL techniques in family C and y is the rank \({\widetilde{\mathcal {I}_{b}}}(R)\) of FL techniques in family R on the same bug b. Thus, points below (resp. above) the diagonal line denote bugs on which R had better (resp. worse) \(E_{\textrm{inspect}}\) ranks. Points are colored according to the bug’s project category. The opposite box at row labeled C and column labeled R displays three statistics (correlation, p-value, and effect size, see Section 4.6) quantitatively comparing the same average generalized \(E_{\textrm{inspect}}\) ranks of C and R; negative values of effect size mean that R tends to be better, and positive values mean that C tends to be better. Each bar plot on the diagonal at row F, column F is a histogram of the distribution of \({\widetilde{\mathcal {I}_{b}}}(F)\) for all bugs. Horizontal axes of all diagonal plots have the same \(E_{\textrm{inspect}}\) scale as the bottom-right plot’s (SBFL); their vertical axes have the same 0–100% scale as the top-left plot (MBFL)

Figure 6’s scatterplots confirm SBFL’s general advantage: in each scatterplot involving SBFL, all points are on a straight line corresponding to small (better) \(E_{\textrm{inspect}}\) ranks for SBFL but increasingly large (worse) ranks for the other family. The plots also indicate that MBFL is often better than PS and ST, although there are a few hard bugs for which the latter are just as effective (points on the diagonal line). The PS-vs-ST scatterplot suggests that these two techniques are largely complementary: on several bugs, PS and ST are equally effective (points on the diagonal); on several others, PS is more effective (points above the diagonal); and on others still, ST is more effective (points below the diagonal).

Fig. 7 Point estimates (boxes) and 95% probability intervals (lines) for the regression coefficients of model (9). The scale of the vertical axes is in standard-deviation log-units

Figure 7a confirms these results based on the statistical model (9): the intervals of coefficients \(\alpha _{\text {SBFL}}\) and \(\alpha _{\text {MBFL}}\) are clearly below zero, indicating that SBFL and MBFL have better-than-average effectiveness; conversely, those of coefficients \(\alpha _{\text {PS}}\) and \(\alpha _{\text {ST}}\) are clearly above zero, indicating that PS and ST have worse-than-average effectiveness.

Figure 7a’s estimate of \(\alpha _{\text {SBFL}}\) is below that of \(\alpha _{\text {MBFL}}\), confirming that SBFL is the most effective family overall. The bottom-left plot in Fig. 6 confirms that SBFL’s advantage can be conspicuous but is observed only on a minority of bugs—whereas SBFL and MBFL achieve similar effectiveness on the majority of bugs. In fact, the effect size comparing SBFL and MBFL is \(-0.18\)—weakly in favor of SBFL.


Technique Effectiveness

FL techniques of the same family achieve very similar effectiveness. Table 5 shows nearly identical results for the 3 SBFL techniques Tarantula, Ochiai, and DStar. The plots and statistics in Fig. 8 confirm this: points lie along the diagonal lines in the scatterplots, and \(E_{\textrm{inspect}}\) ranks for the same bugs are strongly correlated and differ by a vanishing effect size.


The 2 MBFL techniques also behave similarly, but not quite as closely as the SBFL ones. Metallaxis has a modest but consistent advantage over Muse according to Table 5. Figure 9 corroborates this observation: the cloud of points in the scatterplot is centered slightly above the diagonal line; the correlation between Muse’s and Metallaxis’s data is medium (not strong); and the effect size suggests that Metallaxis is more effective on around 11% of subjects.

Muse’s lower effectiveness can be traced back to its stricter definition of “mutant killing”, which requires that a failing test becomes passing when run on a mutant (see Section 2.2). As observed elsewhere (Pearson et al. 2017), this requirement may be too demanding for fault localization of real-world bugs, where it is essentially tantamount to generating a mutant that is similar to a patch.

Fig. 8 Pairwise visual comparison of 3 SBFL techniques for effectiveness. The interpretation of the plots is the same as in Fig. 6

Fig. 9 Pairwise visual comparison of 2 MBFL techniques for effectiveness. The interpretation of the plots is the same as in Fig. 6

Table 6 Efficiency of fault localization techniques at the statement-level granularity

5.2 RQ2. Efficiency

As demonstrated in Table 6, the four FL families differ greatly in their efficiency—measured as their wall-clock running time. ST is by far the fastest, taking a mere 2 seconds per bug on average; SBFL is second-fastest, taking around 10 minutes on average; PS is one order of magnitude slower, taking approximately 2.7 hours on average; and MBFL is slower still, taking over 4 hours per bug on average.

Fig. 10 Pairwise visual comparison of four FL families for efficiency. Each point in the scatterplot at row labeled R and column labeled C has coordinates (x, y), where x is the average per-bug wall-clock running time of FL techniques in family C and y is the average per-bug wall-clock running time of FL techniques in family R. Points are colored according to the bug’s project category. The opposite box at row labeled C and column labeled R displays three statistics (correlation, p-value, and effect size, see Section 4.6) quantitatively comparing the same per-bug average running times of C and R; negative values of effect size mean that R tends to be better, and positive values mean that C tends to be better

Figure 10’s scatterplots confirm that ST outperforms all other techniques, and that SBFL is generally second-fastest. It also shows that MBFL and PS have similar overall performance but can be slower or faster on different bugs: a narrow majority of points lies below the diagonal line in the scatterplot (meaning PS is faster than MBFL), but there are also several points that are on the opposite side of the diagonal—and their effect size (0.34) is medium, lower than all other pairwise effect sizes in the comparison of efficiency.


Based on the statistical model (9), Fig. 7a clearly confirms the differences of efficiency: the intervals of coefficients \(\beta _{\text {ST}}\) and \(\beta _{\text {SBFL}}\) are well below zero, indicating that ST and SBFL are faster than average (with ST the fastest, as its estimated \(\beta _{\text {ST}}\) is lower); conversely, the intervals of coefficients \(\beta _{\text {MBFL}}\) and \(\beta _{\text {PS}}\) are entirely above zero, indicating that MBFL and PS stand out as slower than average compared to the other families.

These major differences in efficiency are unsurprising if one remembers that the various FL families differ in what kind of information they collect for localization. ST only needs the stack-trace information, which only requires running the failing tests once; SBFL compares the traces of passing and failing runs, which involves running all tests once. PS dynamically tries out a large number of different branch changes in a program, each of which requires running the failing tests; in our experiments, PS tried 4588 different “switches” on average for each bug—up to a whopping 101 454 switches for project black’s bug #6. MBFL generates hundreds of different mutants of the program under analysis, each of which has to be run against all tests; in our experiments, MBFL generated 461 mutants on average for each bug—up to 2718 mutants for project black’s bug #6. After collecting this information, the additional running time to compute suspiciousness scores (using the formulas presented in Section 2) is negligible for all techniques—which explains why the running times of techniques of the same family are practically indistinguishable.

Table 7 Effectiveness of fault localization families at the statement-level granularity on different kinds of bugs and categories of projects

5.3 RQ3. Kinds of Faults and Projects

Project Category: effectiveness

Figure 7’s intervals of coefficients \(\alpha _{{category}}\) in model (9) indicate that fault localization tends to be more accurate on projects in categories dev and web, and less accurate on projects in categories cl and ds.

This finding is consistent with the observations that data science programs, their bugs, and their fixes are often different compared to traditional programs (Islam et al. 2019, 2020). For instance, bug #38 in project keras is an example of what Islam et al. call “structural data flow” bugs (Islam et al. 2019): its root cause is passing an incorrect input shape setting to a neural network layer. These characteristics also determine long spectra (i.e., execution traces) that span several functions—which are required to construct the various layer objects; as a result, SBFL techniques struggle to effectively localize this bug. Bugs #68 and #137 in project pandas are instead examples of API bugs, whose root causes are incorrect import statements. While such bugs may occur in any kind of project, they are common in data science programs (Islam et al. 2019) due to their complex dependencies. In Python, import statements are usually top-level declarations; therefore, FL techniques that can only target locations inside functions end up being ineffective at localizing these API bugs. As yet another example, the overall mutability of bugs in ds projects is 0.7%, whereas it is 1.3% for bugs in other categories of projects. This indicates that the standard mutation operators, used by MBFL, are a poor fit for the kinds of bugs that are most commonly found in data science projects.


The data in Table 7’s bottom section confirm that SBFL remains the most effective FL family, largely independent of the category of projects it analyzes. MBFL ranks second for effectiveness in every project category; it is not that far from SBFL for projects in categories dev and cl (for example, MBFL and SBFL both localize 9% of cl bugs in the first position; and both localize over 40% of dev bugs in the top-10 positions). In contrast, SBFL’s advantage over MBFL is more conspicuous for projects in categories ds and web. Given that bugs in category cl are generally harder to localize, this suggests that the characteristics of bugs in these projects are a good fit for MBFL. As we have seen in Section 5.2, MBFL is the slowest FL family by far; since it reruns the available tests hundreds, or even thousands, of times, projects with a large number of tests are near impossible to analyze efficiently with MBFL. As we discuss below, MBFL is considerably faster on projects in category cl than on projects in other categories; this is probably the main reason why MBFL is also more effective on these projects: it simply generates a more manageable number of mutants, which sharpens the dynamic analysis.


Figure 6’s plots confirm some of these trends. In most plots, we see that the points positioned far apart from the diagonal line correspond to projects in the cl and ds categories, confirming that these “harder” bugs exacerbate the different effectiveness of the various FL families.

Project Category: efficiency

Figure 7’s intervals of coefficients \(\beta _{{category}}\) in model (9) indicate that fault localization tends to be more efficient (i.e., faster) on projects in category cl, and less efficient (i.e., slower) on projects in category ds (\(\beta _{\textsc {ds}}\) barely touches zero). In contrast, projects in categories dev and web do not have a consistent association with faster or slower fault localization. Table 2 shows that projects in category ds have the largest number of tests by far (mostly because of outlier project pandas); furthermore, some of their tests involve training and testing different machine learning models, or other kinds of time-consuming tasks. Since FL invariably requires running tests, this explains why bugs in ds projects tend to take longer to localize.


The data in Table 6’s right-hand side generally confirm the same rankings of efficiency among FL families, largely regardless of what category of projects we consider: ST is by far the most efficient, followed by SBFL, and then—at a distance—PS and MBFL. The difference of performance between SBFL and ST is largest for projects in category ds (three orders of magnitude), large for projects in category web (two orders of magnitude), and more moderate for projects in categories cl and dev (one order of magnitude). PS is slower than MBFL only for projects in category dev, although their absolute difference of running times is not very big (around 7.5%); in contrast, it is one order of magnitude faster for projects in categories cl and web.


In most of Fig. 10’s plots, we see that the points most frequently positioned far apart from the diagonal line correspond to projects in category ds, confirming that these bugs take longer to analyze and aggravate performance differences among techniques. In the scatterplot comparing MBFL to PS, points corresponding to projects in categories web and cl are mostly below the diagonal line, which corroborates the advantage of PS over MBFL for bugs of projects in these two categories.

Fig. 11 Point estimates (boxes) and 95% probability intervals (lines) for the regression coefficients of model (10). The scale of the vertical axes is in standard-deviation log-units

Crashing Bugs: effectiveness

According to Fig. 11a, both FL families ST and MBFL are more effective on crashing bugs than on other kinds of bugs. Still, their absolute effectiveness on crashing bugs remains limited compared to SBFL’s, as shown by the results in Table 7’s middle part; for example, \(@_{\textsc {crashing}}10\%\) is 37% for ST, 34% for MBFL, and 53% for SBFL, and ST localizes zero (crashing) bugs in the top rank. Remember that ST assigns the same suspiciousness to all statements within the same function (see Section 2.4); thus, it cannot be as accurate as SBFL even on the minority of crashing bugs.


On the other hand, PS is less effective on crashing bugs than on other kinds of bugs; in fact, it localizes zero crashing bugs among the top-10 ranks. PS has a chance to work only if it can find a so-called critical predicate (see Section 2.3); only three of the crashing bugs included critical predicates, and hence PS was largely ineffective on them.


Predicate-related Bugs: effectiveness

Figure 11b says that no FL family achieves consistently better or worse effectiveness on predicate-related bugs. Table 7 complements this observation; the ranking of families by effectiveness is different for predicate-related bugs than it is for all bugs: MBFL is about as effective as SBFL, whereas PS is clearly more effective than ST.


This outcome is somewhat unexpected for PS: predicate-related bugs are bugs whose ground truth includes at least a branching predicate (see Section 4.3), and yet PS is still clearly less effective than SBFL or MBFL. Indeed, the presence of a faulty predicate is not sufficient for PS to work: the predicate must also be critical, which means that flipping its value turns a failing test into a passing one. When a program has no critical predicates, PS simply returns an empty list of locations. In contrast, when a program has a critical predicate, PS is highly effective: \(\text {PS}@_{\chi }1\% = 14\%\), \(\text {PS}@_{\chi }3\% = 24\%\), and \(\text {PS}@_{\chi }5\% = 31\%\) for PS on the 29 bugs \(\chi \) with a critical predicate—even better than SBFL’s results for the same bugs (\(\text {SBFL}@_{\chi }1\% = 13\%\), \(\text {SBFL}@_{\chi }3\% = 16\%\), and \(\text {SBFL}@_{\chi }5\% = 20\%\)). In all, PS is a highly specialized FL technique, which works quite well for a narrow category of bugs, but is inapplicable in many other cases.


Mutable Bugs: effectiveness

According to Fig. 11c, FL family MBFL tends to be more effective on mutable bugs than on other kinds of bugs: \(m_{\text {MBFL}}\)’s 95% probability interval is mostly below zero (and the 87% probability interval would be entirely below zero). Furthermore, Table 7 shows that MBFL is the most effective technique on mutable bugs, where it tends to outperform even SBFL. Intuitively, a bug is mutable if the syntactic mutation operators used for MBFL “match” the fault in a way that affects program behavior. Thus, the capabilities of MBFL ultimately depend on the nature of the faults it analyzes and on the selection of mutation operators it employs.


Figure 11c also suggests that PS and ST are less effective on mutable bugs than on other kinds of bugs. Possibly, this is because mutable bugs tend to be more complex, “semantic” bugs, whereas ST works well only for “simple” crashing bugs, and PS is highly specialized to work on a narrow group of bugs.


Bug Kind: efficiency

Table 6 does not suggest any consistent changes in the efficiency of FL families when they work on crashing, predicate-related, or mutable bugs—as opposed to all bugs. In other words, for every kind of bug: ST is orders of magnitude faster than SBFL, which is one order of magnitude faster than PS, which is 14–37% faster than MBFL. As discussed above, the kind of information that a FL technique collects is the main determinant of its overall efficiency; in contrast, different kinds of bugs do not seem to have any significant impact.

Table 8 Effectiveness and efficiency of fault localization techniques AvgFL and CombineFL at the statement-level granularity on all 135 selected bugs B

5.4 RQ4. Combining Techniques

Effectiveness

Table 8 clearly indicates that the combined FL techniques AvgFL and CombineFL achieve high effectiveness—especially according to the fundamental \(@n\%\) metrics. \(\textsf {CombineFL}_A\) and \(\textsf {AvgFL}_A\), which combine the information from all other FL techniques, beat every other technique. For example, \(\textsf {AvgFL}_A\) localizes 18% of all bugs in the top position and \(\textsf {CombineFL}_A\) 20%, whereas the next-best family, SBFL, localizes 12% of all bugs in the top position (Table 5). \(\textsf {CombineFL}_S\) and \(\textsf {AvgFL}_S\), which combine the information from only SBFL and ST techniques, do at least as well as every other standalone technique.


While \(\textsf {CombineFL}_A\) is strictly more effective than \(\textsf {AvgFL}_A\), their difference is usually modest (at most three percentage points). Similarly, the difference between \(\textsf {CombineFL}_S\), \(\textsf {AvgFL}_S\), and SBFL is generally limited; however, SBFL tends to be less effective than \(\textsf {AvgFL}_S\), whereas \(\textsf {CombineFL}_S\) is never strictly more effective than \(\textsf {AvgFL}_S\). In all, AvgFL is a simpler approach to combining techniques than CombineFL, but both are quite successful at boosting FL effectiveness.


The suspicious location list length is exactly the same for AvgFL and CombineFL, and higher than for every other technique. This is simply because all variants of AvgFL and CombineFL consider a location as suspicious if and only if any of the techniques they combine considers it so. Therefore, they end up with long location lists—at least as long as any combined technique’s.

Efficiency

The running time of AvgFL and CombineFL is essentially just the sum of running times of the FL families they combine, because merging the output list of locations and training CombineFL’s machine learning model take negligible time. This makes \(\textsf {AvgFL}_A\) and \(\textsf {CombineFL}_A\) the least efficient FL techniques in our experiments; and \(\textsf {AvgFL}_S\) and \(\textsf {CombineFL}_S\) barely slower than SBFL.


Combining these results with those about effectiveness, we conclude that \(\textsf {AvgFL}_A\) and \(\textsf {CombineFL}_A\) favor effectiveness at the expense of efficiency; whereas \(\textsf {AvgFL}_S\) and \(\textsf {CombineFL}_S\) promise a modest improvement in effectiveness in exchange for a modest performance loss.

Table 9 Effectiveness of fault localization techniques at the function-level granularity on all 135 selected bugs B
Table 10 Effectiveness of fault localization techniques at the module-level granularity on all 135 selected bugs B
Table 11 Effectiveness of fault localization techniques in Python and Java

5.5 RQ5. Granularity

Function-level Granularity

Table 9’s data about function-level effectiveness of the various FL techniques and families lead to very similar high-level conclusions as for statement-level effectiveness: the combined techniques \(\textsf {CombineFL}_A\) and \(\textsf {AvgFL}_A\) achieve the best effectiveness, followed by \(\textsf {CombineFL}_S\) and \(\textsf {AvgFL}_S\), then SBFL, and finally MBFL; differences among techniques in the same family are modest (often negligible).

ST is the only technique whose relative effectiveness changes considerably from statement-level to function-level: ST is the least effective at the level of statements, but becomes considerably better than PS at the level of functions. This change is no surprise, as ST is precisely geared towards localizing functions responsible for crashes—and cannot distinguish among statements belonging to the same function. ST’s overall effectiveness remains limited, since the technique is simple and can only work on crashing bugs.

Module-level Granularity

Table 10 leads to the same conclusions for module-level granularity: the relative effectiveness of the various techniques is very similar as for statement-level granularity, except that ST gains effectiveness simply because it is designed for coarser granularities.


Comparisons Between Granularities

It is apparent that fault localization’s absolute effectiveness strictly increases as we target coarser granularities—from statements, to functions, to modules. This happens simply because the number of entities at a coarser granularity is considerably less than the number of entities at a finer granularity: each function consists of several statements, and each module consists of several functions. Therefore, it does not make sense to directly compare the same effectiveness metric measured at two different granularity levels, since each granularity level refers to different entities—and inspecting different entities involves incomparable effort.

We do not discuss efficiency (i.e., running time) in relation to granularity: the running time of our fault localization techniques does not depend on the chosen level of granularity, which only affects how the collected information is combined (see Section 2).

5.6 RQ6. Comparison to Java

Table 11 collects the main quantitative results for Python fault localization effectiveness that we presented in detail in previous parts of the paper, and displays them next to the corresponding results for Java. The results are selected so that they can be directly compared: they exclude any technique (e.g., Tarantula) or family (e.g., history-based fault localization) that was not experimented within both our paper and (Zou et al. 2021); and the rows about CombineFL were computed using (Zou et al. 2021)’s replication package so that they combine exactly the same techniques (DStar, Ochiai, Metallaxis, Muse, PS, and ST for \(\textsf {CombineFL}_A\); and DStar, Ochiai, and ST for \(\textsf {CombineFL}_S\)).

Table 12 A comparison of findings about fault localization in Python vs. Java

Then, Table 12 lists all claims about fault localization made in our paper or in Zou et al. (2021) that are within the scope of both papers, and shows which were confirmed or refuted for Python and for Java. Most of the findings (25/28) were confirmed consistently for both Python and Java. Thus, the big picture about the effectiveness and efficiency of fault localization is the same for Python programs and bugs as it is for Java programs and bugs.

There are, however, a few interesting discrepancies; we discuss possible explanations for them below. The most marked difference concerns the effectiveness of ST, which was mediocre on Python programs but competitive on Java programs (row 3 in Table 12). We think the main reason for this difference is that there were more Java experimental subjects that were an ideal target for ST: 20 out of the 357 Defects4J bugs used in Zou et al. (2021)’s experiments consisted of short failing methods whose programmer-written fixes entirely replaced or removed the method body.Footnote 28 In these cases, the ground truth consists of all locations within the method; thus, ST easily ranks the fault location at the top by simply reporting all lines of the crashing method with the same suspiciousness. As a result, Table 11 shows that ST was consistently more effective than PS in the Java experiments—whereas there was no consistent difference between ST and PS in our Python experiments. For the same reason, the difference between Java and Python is even more evident on crashing bugs: ST outperformed all other techniques on such bugs in Java but not in Python (row 19 in Table 12). We still confirmed that ST works better on crashing bugs than on other kinds of bugs in Python as well, but the nature of our experimental subjects did not allow ST to reach an overall competitive effectiveness on crashing bugs.

Other findings about MBFL were different in Python compared to Java, but the differences were more nuanced in this case. In particular, Zou et al. found that the correlation between the effectiveness of SBFL and MBFL techniques is negligible, whereas we found a medium correlation (\(\tau = 0.54\)). It is plausible that the discrepancy (reflected in Table 12’s row 23) is simply a result of several details of how this correlation was measured: we use Kendall’s \(\tau \), they use the coefficient of determination \(r^2\); we use a generalized \(E_{\textrm{inspect}}\) measure \(\widetilde{\mathcal {I}}\) that applies to all bugs, they exclude experiments where a technique completely fails to localize the bug (\(\mathcal {I}\)); we compare the average effectiveness of SBFL vs. MBFL techniques, they pairwise compare individual SBFL and MBFL techniques. Even if the correlation patterns were actually different between Python and Java, this would still have limited practical consequences: MBFL and SBFL techniques still have clearly different characteristics, and hence they remain largely complementary. The same analysis applies to the other correlation discrepancy (reflected in Table 12’s row 25): in Python, we found a medium correlation between the effectiveness of the Metallaxis and Muse MBFL techniques (\(\tau = 0.62\)); in Java, Zou et al. found negligible correlation.
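As an illustration of the measurement differences just listed, the following sketch contrasts the two correlation statistics on the same (hypothetical) per-bug effectiveness data: Kendall’s \(\tau \), which we use, and a squared Pearson correlation as an \(r^2\)-style measure. The vectors stand in for per-bug effectiveness values of the SBFL and MBFL families and are invented for this example.

```python
# Contrast of two correlation measures on hypothetical per-bug effectiveness
# data (smaller values = better localization). Not the study's actual data.
from scipy.stats import kendalltau, pearsonr

sbfl = [3, 10, 1, 25, 7, 120, 4]    # hypothetical per-bug ranks for SBFL
mbfl = [5, 14, 2, 60, 6, 200, 9]    # hypothetical per-bug ranks for MBFL

tau, p_tau = kendalltau(sbfl, mbfl)
r2 = pearsonr(sbfl, mbfl)[0] ** 2   # an r^2-style coefficient of determination

print(f"Kendall tau = {tau:.2f} (p = {p_tau:.3f}); r^2 = {r2:.2f}")
```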

Table 13 A comparison of MBFL’s and SBFL’s effectiveness on Python and Java predicate-related bugs

Finally, a clarification about the finding that “On predicate-related bugs, MBFL is about as effective as SBFL”, which Table 12 reports as confirmed for both Python and Java. This claim hinges on the definition of “about as effective”, which we rigorously introduced in Section 4.7.1. To clarify the comparison, Table 13 displays the Python and Java data about the effectiveness of MBFL and SBFL on predicate-related bugs. On Python predicate-related bugs (left part of Table 13), MBFL achieves better \(@3\%\), \(@5\%\), and \(@10\%\) than SBFL but a worse \(@1\%\) (by only one percentage point); similarly, on Java predicate-related bugs (right part of Table 13), MBFL achieves better \(@1\%\), \(@3\%\), and \(@5\%\) than SBFL but a worse \(@10\%\) (by three percentage points). In both cases, MBFL is not strictly better than SBFL, but one could argue that a clear tendency exists. Regardless of the precise definition of “more effective” (which is somewhat arbitrary), the conclusions we can draw are very similar for Python and Java.
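For clarity, here is a minimal sketch of the kind of \(@n\%\) accounting behind Table 13, under the assumption that \(@n\%\) counts the percentage of bugs whose fault is ranked within the top \(n\%\) of program entities (i.e., whose exam score is at most \(n\%\)); the per-bug exam scores are hypothetical.

```python
# Hedged sketch of an @n% metric: the fraction of bugs whose exam score
# (rank of the fault divided by the number of ranked entities) is <= n%.
def at_n_percent(exam_scores, n):
    return sum(1 for e in exam_scores if 100 * e <= n) / len(exam_scores)

exam_scores = [0.004, 0.02, 0.08, 0.15, 0.006]   # hypothetical per-bug exam scores
for n in (1, 3, 5, 10):
    print(f"@{n}% = {at_n_percent(exam_scores, n):.0%}")
```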


5.7 Threats to Validity

Construct Validity

refers to whether the experimental metrics adequately operationalize the quantities of interest. Since we generally used widely adopted and well-understood metrics of effectiveness and efficiency, threats of this kind are limited.

The metrics of effectiveness are all based on the assumption that users of a fault localization technique process its output list of program entities in the order in which the technique ranked them. This model has been criticized as unrealistic (Parnin and Orso 2011b); nevertheless, these metrics remain the standard for fault localization studies, and hence are at least adequate to compare the capabilities of different techniques on different programs.
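As a concrete rendering of this sequential-inspection model, the following sketch computes the expected number of entities inspected before reaching a single faulty one, breaking ties by their expected position; this follows a standard formulation of \(E_{\textrm{inspect}}\)-style metrics, and the suspiciousness values are hypothetical.

```python
# Sketch of the sequential-inspection model: entities are examined in
# decreasing order of suspiciousness, and ties are broken randomly, so the
# expected rank of a faulty entity is the number of strictly more suspicious
# entities plus its expected position within its tie group. Hypothetical data.
def expected_inspections(scores, faulty_entity):
    s = scores[faulty_entity]
    higher = sum(1 for v in scores.values() if v > s)
    tied = sum(1 for v in scores.values() if v == s)
    return higher + (tied + 1) / 2

scores = {"a": 0.9, "b": 0.9, "c": 0.5, "d": 0.1}   # hypothetical ranking
print(expected_inspections(scores, "b"))            # 1.5: tied with "a" at the top
```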

Using BugsInPy’s curated collection of Python bugs helps reduce the risks involved with our selection of subjects; as we detail in Section 4.1, we did not blindly reuse BugsInPy’s bugs but first verified which ones we could reliably reproduce on our machines.

Internal Validity

can be threatened by factors such as implementation bugs or inadequate statistics, which may jeopardize the reliability of our findings. We implemented the tool FauxPy to enable large-scale experimentation with Python fault localization; we applied the usual best practices of software development (testing, incremental development, refactoring to improve performance and design, and so on) to reduce the chance that it contains fundamental bugs that affect our overall experimental results. To make it a robust and scalable tool, FauxPy’s implementation relies on external libraries for tasks (such as coverage collection and mutant generation) for which high-quality open-source implementations are available.

The scripts that we used to process and summarize the experimental results may also include mistakes; we checked the scripts several times, and validated the consistency between different data representations.

We did our best to validate the test-selection process (described in Section 4.7), which was necessary to make the experiments with the largest projects feasible; in particular, we ran fault localization experiments on about 30 bugs without test selection, and checked that the results did not change after we applied test selection.

Our statistical analysis (Section 4.6) follows best practices (Furia et al. 2022), including validations and comparisons of the chosen statistical models (detailed in the replication package). To further help future replications and internal validity, we make available all our experimental artifacts and data in a detailed replication package.

External Validity

is about the generalizability of our findings. Using bugs from real-world open-source projects substantially mitigates the threat that our findings do not apply to realistic scenarios. Specifically, we analyzed 135 bugs in 13 projects from the curated BugsInPy collection, which ensures a variety of bugs and project types.

As usual, we cannot make strong claims that our findings generalize to different application scenarios, or to different programming languages. Nevertheless, our study successfully confirmed a number of findings about fault localization in Java (Zou et al. 2021) (see Section 5.6), which further mitigates any major threats to external validity.

Zou et al.’s study used the Defects4J (Just et al. 2014) curated collection of real-world Java faults as their experimental subjects; we used the BugsInPy (Widyasari et al. 2020) curated collection of real-world Python faults. This invariably limits the generalizability of our findings to all Python programs, and the generalizability of our comparison to all Python vs. Java programs: the two curated collections of bugs may not represent all programs and faults in Python or Java. While there is always a risk that any selection of experimental subjects is not fully representative of the whole population, choosing standard well-known benchmarks such as Defects4J and BugsInPy helps mitigate this threat. First, BugsInPy was explicitly inspired by Defects4J, and was built following a very similar approach but applied to real-world open-source Python programs. Second, BugsInPy projects were “selected as they represent the diverse domains [...] that Python is used for” (Widyasari et al. 2020, Section 1), which bodes well for generalizability. Third, BugsInPy and Defects4J are extensible frameworks, which have been and will be extended with new projects and bugs; thus, using them as the basis of FL studies helps to make future research in this area comparable to previous results. While BugsInPy and Defects4J are only imperfect proxies for a fully general comparison of FL in Java and Python, they are a sensible basis given the current state of the art.

6 Conclusions

This paper described an extensive empirical study of fault localization in Python, based on a differentiated conceptual replication of Zou et al.’s recent Java empirical study (Zou et al. 2021). Besides replicating for Python several of their results for Java, we shed light on some nuances, and released detailed experimental data that can support further replications and analyses.

As a concluding discussion, let’s highlight a few points relevant for possible follow-up work. Section 6.1 discusses a different angle for a comparison with other studies, suggested by Widyasari et al.’s recent work (Widyasari et al. 2022). Section 6.2 describes broader ideas to improve the capabilities of fault localization in Python.

Table 14 A summary of some data presented in Widyasari et al.’s fault localization study (Widyasari et al. 2022) vis-à-vis analogous data presented in this paper

6.1 Other Fault Localization Studies

As we discussed in Section 3, Widyasari et al.’s recent work (Widyasari et al. 2022) is the only other large-scale study targeting fault localization in real-world Python projects. We also explained how our study’s goals and methodology are quite different from theirs; as a result, we cannot directly compare most of their findings to ours. Now that we have presented our results in detail, we are in a better position to discuss how Widyasari et al.’s methodology suggests future work that complements our own.

Widyasari et al. directly compare FL effectiveness metrics (such as exam score) between their experiments on Python subjects from BugsInPy and Pearson et al.’s experiments on Java subjects from Defects4J (Pearson et al. 2017). Table 14a displays the key results of their comparison, alongside a roughly similar comparison between our experiments on Python subjects from BugsInPy and Zou et al.’s experiments on Java subjects from Defects4J (Zou et al. 2021). The picture that emerges from these comparisons is somewhat inconclusive: in our comparison, there is a significant difference, with large effect size, between Python and Java with respect to exam scores, but not with respect to the \(E_{\textrm{inspect}}\) metric; conversely, in their comparison, there is a significant difference, with large/medium effect size, between Python and Java with respect to the top-k ranks in the best-case debugging scenarios (roughly analogous to the \(E_{\textrm{inspect}}\) ranking metric), whereas the differences with respect to exam scores are significant but with small effect sizes. Furthermore, the sign of the effect sizes is opposite: in our comparison, fault localization is more effective on Python programs (negative effect sizes); in their comparison, it is more effective on Java programs (positive effect sizes). It is plausible to surmise that these inconsistencies reflect differences between the effectiveness metrics, how they are measured in each study, and, most importantly, differences between the experimental subjects; the exam score metric, in particular, also depends on the size of the programs under analysis. As we discussed in Section 5.7, even though the BugsInPy and Defects4J benchmarks are both carefully curated and of significant size, there is a risk that they do not represent all Python and Java real-world projects and their faults. This suggests that follow-up studies targeting different projects in Python and Java (or different selections of projects from BugsInPy and Defects4J) could help validate the generalizability of any results. Conversely, applying stricter project and bug selection criteria could also be useful not to generalize findings, but to strengthen their validity in more specific settings (for example, projects with certain characteristics). Without stricter experimental controls, directly comparing fault localization effectiveness metrics on disparate programs in two different programming languages, as we did in Table 14a for the sake of illustration, is unlikely to lead to clear-cut, robust findings.
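For reference, one plausible way to carry out this kind of cross-benchmark comparison (not necessarily the exact statistical setup of either study) is a Mann-Whitney U test combined with a non-parametric effect size such as Cliff’s delta; the two samples of exam scores below are invented for illustration.

```python
# Illustrative cross-benchmark comparison: Mann-Whitney U test plus Cliff's
# delta as a non-parametric effect size. The samples are hypothetical and do
# not correspond to the actual data of either study.
from scipy.stats import mannwhitneyu

def cliffs_delta(xs, ys):
    # Probability that a value from xs exceeds one from ys, minus the converse.
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

python_exam = [0.01, 0.03, 0.02, 0.20, 0.05]   # hypothetical exam scores (Python)
java_exam = [0.02, 0.06, 0.04, 0.10, 0.09]     # hypothetical exam scores (Java)

stat, p = mannwhitneyu(python_exam, java_exam, alternative="two-sided")
print(f"U = {stat}, p = {p:.3f}, delta = {cliffs_delta(python_exam, java_exam):.2f}")
```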

Even though Widyasari et al.’s study found some statistically significant differences in effectiveness between SBFL techniques, those differences tend to be modest or insignificant. As shown in Table 14b, this is largely consistent with our findings: even though we found some weakly statistically significant differences between SBFL techniques (between DStar and Tarantula for \(p < 0.1\), and between Ochiai and Tarantula for \(p < 0.06\)), these have little practical consequence, as the effect sizes of the differences are vanishingly small.

Our study did not consider two dimensions of analysis that play an important role in Widyasari et al.’s study: different debugging scenarios, and a classification of faults according to their syntactic characteristics. Debugging scenarios determine when we classify a fault as localized if it affects multiple lines. In our paper, we only considered the “best-case” scenario: as long as any of the ground-truth locations is localized, we consider the fault localized. Widyasari et al. also consider other scenarios, such as the worst-case scenario (all ground-truth locations must be localized). While their findings did not change significantly under different debugging scenarios, investigating the robustness of our own empirical findings in different scenarios remains a viable direction for future work.
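The difference between these debugging scenarios can be summarized in a few lines; the sketch below assumes that a fault counts as localized within the top-k entities if any ground-truth location appears among them (best case) or if all of them do (worst case), with a hypothetical ranking and ground truth.

```python
# Hedged sketch of best-case vs. worst-case debugging scenarios for a fault
# with multiple ground-truth locations; ranking and ground truth are hypothetical.
def localized_within(ranking, ground_truth, k, scenario="best"):
    top_k = set(ranking[:k])
    if scenario == "best":
        return bool(top_k & set(ground_truth))    # any faulty location suffices
    return set(ground_truth) <= top_k             # all faulty locations required

ranking = ["s3", "s7", "s1", "s9"]   # entities in decreasing suspiciousness
print(localized_within(ranking, {"s7", "s9"}, 3, "best"))    # True
print(localized_within(ranking, {"s7", "s9"}, 3, "worst"))   # False
```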

6.2 Future Work

One of the dimensions of analysis that we included in our empirical study was the classification of projects (and their bugs) into categories, which led to the finding that faults in data science projects tend to be harder and take longer to localize. This is not a surprising finding if we consider the sheer size of some of these projects (and of their test suites). However, it also highlights an important category of projects that are much more popular in Python than in more “traditional” languages like Java. In fact, much of Python’s surge in popularity over the last decade has been connected to its widespread use for statistics, data analysis, and machine learning. Furthermore, there is growing evidence that these applications have distinctive characteristics, especially when it comes to faults (Islam et al. 2019; Humbatova et al. 2020; Rezaalipour and Furia 2023). Thus, investigating how fault localization can be made more effective for certain categories of projects is an interesting direction for future work (we briefly discussed related efforts in Section 3).

It is remarkable that SBFL techniques, proposed nearly two decades ago (Jones and Harrold 2005), remain formidable in terms of both effectiveness and efficiency. As we discussed in Section 3, MBFL was introduced expressly to overcome some limitations of SBFL. In our experiments (similarly to the Java experiments of Zou et al. (2021)), MBFL performed generally well but not always on par with SBFL; furthermore, MBFL is much more expensive to run than SBFL, which may put its practical applicability into question. Our empirical analysis of “mutable” bugs (Section 5.3) indicated that MBFL usually loses to SBFL when its mutation operators are not applicable to the faulty statements (which happened for nearly half of the bugs in our experiments); in these cases, mutation analysis does not bring relevant information about the faulty parts of the program. These observations raise the question of whether it is possible to predict the effectiveness of MBFL from preliminary information about a failure, and whether one can develop new mutation operators that extend the practical capabilities of MBFL to new kinds of bugs.
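To illustrate the kind of lightweight prediction this question calls for, the sketch below checks whether a faulty statement contains syntax that common mutation operators (for example, arithmetic- or comparison-operator replacement) could target; this is a deliberately rough, hypothetical proxy for “mutability”, not FauxPy’s actual mutant-generation logic.

```python
# Hypothetical, simplified check of whether common mutation operators
# (arithmetic-, comparison-, or boolean-operator replacement) would apply to
# a faulty statement -- a rough proxy for the notion of "mutable" bugs.
import ast

MUTABLE_NODES = (ast.BinOp, ast.Compare, ast.BoolOp, ast.UnaryOp)

def has_applicable_operator(statement_source):
    try:
        tree = ast.parse(statement_source)
    except SyntaxError:
        return False
    return any(isinstance(node, MUTABLE_NODES) for node in ast.walk(tree))

print(has_applicable_operator("total = price * (1 + rate)"))  # True
print(has_applicable_operator("logger.info('done')"))         # False
```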

More generally, one could try to relate the various kinds of source-code edits (add, remove, modify) (Sobreira et al. 2018) introduced to fix a fault to the effectiveness of different fault localization algorithms. We leave answering these questions to future research in this area.