How bugs are born: a model to identify how bugs are introduced in software components

Rodríguez-Pérez, Gema; Robles, Gregorio; Serebrenik, Alexander; Zaidman, Andy; Germán, Daniel M.; Gonzalez-Barahona, Jesus M.

doi:10.1007/s10664-019-09781-y

Open access
Published: 04 February 2020

Volume 25, pages 1294–1340, (2020)
Empirical Software Engineering

Abstract

When identifying the origin of software bugs, many studies assume that “a bug was introduced by the lines of code that were modified to fix it”. However, this assumption does not always hold and, at least in some cases, these modified lines are not responsible for introducing the bug, for example when the bug was caused by a change in an external API. The lack of empirical evidence makes it impossible to assess how important these cases are and, therefore, to what extent the assumption is valid. To advance in this direction, and to better understand how bugs “are born”, we propose a model for defining criteria to identify the first snapshot of an evolving software system that exhibits a bug. This model, based on the perfect test idea, decides whether a bug is observed after a change to the software. Furthermore, we studied the model’s criteria by carefully analyzing how 116 bugs were introduced in two different open source software projects. The manual analysis helped classify the root cause of those bugs and created manually curated datasets with bug-introducing changes and with bugs that were not introduced by any change in the source code. Finally, we used these datasets to evaluate the performance of four existing SZZ-based algorithms for detecting bug-introducing changes. We found that SZZ-based algorithms are not very accurate, especially when multiple commits are found; the F-Score varies from 0.44 to 0.77, while the percentage of true positives does not exceed 63%. Our results show empirical evidence that the prevalent assumption, “a bug was introduced by the lines of code that were modified to fix it”, is just one case of how bugs are introduced in a software system. Finding what introduced a bug is not trivial: bugs can be introduced by the developers and be in the code, or be created irrespective of the code. Thus, further research towards a better understanding of the origin of bugs in software projects could help to improve the design of integration tests and to design other procedures to make software development more robust.

1 Introduction

During the life of a software product, developers often fix bugs^{Footnote 1} (Pan et al. 2009; Murphy-Hill et al. 2015). Research has shown that developers spend half of their time fixing bugs, while they devote only about 36% to adding features (the rest goes to making code more maintainable) (LaToza et al. 2006). Fixing a bug consists of determining why software is behaving erroneously, and subsequently correcting the part of the component that causes that erroneous behavior (Zeller 2009; Beller et al. 2018; Beller et al. 2015; Ebert et al. 2015). A developer fixing a bug produces a change to the source code, which can be identified unambiguously as the bug-fixing change (BFC). However, identifying what change(s) introduced the bug has proven to be a more difficult task (da Costa et al. 2017; Rodríguez-Pérez et al. 2018a).

Nonetheless, identifying the changes that introduced bugs would enable researchers and practitioners to (1) discover bug introduction patterns which could be used to develop techniques to avoid changes introducing bugs (Hassan 2009; Hassan and Holt 2005; Kim et al. 2007); (2) identify who was responsible for introducing the bug for the sake of self-learning and peer-assessment (Izquierdo-Cortazar et al. 2011; da Costa et al. 2014; Ell 2013); or (3) understand how long the bug has been present in the code (e.g., to infer how many released versions have been affected or how effective the project testing/verification strategy is (Rodriguez-Perez et al. 2017; Chen et al. 2014; Weiss et al. 2007)). For these, among other reasons, identifying what changes introduced bugs has been a very active area of research over the last decade (Abreu and Premraj 2009; Aranda and Venolia 2009; da Costa et al. 2017).

The vast majority of this research is based on the assumption that a bug was introduced by the lines of code that were modified to fix it (Śliwerski et al. 2005; Kim et al. 2006; Williams and Spacco 2008). Although the literature frequently uses this assumption, there is not enough empirical evidence supporting it. Indeed, recent studies have demonstrated that well-known algorithms based on this assumption (such as the approach proposed by Sliwerski, Zimmermann, and Zeller (SZZ) (2005)) tend to incorrectly identify the bug-introducing changes (BICs) (da Costa et al. 2017; Rodríguez-Pérez et al. 2018a). For some bugs an explicit change introducing it does not even exist; the system behaves incorrectly due to changes that are external to the system (German et al. 2009; Rodríguez-Pérez et al. 2018b).

In this work we focus on analyzing how bugs were introduced in a software component and, in doing so, we evaluate whether the aforementioned assumption holds.

For a major part, this work has been possible because in modern software development the history of a software product is typically recorded in a source code management (SCM) system, which enables researchers to retrieve and trace all changes to its source code, and understand the reasons why a change fixed a bug.

We selected two open source projects, Nova and ElasticSearch, as exploratory case studies to understand and locate, whenever possible, what change(s) introduced bugs and their characteristics. We analyze those cases in which a BFC in the SCM of Nova and ElasticSearch can be associated with a bug. To accomplish this task, we identify bugs in the system using the issue tracking system (ITS); bugs that were fixed directly in the source code without an entry in the ITS (Aranda and Venolia 2009) are outside the scope of this research. The ITS links directly to the change (commit) that fixed the bug (its BFC). Using this information, we navigate back through the history of the source code to identify the origin of each of the bugs in both case studies.

1.1 Goal: A Model of How Bugs Were Introduced

Based on this analysis, we propose a model of how bugs were introduced, from which the assumption that a bug was introduced by the lines of code that were modified to fix it can be derived as a specific case. The model classifies bugs into two categories: (1) intrinsic bugs: bugs that were introduced by one or more specific changes to the source code; and (2) extrinsic bugs: bugs that were introduced by changes not registered in the SCM (e.g., from an external dependency), or changes in requirements.

The proposed model will be of help in the complex task of identifying the origin of bugs, particularly, the idea of the “perfect test”. This idea is fundamental (1) to decide whether a snapshot^{Footnote 2} of a software component is affected by a bug; and (2) to identify which version of a software component exhibits the bug for the first time. Furthermore, this model is necessary for two main reasons: (1) its application in real-world cases provides the formalisms (e.g., definitions) to create a manually curated dataset with bug-introducing changes, when they exist; and (2) it can precisely define criteria to decide the first manifestation of a bug in the history of an open source software product.

The current absence of such criteria causes ambiguity about which snapshot should be considered as “exhibiting a bug”, which renders any approach to find the BIC arguable. For example, software may work properly until the system on which it runs upgrades a library it depends on (an event that might not be recorded in version control). Note that in this scenario the same snapshot does not exhibit the bug before the library upgrade, but exhibits the bug after.

In such a case, the lines changed by the BFC were not the cause of the bug (these lines were correct until the upgrade). Our proposed model establishes criteria that allow researchers to determine that the snapshot after the upgrade did not introduce the bug, but exhibited it for the first time.

In the previous example, the snapshot that first exhibited the bug was the one that was run after the library upgrade. However, which snapshot exhibits the bug? The one before the library upgrade, or any version that exhibits the bug after the library upgrade? Currently, there is no common way to assess whether the changes identified as first exhibiting the bug by current approaches (Śliwerski et al. 2005; Kim et al. 2006; Thung et al. 2013) are true/false positives/negatives, since these approaches do not take this scenario into account.

Hence, in this paper, we set out to address the following question:

“How can we identify the origin of a defect based on information in source control systems?”

1.2 Research Questions

In particular, to answer our central question, we first defined specific criteria that help determine whether a change in the source code introduced a bug, and the moment this change was introduced. Then, we studied these criteria in some real-world cases. Thus, we addressed the following research questions (RQs):

RQ1: Are there criteria to help researchers find a useful classification of changes leading to bugs? Motivation: Our designed model provides defined criteria to decide whether a certain bug is present in a snapshot. However, we need to ensure that these criteria can be applied to real-world projects to determine whether a change in the source code introduced a bug. Thus, we used the model to understand and classify the root cause of 116 bugs. This process produced two manually curated datasets that contain a collection of bugs, and information on a) the change to the source code that introduced the bug, or b) the absence of such a change.
RQ2: Do these criteria help in defining precision and recall in four existing SZZ-based algorithms for detecting bug-introducing changes? Motivation: The positive answer to RQ1, at least for some cases, helped us create manually curated datasets that may be considered as the “ground truth” for some bugs. We use these “ground truth” datasets to compute the performance (in terms of precision, recall and F-score) of four existing SZZ-based algorithms that identify BICs, and to compare them against each other. The analysis of the results helps to find ways to improve them.
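
To make the measures named in RQ2 concrete, the following is a minimal sketch that uses the standard set-based definitions of precision, recall and F-score. It assumes that both the algorithm output and the ground truth are sets of commit identifiers; the identifiers and the function name are hypothetical, and the exact counting rules used later in the evaluation may differ.

```python
def precision_recall_f1(predicted, actual):
    """Compare predicted BICs against ground-truth BICs (sets of commit ids)."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical commit ids, for illustration only.
ground_truth = {"c3", "c7", "c9"}
algorithm_output = {"c3", "c5", "c7", "c8"}

precision, recall, f1 = precision_recall_f1(algorithm_output, ground_truth)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case two of the four reported commits are correct and two of the three real BICs are found, so precision is 2/4 and recall is 2/3.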

1.3 Contributions

This work is a further development of our preliminary work (Rodríguez-Pérez et al. 2018b), which we are extending with the following main results, based on prior literature and empirical findings:

1. A model that, given a BFC, describes when the corresponding bug was introduced, consisting of (i) a set of explicit assumptions on how bugs were introduced, (ii) specific criteria for deciding whether a bug is present in a snapshot, (iii) a process for determining which change in the source introduced the bug, or the knowledge that it was not introduced by a change, and (iv) a proposed terminology of the components that play a role in the bug introduction process.
2. An operationalization of the process to determine which change first exhibited the bug that can be used to (i) classify the bug as intrinsic or extrinsic, and (ii) identify the first snapshot that contains the bug.
3. A unified terminology with all relevant concepts involved in the origin of bugs. A common terminology is needed because we have found in the literature that scholars use different wording for the same concepts or, even worse, use the same wording for different concepts. This situation hinders the understanding of the bug origin problem and can be solved with a unified terminology.
4. An empirical study on two open source software systems (ElasticSearch and Nova) that exemplifies how our model and operationalization can be applied to two real open source projects. The result of this study is a manually curated reference dataset that annotates a set of bug-fixing changes with the change that introduced the bug, or with the absence of such a change (in our case we do it for a collection of 116 bug reports).
5. An evaluation of the performance of four existing SZZ-based algorithms for the identification of BICs. This evaluation provides further insights on how these algorithms could be improved.

The rest of this paper is structured as follows. We first introduce some motivating examples in Section 2 to support the convenience of developing a model to describe how bugs were introduced. Related work is presented in Section 3. Then, we introduce the general framework and the assumptions we consider, in Section 4. Section 5 describes the model, the associated terminology and the process to determine which change first exhibited the bug. Then, Section 6 details the operationalization of this process. Section 7 introduces the case studies and the empirical results. Section 8 discusses potential applications, guidelines and improvements, and reports on threats to validity. Finally, we draw conclusions and point out potential future research in Section 9.

2 Background and Motivation Examples

Software is prone to defects due to its inherent complexity and the developers’ difficulties to understand its design (Itkonen et al. 2007). Therefore, defects and how they are introduced in code have been an active area of research (see Basili and Perricone (1984), Mockus and Weiss (2000), and Boehm and Basili (2005) for some seminal work on the matter of understanding and classifying how defects are introduced). According to IEEE Standard 1044 (2009), a defect is “an imperfection or deficiency in a work product where that work product does not meet its requirements or specifications and needs to be either repaired or replaced”. When the defect is present in software, it is considered a “fault” (manifestation of an error in software). A defect/fault can be introduced in different phases of a software product life (e.g., planning, coding, deployment) due to many reasons, such as missing or changing requirements, wrong specifications, miscommunication, programming errors, time pressure, poorly documented code, among others (Nakajo and Kume 1991; Jacobs et al. 2007; Nuseibeh and Easterbrook 2000). When the software is executed and the system produces wrong results, defects may lead to failures, described as “[e]vents in which a system or system component does not perform a required function within specified limits” (Institute of Electrical and Electronics Engineers and IEEE Computer Society. Software Engineering Standards Committee 2009). Developers, and in many cases researchers too, typically use the term “bug” to refer both to defects/faults (deficiencies) and failures (their manifestation), depending on the context. For example, “fixing a bug” usually means “fixing a failure by correcting the faulty code” while “reporting a bug” means “reporting a failure”. A single fault may lead to several failures and, in some cases, a single failure may be caused by several faults. 
Throughout this paper, we will generally use the term “bug”, specifying whether we refer to failures or faults when that is relevant and not obvious from the context. We will also assume that “a bug is fixed” means that “a failure was fixed by correcting at least one fault”. In general, we will be interested in the first fault (by order of introduction in the source code), in case there is more than one causing a failure.

However, neither IEEE 1044 (2009) nor ISO/IEC 9126 (2001) provides a way of determining whether some code can be considered buggy (or faulty) when it was written. Of course, researchers and developers may know if some code is considered faulty when a certain failure is fixed, but that is not enough to know whether it could also be considered faulty when it was written, or whether at that time it was perfectly correct, according to the context of the system at that moment. The lack of definitions and some previously unconsidered origins^{Footnote 3} for bugs (Rodríguez-Pérez et al. 2018b) make it difficult to identify correctly which change introduced a fault, and even whether the fault was introduced by that change or by a later change in the context of the system. Furthermore, with a precise definition of “introducing a fault” (from now on, “introducing a bug”), researchers can identify whether a change that exhibits a given bug is also the change responsible for introducing it (i.e., the bug-introducing change (BIC)) or whether this change corresponds to the first time that the system manifested the bug. In other words, the fact that before a given change the system does not exhibit a bug, but after it, the bug appears, is not enough to consider that the change introduced the bug.

We will refer to this latter case with the concept of “first-failing change” (FFC), in the sense that this change did not introduce the bug, but there was a “first-failing moment” (FFM), not recorded in the SCM, in which the bug manifested itself for the first time. Thus, in this work, when there is an intrinsic bug, the bug-introducing change, the first-failing change and the first-failing moment are the same (see Fig. 1). However, when there is an extrinsic bug, there is no bug-introducing change in the SCM and the first-failing change is the commit in our SCM right after the first-failing moment occurs (see Fig. 2).
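
The distinction between a BIC and an FFC can be illustrated with a small sketch. It is a simplification under stated assumptions, not the process defined later in the paper: each snapshot is reduced to a code string and an environment string, `perfect_test` is a hypothetical oracle that returns `True` when the bug is not observed, and a bug is treated as extrinsic when the previous snapshot's code also fails once it is run in the new environment. All names and data are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Snapshot:
    commit: str       # identifier of the change that produced this snapshot
    code: str         # toy stand-in for the source tree
    environment: str  # toy stand-in for the external context at that time


def classify_bug(
    history: List[Snapshot],
    perfect_test: Callable[[str, str], bool],
) -> Tuple[str, Optional[str]]:
    """Return (kind, commit) for the first snapshot that exhibits the bug.

    For "intrinsic" the commit is the BIC (which is also the FFC);
    for "extrinsic" the commit is only the FFC and no BIC exists.
    """
    for index, snapshot in enumerate(history):
        if perfect_test(snapshot.code, snapshot.environment):
            continue  # bug not observed in this snapshot
        if index > 0:
            previous = history[index - 1]
            # If the unchanged previous code also fails in the new environment,
            # the failure comes from outside the recorded code changes.
            if not perfect_test(previous.code, snapshot.environment):
                return "extrinsic", snapshot.commit
        return "intrinsic", snapshot.commit
    return "not observed", None


def perfect_test(code: str, environment: str) -> bool:
    """Hypothetical oracle: True means the bug is NOT observed."""
    if "off_by_one" in code:
        return False
    return not ("old_endpoint" in code and environment == "api_v2")


extrinsic_history = [
    Snapshot("c1", "old_endpoint", "api_v1"),
    Snapshot("c2", "old_endpoint", "api_v1"),
    Snapshot("c3", "old_endpoint", "api_v2"),  # external API changed before c3
]
intrinsic_history = [
    Snapshot("d1", "old_endpoint", "api_v1"),
    Snapshot("d2", "old_endpoint off_by_one", "api_v1"),
]

print(classify_bug(extrinsic_history, perfect_test))
print(classify_bug(intrinsic_history, perfect_test))
```

In the first history the code never changes, so the sketch reports `c3` only as the first-failing change; in the second, `d2` changes the code and is reported as the bug-introducing change.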

Extrinsic bugs are caused by changes that are not recorded in the SCM. These bugs are not the result of introducing faulty code, but might be due to incorrect assumptions, changes in requirements, dependencies on the run-time environment, changes to the environment, or bugs in external APIs, among others. As far as we know, this kind of bug has not been studied before from the perspective of its introduction; this work aims to offer more insights into such bugs. In the next examples, we show some extrinsic bugs and motivate the interest in researching them.

Example 1

Figure 3 shows a bug report from the ElasticSearch project.^{Footnote 4} The bug occurred when downloading a site plugin from GitHub. In this case, the dependency of the source code of ElasticSearch on the GitHub API caused the bug. Around seven months after inserting the original lines, the GitHub API changed and the source code in ElasticSearch became buggy because the plugin no longer worked. Figure 4 shows the lines modified to fix the bug. The original version of these lines did not introduce the bug, but they are the lines where the bug manifested itself (after the change in the GitHub API). Thus, there is no change to the source code of ElasticSearch itself that introduced the bug because when those lines were introduced the GitHub API worked as the developer expected. Table 1 summarizes the existence of the bug-introducing change, first-failing change and first-failing moment in this example.

Table 1 First-failing moment (FFM), first-failing change (FFC) and bug-introducing change (BIC) in Example 1

Example 2

Figure 5 offers another bug report from ElasticSearch^{Footnote 5}. This bug pertains to setting permissions in subdirectories; it was caused by the post-installation script setting all data permissions to 644 inside of /etc/elasticsearch, and failing to set appropriate permissions (755) to subdirectories. The only line that was modified to fix this bug was line 37 (see Fig. 6). However, as directories did not exist in /etc/elasticsearch when the original version of line 37 was introduced, we can conclude that there is no BIC. Table 2 summarizes the existence of the bug-introducing change, first-failing change and first-failing moment in this example.

Table 2 First-failing moment (FFM), first-failing change (FFC) and bug-introducing change (BIC) in Example 2

Example 3

Some bugs manifest themselves if the software is used in a different environment than it was intended for. Figure 7 shows a bug report in Nova describing a failure when using Windows Server 2012; Windows Server 2012 introduced support for projecting a virtual NUMA topology into Hyper-V virtual machines. Here, as well, there is no BIC, and the manifestation of the bug depends on the environment used. Table 3 summarizes the existence of the bug-introducing change, first-failing change and first-failing moment in this example.

Table 3 First-failing moment (FFM), first-failing change (FFC) and bug-introducing change (BIC) in Example 3

The bug in Example 1 manifested itself due to a change to an external artefact upon which the software depends. The bug in Example 2 manifested itself due to an incorrect assumption (in this case, an omission of a requirement). Example 3 shows a bug caused by a change in the environment, as the bug manifested when the software was used on a platform it did not officially support at the time of writing the code. These cases are examples of extrinsic bugs, in which there is no bug-introducing change causing the bug.

As we can observe, extrinsic bugs are not the result of an explicit change in the SCM. Thus, it is necessary to develop new models to describe their origin.

3 Related Work

Traditionally, in mining software repositories, researchers identify the lines of source code that introduced the bug assuming that the last change that touched the fixed line(s) in a bug-fixing change (BFC) introduced the bug (Zeller et al. 2011; Śliwerski et al. 2005; Williams and Spacco 2008). Thus, the introduction of bugs has been studied over the last years from the BFC backward by using two different methods: dependency-based and text-based methods.

Dependency-based approaches use changes in the relationship between control and data in the code. Ottenstein and Ottenstein proposed the first program dependence graph to be used in software engineering (Ottenstein and Ottenstein 1984). This approach achieves higher accuracy than text-based approaches (Sinha et al. 2010) in identifying the bug-introducing change (BIC), taking into account the semantics of the source code, because it addresses some of the limitations of text-based approaches (Davies et al. 2014). However, dependency-based approaches are not appropriate for identifying the origins of all bugs because they have some implementation challenges. For instance, these approaches cannot identify the BIC when the BFCs do not change the method’s dependencies.

On the other hand, text-based approaches are more popular for identifying the BIC since they pose fewer implementation challenges (Davies et al. 2014); thus, this related work section focuses on them. Text-based approaches use textual differences to discover the lines added, deleted and modified between the BFC and its previous version, and then backtrack the modified and deleted lines to identify the change that introduced the bug. The approach proposed by Sliwerski, Zimmermann, and Zeller (SZZ) is a popular text-based algorithm (Rodríguez-Pérez et al. 2018a), improving on previous text-based approaches (Čubranic and Murphy 2003; Fischer et al. 2003a; 2003b). As such, it assumes that the last change that touched the fixed line in a BFC introduced the bug (Śliwerski et al. 2005) and relies on historical data to identify changes in the source code that introduced bugs. For that, the algorithm links the SCM and the ITS in order to identify the BFC, and then it identifies the BIC. To that end, it employs the diff functionality to determine the lines that have been changed between the BFC and its previous version, and the blame functionality to identify the last change(s) to those lines. Finally, it uses a time window from the bug report date until the BFC date to remove false positives.
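
The diff, blame and time-window steps described above can be sketched on a toy, single-file history. This is an illustration under simplifying assumptions rather than any published SZZ implementation: the history is an in-memory list of `(commit, lines)` pairs, blame is approximated with Python's `difflib`, and the commit ids, timestamps and helper names are hypothetical.

```python
import difflib


def blame(versions):
    """Return, for each line of the last version, the commit that last wrote it."""
    origins, previous_lines = [], []
    for commit, lines in versions:
        matcher = difflib.SequenceMatcher(a=previous_lines, b=lines, autojunk=False)
        new_origins = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                new_origins.extend(origins[i1:i2])  # unchanged lines keep their origin
            else:
                new_origins.extend([commit] * (j2 - j1))  # new or rewritten lines
        origins, previous_lines = new_origins, lines
    return origins


def szz_candidates(versions, fix_index, commit_time, report_time):
    """Blame the lines that the bug-fixing change modified or deleted."""
    parent_history = versions[:fix_index]
    parent_lines = parent_history[-1][1]
    fixed_lines = versions[fix_index][1]
    origins = blame(parent_history)
    matcher = difflib.SequenceMatcher(a=parent_lines, b=fixed_lines, autojunk=False)
    suspects = set()
    for tag, i1, i2, _, _ in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            suspects.update(origins[i1:i2])
    # Time window: a change made after the bug was reported cannot have caused it.
    return {c for c in suspects if commit_time[c] <= report_time}


# Hypothetical history of one file; c3 is the bug-fixing change.
versions = [
    ("c1", ["def area(w, h):", "    return w + h"]),
    ("c2", ["def area(w, h):", "    # compute area", "    return w + h"]),
    ("c3", ["def area(w, h):", "    # compute area", "    return w * h"]),
]
commit_time = {"c1": 1, "c2": 2, "c3": 5}

print(szz_candidates(versions, fix_index=2, commit_time=commit_time, report_time=3))
```

Because the sketch only follows lines that the fix modified or deleted, a fix that consists purely of added lines yields no candidates, and a fix for an extrinsic bug still yields a candidate even though no BIC exists.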

Since the inception of SZZ two main improvements have been proposed: Kim et al. used annotation graphs to reduce false positives and gain precision by excluding comments, blank lines, and format changes from the analysis (Kim et al. 2006); and Williams and Spacco improved the line mapping algorithm of SZZ by using weights to map the evolution of a line (Williams and Spacco 2008). Many studies have largely used these SZZ algorithms to predict, classify and find bugs. Kamei et al. proposed a model to identify defect-prone changes instead of defect-prone files or defect-prone packages; this model allows developers to review these risky changes while they are still fresh in their minds, which is known as ‘Just-in Time Quality Assurance’ (JIT) (Kamei et al. 2013). Kim et al. showed how to classify file changes as buggy or clean using change information features and source code terms (2008). Tantithamthavorn et al. studied how to improve the bug localization performance assuming that a recently fixed file may be fixed in the near future (2013). Nagappan et al. used the SZZ idea of mapping as the base to associate metrics with post-release defects, and built regression models to predict the likelihood of post-release defects for new entities (2006). Zimmermann et al. used the SZZ to predict bugs in large software systems (2007).

Recently, Da Costa et al. have made an important effort by proposing a framework for evaluating the results of five SZZ implementations. This framework assesses the data generated by SZZ implementations and flags changes as not likely to be BICs. For that, this framework relies on three criteria: (1) the earliest bug appearance, which is related to the number of disagreements that SZZ has with the reported affected version; (2) the impact that a BIC has on future bugs; and (3) the likelihood that the BIC given by SZZ is the real cause of the bug, computed as the difference in days between the first and the last suspicious BICs; if this difference is several years, the commit is removed. Their findings showed that current SZZ implementations still lack mechanisms to correctly identify real BICs (da Costa et al. 2017). In this work, we describe how to use our model to identify real BICs, which is one of the major problems of SZZ algorithms. While Da Costa et al. base their study on the reliability of SZZ results with computing metrics, our aim is to describe a model that can help to reason about whether an earlier change in the SCM caused the bug.

Furthermore, Campos Neto et al. have studied the impact of refactoring changes on SZZ and have proposed the RA-SZZ implementation (Refactoring Aware-SZZ). Refactoring changes are one of the major limitations of SZZ since the algorithm blames them as bug-introducing changes when, in fact, these changes did not introduce the bug because they did not change the system behavior. The authors observed that 6.5% of the lines blamed as BICs by SZZ were refactoring changes and that 19.9% of the lines removed in a BFC were related to refactoring changes (2018). In addition, Campos Neto et al. re-evaluated the RA-SZZ implementation on the Defects4J dataset and observed that 44% of the lines identified as BICs by RA-SZZ are very likely to be real BICs. However, there exist refactoring operations (31.17%) and equivalent changes (13.64%) that are misidentified by RA-SZZ (2019). While Campos Neto et al. assumed that the BIC should be in the evolutionary history of the lines that have been changed in a BFC, our work takes a step back to understand how bugs were introduced and describes a model that can help with this identification. In our model, the evolution history of the lines that have been changed in a BFC can be derived as a specific case of how bugs were introduced.

More recently, Sahal and Tosun proposed a way to link the code additions in a fixing change to a list of candidate BICs (2018). The authors state that their approach works well for linking code additions with previous changes, although it still produces many false positives since this approach assumes that the BIC is one of the changes surrounding the new additions in a BFC. Our model helps researchers to understand whether an incomplete change caused a bug and then, the BFC fixed this bug by adding only new lines of source code. However, our model does not assume that the BICs have to be the changes surrounding the new additions.

In addition, other studies observed serious limitations when using both dependency-based and text-based approaches. These limitations are addressed in the model proposed in this work. Murphy-Hill et al. observed that when developers fix bugs, they have different options as to how to fix them and each decision may lead to a different location where a bug was introduced (2015). Qualitatively, the authors showed the many factors that influence how bugs are fixed, most of them being non-technical. These factors may affect bug prediction and localization because the bug fix may not be at the same location as the bug, or because the bug fix might be covering the symptom and not the cause of the bug. Rodríguez-Pérez et al. performed a systematic literature study on the use of the SZZ algorithm and quantified its limitations (2018a). Prechelt and Pepper offered an overview of the limitations of the text-based approaches when they are used for Defect-Insertion Circumstance Analysis (DICA) (2014). The authors observed that BFCs may have touched non-buggy lines, and even when they touched those lines, the actual BIC may have been made earlier. Also, they stated that bugs and issues are not easy to distinguish in bug trackers, causing low reliability when mapping BFCs to BICs. In particular, the precision of mapping BFCs to BICs in their case study was only 50% due to changes considered as bugs that, in fact, were not bug reports (e.g., feature request, refactoring). Furthermore, other authors highlighted limitations in mapping BFCs to BICs due to some characteristics of the software that can negatively affect textual approaches. For example, German et al. investigated bugs that manifested themselves in unchanged parts of the software and their impact across the whole system (2009). Chen et al. studied the impact of dormant bugs (i.e., bugs introduced in a version of the software system, but not found until much later) on bug localization (2014).
As opposed to the previous studies that have relied on the lines modified in the BFCs to identify the BIC, this study proposes (1) a model that helps researchers to reason about whether the origin of a bug is intrinsic or extrinsic; and (2) a way for researchers to operationalize the model to identify the BIC, when it exists. Our preliminary approach (Rodríguez-Pérez et al. 2018b) was the seed to extend the work and provide a more comprehensive description of how to correctly identify BICs. Furthermore, in this work we detail the process of using the model and its operationalization to build reliable datasets that can be used to evaluate four existing SZZ-based algorithms.

4 The Framework and its Assumptions

Given a bug-fixing change (BFC), identifying its bug-introducing change (BIC) is not necessarily straightforward, as bugs can have different origins, as shown in Section 2. Thus, in order to identify when and how bugs were introduced, we designed a model that consists of a framework based on five assumptions. These assumptions enable the framework to describe the first time that the software exhibited the bug according to a BFC.

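The model we propose is based on the following five assumptions: (1) there is version control for the software; (2) there are means to identify the bug-fixing change (BFC); (3) it is possible to know whether a bug is present in the system or not; (4) it is possible to identify a candidate for the bug-introducing change (BIC) that corresponds to the bug-fixing change; and (5) the fix is perfect.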

The first assumption allows researchers to track how code changes as it evolves, and to recover any past version of it. The second one enables researchers to identify the BFC, and to link it to the contextual information of how the bug was fixed. The third assumption permits researchers to know when the software exhibited the bug that was fixed in the BFC. The fourth one allows researchers to identify whether the bug has been previously introduced in the SCM. And the fifth assumption enables researchers to decide that the bug is no longer present in the BFC snapshot, but it was present in a previous snapshot.

These assumptions can, to some extent, be implemented with today’s technologies and processes. For some of them, however, we required theoretical conceptualizations and simplifications, as we discuss in an extensive way in the subsequent sections. We, therefore, offer details on how the model implemented each assumption. Furthermore, we inform researchers about known limitations and possible solutions for all assumptions. In those cases where an assumption, due to its theoretical or practical novelty, was elaborated more, we also provide context and introduce the necessary definitions and concepts.

4.1 The Model Assumes that there is Version Control for the Software

4.1.1 Implementation

The model assumes that the development history of the project is recorded in the source code management systems (SCM), and that the record is complete, i.e., it starts from the very first change^{Footnote 6} to the code. Thus, all changes can be tracked because they were done via a version control system (VCS) tool (such as git). For each change we can recover the state of the system (i.e., snapshots of the system) before and after applying that change; and retrieve the differences between the two snapshots.
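As an illustration of what this assumption provides, the following minimal sketch recovers the snapshots before and after a change and the differences between them. The in-memory `history` list and the helper names are hypothetical stand-ins for a real SCM such as git; they are not part of the model itself.

```python
import difflib

# Hypothetical toy history: (change id, full file content after the change).
history = [
    ("c1", "def area(w, h):\n    return w + h\n"),
    ("c2", "def area(w, h):\n    return w * h\n"),
]

def snapshots_around(history, change_id):
    """Return the (before, after) contents for the given change."""
    ids = [cid for cid, _ in history]
    i = ids.index(change_id)
    # Before the very first change the file is assumed to be empty.
    before = history[i - 1][1] if i > 0 else ""
    after = history[i][1]
    return before, after

def diff_of(history, change_id):
    """Unified diff between the snapshots before and after a change."""
    before, after = snapshots_around(history, change_id)
    return list(difflib.unified_diff(
        before.splitlines(), after.splitlines(),
        fromfile="before", tofile="after", lineterm=""))

diff_c2 = diff_of(history, "c2")
```

With a complete history, the same two operations (recover a snapshot, compute a diff) are available for every change back to the first one.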

4.1.2 Limitations and Solutions

Nowadays, the history of a project is recorded in SCM, enabling researchers to reconstruct the process by which the software project was created (Bird et al. 2009). Although old software projects can migrate their history from previous repositories, the migration may not be complete (Gonzalez-Barahona et al. 2014). In addition, the use of SCM imposes some possible limitations, because the recorded history can differ from how the software was actually created. For example, changes may have been reordered, deleted or edited (Bird et al. 2009). In particular, commits in a pull-request might be reworked (in response to comments), and only those that are the result of the peer-review can be observed (Kalliamvakou et al. 2014). Another aspect to take into consideration is the effect of gatekeepers, who act as a filter/dispatcher for the incoming changes (Gousios et al. 2015; Canfora et al. 2011).

4.2 The Model Assumes that it has means to Identify the Bug-Fixing Change (BFC)

4.2.1 Implementation

When a bug report is closed by a BFC, the model assumes that it has means for linking the BFC with the bug report. If the system also uses a code review system, the model assumes there is a way to find the discussion corresponding to a given BFC. Therefore, a bug report can be linked to its BFC and the information related to its review.
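One simple way such a link could be established is by matching report identifiers in commit messages. The sketch below assumes a hypothetical convention in which messages mention a report as "#123", "bug 123" or "issue 123"; real projects follow different conventions, and this heuristic is not prescribed by the model.

```python
import re

# Assumed (hypothetical) convention for referencing a report in a commit message.
BUG_REF = re.compile(r"(?:#|\b(?:bug|issue)\s+#?)(\d+)", re.IGNORECASE)

def linked_reports(message):
    """Report ids referenced in a commit message, sorted and de-duplicated."""
    return sorted({int(n) for n in BUG_REF.findall(message)})

def link_bfcs(commits, bug_report_ids):
    """Map each known bug report id to the commits whose message references it."""
    links = {rid: [] for rid in bug_report_ids}
    for sha, message in commits:
        for rid in linked_reports(message):
            if rid in links:
                links[rid].append(sha)
    return links

# Hypothetical commits and report ids.
commits = [
    ("a1", "Fixes #42: handle empty input"),
    ("b2", "Refactor parser"),
    ("c3", "Closes bug 1337"),
]
links = link_bfcs(commits, [42, 1337, 7])
```

Reports left with an empty list (here, report 7) are exactly the cases where the link is missing and additional tooling or manual inspection would be needed.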

4.2.2 Limitations and Solutions

Several studies that focus on issue tracker systems used to collect bug reports or feature requests have demonstrated that a substantial part of bug notifications are not correctly categorized, and are functionality requests or suggestions for refactoring. Herzig et al. reported 33.8% (2013), while Rodríguez-Pérez et al. reported up to 40% (2016). In addition, Herzig et al. pointed out that 39% of files marked as defective have never had a bug (2013).

Furthermore, even when the bug notifications are correctly identified as bug reports, previous studies indicate several limitations in linking the BFC with the bug report. For example, the fixing commit may not be linked to the bug (Bird et al. 2009), or the fixing commit may be linked to the wrong bug report, so that the two do not correspond to each other (Bissyande et al. 2013).

A number of tools have been developed to increase the linkage between bugs and fixes, among others, EpiceaUntangler (Dias et al. 2015), BugTracking (Rodríguez-Pérez et al. 2016), Relink (Wu et al. 2011), Rclinker (Le et al. 2015), or Frlink (Sun et al. 2017). The model can use them in order to reduce these limitations, at least partially.

4.3 The Model Assumes that it is Possible to Know Whether a Bug is Present in the System or Not

4.3.1 Definitions and Concepts

To study the origin of bugs, our model needs to unequivocally determine if the bug is present for any given snapshot of the software system. In this way, we will be able to know when the bug appeared and when it has been fixed.

We need to consider what it means that “the bug is present”. Since there is no definition for ensuring that a bug is present in a snapshot, we build upon the definition of “defect” by IEEE Standard 1044 (2009):

“Defect: An imperfection or deficiency in a work product where that work product does not meet its requirements or specifications and needs to be either repaired or replaced.”

We will slightly adapt this definition in three ways: i) we will use the term “bug”, ii) we are only concerned with “software products”, and iii) we will add temporal behavior, by adding “at the moment of producing the snapshot”. The adapted definition will be as follows:

“Bug: An imperfection or deficiency in a software product where that software product does not meet its requirements or specifications, as defined at the moment being considered, and needs to be either repaired or replaced.”

Therefore, to know if a bug is present in a certain snapshot of the product, the model will check if it meets requirements or specifications at the moment of the production of the snapshot. This introduces an essential aspect as some lines of code might be considered a bug for a certain snapshot, because of the specifications at that point. However, the exact same lines could be considered correct if present in another snapshot if at that point some other specifications were applicable and were met (e.g., in Example 3 in Section 2).

As a result, we can define: A bug was present for the first time in the first snapshot where the fixed code can be considered incorrect in any branch that ends merged in the BFC’s branch, according to the specifications applicable to that snapshot. This definition considers that the bug can propagate several times, e.g., in multiple branches that lead to the BFC.

When developers fix a bug, they can write a test that fails if the bug is present (Beller et al. 2018). Thus, if developers could run such a test for every snapshot, they would see that the bug is not present in those snapshots where the test passes. We consider a test as perfect, if it can be run on any past version of the software.

This perfect test is a theoretical construct that may be challenging to create in practice. However, it provides an essential and precise definition of “faulty code at the time of writing it”. Furthermore, this perfect test can be seen as a kind of regression test^{Footnote 7} which will evolve and adapt depending on the software’s changing circumstance (e.g., dependencies, APIs, even requirements) for each past version. The perfect test would encompass all the knowledge about the behavior of the software in the past, thus forming an oracle for each previous version.

4.3.2 Implementation

Our model assumes that it is possible to know whether a bug is present in a system or not by using perfect tests. These tests would create a signal that pinpoints when the bug was present. For that, they can also be used with past snapshots, before the bug was fixed. Theoretically, these perfect tests would fail according to our previous definition^{Footnote 8}.

The idea of perfect knowledge replicates the idea of the global observer in distributed systems (Chandy and Lamport 1985); it is an idealized situation (i.e., difficult or even impossible to implement), but a beneficial concept for reasoning about the system, and for comparing practical implementations and algorithms.

In order to run the tests for previous snapshots, these tests might have to be updated “for past conditions”, i.e., they have to be adapted to structural changes in the system under test (Moonen et al. 2008). In addition to the tested module, the tests need their dependencies: libraries, compilers or interpreters, external components and maybe even services accessed via remote APIs (Zaidman et al. 2008, 2011). Thus, a test fails or passes not only for a certain snapshot, but for a certain snapshot of all those dependencies.

Dependencies can be considered as a part of the requirements (Mens et al. 2005): the module is expected to work, at any given moment, with a certain set of dependencies. Thus, the test should pass for that set. However, when dependencies change, the test may start failing, even if it is run on the same snapshot (Zaidman et al. 2011; Demeyer et al. 2002; Moonen et al. 2008; Marsavina et al. 2014; Palomba and Zaidman 2017, 2019). For example, the module can be expected to work with Python 2, but at some point the project decides that it should also run with Python 3. That will break large parts of the code, and many tests will fail when the new interpreter is introduced. Therefore, tests need to evolve to take into account the new dependency, in the same way they need to evolve to take into consideration any change in requirements.

Thus, the final definition of bug that we use in this work is:

“Bug: An imperfection or deficiency in a software product that causes a given test to fail. The test will be defined for each snapshot of the product, according to the requirements and specifications applicable for that snapshot, and for the dependencies supported in it, and will fail for a snapshot only if the bug is present in that snapshot.”

Although this definition may be difficult to implement in practice, it provides an accurate test to know when a bug is present, and therefore, when it is introduced. Assuming the model has perfect knowledge about the requirements, specifications, dependencies, and perfect tests are available, it can clearly describe when the bug is present, and from there on, it also knows when the bug was introduced, and how.

4.3.3 Limitations and Solutions

Gathering information about the requirements, documentation or dependencies of a project in previous versions is not always easy, as shown by Zaidman et al. (2011). Some projects use build tools such as Maven or Gradle, and researchers can analyze the build scripts looking for dependencies or plugins that have changed. But in other cases there is no formal record of such information. Thus, in the usual case a perfect test is not feasible. However, the contextual information found in issue tracker systems, code review systems and version control systems may help to write the tests, and to identify the origin of bugs.

Knauss et al. studied how the open communication paradigm in software ecosystems provides opportunities for ‘just-in-time’ requirement engineering (RE) (2014). They propose T-Reqs, a tool based on git that enables agile cross-functional teams to be aware of requirements at system level and allows them to efficiently propose updates to those requirements (Knauss et al. 2018). This tool can support successful implementation in our model, since researchers can match changes/updates in the requirements with the changes in the source code and then, our model can use this information to build the perfect knowledge.

4.4 The Model Assumes that it is Possible to Identify a Candidate of the Bug-Introducing Change (BIC) that Corresponds to the Bug-Fixing Changes

4.4.1 Implementation

To identify the BIC, the model assumes that there is a perfect test for the fixed bug. Any approach that uses the representation of the model should start by analyzing how to link the BFC to the contextual information of how the bug was fixed. Then, it can start looking for the corresponding BIC.

Finally, once the approach has the test for each snapshot, it runs the test on all the previous snapshots until it finds the first snapshot that fails according to a BFC, or until the test cannot be run or built because the tested functionality is not implemented yet.

The theoretical possible outputs of the test are:

The test passes for all snapshots. This means that the bug was never present until the BFC. This is impossible because if the test is perfect, that would mean there was no bug to fix. So, the model ignores this case.
The test fails for at least some of the snapshots. This means that there will be a first snapshot for which the test fails. That snapshot will be the candidate BIC. It can be no other, because if the bug was in an earlier or later snapshot, the test would also fail for it.
The test is not-runnable or not-building. The model does not consider these scenarios since it assumes that perfect tests can be updated to previous snapshots.
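The search for the candidate can be sketched as a backward walk over a linear history. This is a minimal illustration under idealized conditions: the perfect test is assumed to exist as a callable returning pass or fail, and the snapshots are toy implementations of a hypothetical `area` function.

```python
def find_candidate(snapshots, perfect_test):
    """Return the index of the candidate BIC in a linear history.

    snapshots: list ordered oldest to newest; the last one is the BFC snapshot.
    perfect_test: callable returning True when the test passes on a snapshot.
    """
    results = [perfect_test(s) for s in snapshots]
    bfc = len(results) - 1
    if bfc < 1 or not results[bfc] or results[bfc - 1]:
        # Either the fix does not pass, or nothing fails right before it.
        raise ValueError("no failing snapshot right before a passing BFC")
    i = bfc - 1
    # Walk backward while the previous snapshot also fails.
    while i > 0 and not results[i - 1]:
        i -= 1
    return i

# Toy history of a hypothetical area() function.
snapshots = [
    lambda w, h: w * h,   # 0: correct
    lambda w, h: w + h,   # 1: bug appears
    lambda w, h: h + w,   # 2: bug still present
    lambda w, h: w * h,   # 3: BFC
]
candidate = find_candidate(snapshots, lambda area: area(2, 3) == 6)
```

The function rejects the first theoretical output (the test never fails before the BFC) by raising an error, mirroring the fact that the model ignores that case.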

Once there is a candidate for the BIC, researchers can analyze why the test failed and determine whether this change introduced the bug or not:

If there was no change in the source code that made the test fail, but the reason for the failure of the test was a change in requirements, specifications or dependencies, the candidate BIC is not responsible for introducing the bug. The change will be considered as the FFC. The model assumes that the bug is extrinsic because there is no new code causing the test to fail – the code introduced was correct (at least with respect to this bug).
In any other case, the model assumes that the bug is intrinsic because the change includes code that causes the test to fail. Therefore, the candidate BIC is the BIC.
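One possible way to mechanize this decision, offered here only as an assumption and not as the procedure used in this work, is a counterfactual check: run the test on the candidate's source code under the environment (requirements, specifications, dependencies) of the preceding snapshot. All names below are hypothetical.

```python
def classify(perfect_test, previous_env, candidate_source):
    """Label the candidate as FFC (extrinsic bug) or BIC (intrinsic bug).

    If the candidate's code still passes under the previous environment,
    the code itself is not responsible for the failure.
    """
    if perfect_test(candidate_source, previous_env):
        return "FFC"
    return "BIC"

# Hypothetical external library versions (the environment).
def extl_v1(text):
    return int(text)

def extl_v2(text):
    return int(text) + 1      # hypothetical upstream regression

# Hypothetical project snapshots (the source code).
def pa_v1(parse):
    return parse("41") + 1

def pa_v2(parse):
    return parse("41") + 2    # hypothetical faulty edit in the project

def perfect_test(source, env):
    return source(env) == 42

# Candidate changed the source while the environment stayed at extl_v1.
intrinsic_case = classify(perfect_test, extl_v1, pa_v2)
# Candidate kept the source pa_v1; only the environment moved to extl_v2.
extrinsic_case = classify(perfect_test, extl_v1, pa_v1)
```

When both the code and the environment change in the same snapshot and the code alone already fails, the sketch returns BIC, which matches the "in any other case" rule above.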

4.4.2 Limitations and Solutions

In practice, when manually inspecting the changes, we may not need perfect knowledge; we only need to be able to assert on whether the definition of a bug is fulfilled. We also need to consider that when we roll back into earlier snapshots, we could find a moment when the test cannot be run because the feature being tested was not implemented at that moment. Even in the presence of build automation tools such as Maven, it is sometimes not that easy to go back in time to rebuild a project (Zaidman et al. 2011). Moonen et al. have shown that about 2/3 of the refactoring changes from Fowler et al. (1999) can actually result in non-building test cases because the refactoring changes the original interface and the test code requires a change in the types of classes that were involved in the refactoring (Moonen et al. 2008). In contrast, Hilton et al. have recently performed a study on test coverage evolution using Continuous Integration builds (Beller et al. 2017), reporting that this modern infrastructure eases building prior versions of a software project considerably (Hilton et al. 2018).

We could consider implementing these perfect tests by automatically generating them, e.g., using EvoSuite (Fraser and Arcuri 2013a; Palomba et al. 2016). However, automatically generating tests raises a number of issues. First, the generated test may not run or build in previous snapshots. Second, the test may not be precise enough since there will be lack of information to understand and implement the specifications and requirements. In fact, even if developers can implement the perfect tests manually because they have enough information, the results are not binary, as they might return four values: Pass, Fail, Not-Runnable and Not-Building. The test should return not-runnable when the feature to test is not present, and return not-building when there is an issue with the dependencies trying to be built in that snapshot (Zaidman et al. 2011; Moonen et al. 2008).
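The four possible results can be represented explicitly, as in the following sketch. The `BuildError` exception, the dictionary-based "module" and the use of `KeyError` to signal a missing feature are hypothetical simplifications chosen for illustration.

```python
from enum import Enum

class Outcome(Enum):
    PASS = "pass"
    FAIL = "fail"
    NOT_RUNNABLE = "not-runnable"
    NOT_BUILDING = "not-building"

class BuildError(Exception):
    """Hypothetical marker: the snapshot or its dependencies could not be built."""

def run_test(build, test):
    """build() returns the built module (a dict of callables) or raises BuildError.
    test(module) returns True/False, or raises KeyError if the feature is absent."""
    try:
        module = build()
    except BuildError:
        return Outcome.NOT_BUILDING
    try:
        ok = test(module)
    except KeyError:
        return Outcome.NOT_RUNNABLE
    return Outcome.PASS if ok else Outcome.FAIL

def area_test(module):
    return module["area"](2, 3) == 6

def broken_build():
    raise BuildError("dependency cannot be resolved in this snapshot")

outcomes = [
    run_test(lambda: {"area": lambda w, h: w * h}, area_test),
    run_test(lambda: {"area": lambda w, h: w + h}, area_test),
    run_test(lambda: {}, area_test),
    run_test(broken_build, area_test),
]
```

Keeping the two non-binary outcomes separate from Fail avoids counting a snapshot as buggy merely because the test could not be executed on it.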

Nevertheless, researchers can use test generation tools such as EvoSuite (Fraser and Arcuri 2013a, b) to further investigate and solve these issues. In particular, in future work we can investigate targeted search-based strategies to update tests after, e.g., refactoring operations (Vonken and Zaidman 2012).

Finally, another limitation is the assumption that the requirements in previous snapshots were always correct. If we combine that with the assumption that the tests are perfect and we can update them for conditions in the past, we run the risk of running into faulty requirements in previous snapshots (Viller et al. 1999). If we roll back the tests in this situation, the tests are likely to not fail.

4.5 The Model Assumes that the Fix is Perfect

4.5.1 Implementation

This means that the bug is no longer present after being fixed (i.e., after the BFC), and the bug report will not be reopened in the future. To ensure that the bug is no longer in the system, the model again uses the concept of perfect tests: if the snapshot of the BFC passes the test, the model ensures that, under the same specifications and requirements, the bug has been removed. We would then have what we call perfect fixing.

4.5.2 Limitations and Solutions

Perfect fixing is not always possible in practice and the bug report might need to be reopened (Zimmermann et al. 2012; Shihab et al. 2013).

In some cases, bug reports are reopened because they were not correctly fixed. Xia et al. reported that 6%–26% of the bug reports in Eclipse, Apache HTTP and OpenOffice.org were reopened (2015). In this context, they proposed the ReopenPredictor tool, which uses various kinds of features, such as raw textual information or meta features, to build a classification-based framework and predict whether a bug report would be reopened.

Furthermore, Zimmermann et al. investigated the reasons why bug reports were reopened at Microsoft. Their findings showed that bug reports were typically reopened because either a tester did not provide enough information in the report and there was a misunderstanding about the cause of the bug, or the bug was a regression bug^{Footnote 9} (Zimmermann et al. 2012).

4.6 Summary of the Assumptions

Table 4 summarizes the need, limitations and possible solutions for each assumption of the model.

Table 4 Summary of the assumptions, their limitations and possible solutions


5 The Model

In this section, we formally define the notions introduced in Section 2. We do this with two purposes in mind: (1) to identify the first manifestation of a bug in the history of a software product and, (2) to provide the formalisms used to create and describe a manually curated dataset which can be considered as the “ground truth”. It is important to emphasize that the model is not a mathematical model solving relevant equations or characterizing the system, but it is a conceptual model that qualitatively represents the complex bug introduction process and highlights general rules and concepts. To that end, we use an example that identifies the bug-introducing change (BIC) or the first-failing change (FFC) given a bug-fixing change (BFC). This example describes a software product called Project A (PA) which uses an external library called ExtL. Figure 8 shows the model as a black box, with the information of a bug-fixing change as input and a change to the software identified as the bug-introducing change or the first-failing change as output.

5.1 Main Concepts & Unifying Terminology

We found that a unified terminology to name each of the concepts involved in identifying bug-introducing changes did not exist. We think that a common terminology would be desirable because researchers currently refer to different concepts as the same, and this can cause problems when trying to understand or reproduce previous studies. Table 5 offers a comparison of the terminology used in this work and how the concepts have been referred to in previous publications. To the best of our knowledge, no previous study has presented a comprehensive list of all these concepts and terms, nor has anyone investigated whether the terms are being used consistently.

Table 5 Comparison of our terminology with the one found in the research literature


In our terminology, developers use the source code management (SCM) system to write software in terms of commits, i.e., observable changes (additions, deletions or modifications) performed on a file (or set of files). The impact of a commit on a system can be represented as a snapshot, which is the state of the project after the commit has been performed.

Depending on the origin of the bug, we distinguish between an extrinsic bug, which has its origin in a change not recorded in its source code,^{Footnote 10} and an intrinsic bug, which has its origin in a change to the source code; this change is the bug-introducing change (BIC). Notice that extrinsic bugs do not have a bug-introducing change but a first-failing change (FFC).

To identify the bug-introducing change, we analyze the changes that fixed the bug in a bug-fixing change (BFC). To fix a bug, the bug-fixing change may add new lines or change (modify or delete) the existing ones. For a commit c, we label modified or deleted, but not added, lines as lines changed by a commit, LC(c).

If LC(BFC) ≠ ∅, we can track down whether the revision which last modified each line in LC(BFC) led to the bug that is fixed in the BFC, e.g., using tools such as “git blame”. This last revision is called the previous commit (pc).

Since the bug-fixing change can change more than one line, it is possible that different lines in LC(BFC) may have different previous commits. We will refer to PC(c) as the set of previous commits of a commit.

But, it is also possible to go further back in time and recursively analyze the previous commits of the LC(pc). These commits are referred to as descendant commits of a bug-fixing change, DC(BFC). The previous commits are the commits that last modified the lines changed in the bug-fixing change; the descendant commits are all the commits that previously modified the lines changed in the bug-fixing change. The remaining commits in the source code management of a software product from the bug-fixing change backwards are the ancestor commits, AC(BFC), which also include the previous and descendant commits. Formally,

\( PC(BFC) \cup DC(BFC) \subseteq AC(BFC)\).
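The three sets can be computed on a toy history as follows. The `order` list and the `previous` mapping are hypothetical: `previous[c]` stands in for what a tool such as “git blame” would report for the lines in LC(c). The sketch also adopts one reading of the definition, in which DC(c) contains the commits reached by recursively following previous commits beyond PC(c).

```python
# Hypothetical linear history, oldest first.
order = ["c1", "c2", "c3", "c4", "c5", "bfc"]

# previous[c]: commits that last modified the lines in LC(c) (assumed blame data).
previous = {
    "c1": set(), "c2": {"c1"}, "c3": set(),
    "c4": {"c2"}, "c5": set(), "bfc": {"c4", "c5"},
}

def PC(c):
    """Previous commits of c."""
    return set(previous[c])

def DC(c):
    """Commits reached by recursively following previous commits beyond PC(c)."""
    seen = set()
    stack = list(PC(c))
    while stack:
        pc = stack.pop()
        for older in previous[pc]:
            if older not in seen:
                seen.add(older)
                stack.append(older)
    return seen

def AC(c):
    """All commits recorded before c in the linear history."""
    return set(order[:order.index(c)])

pc_set, dc_set, ac_set = PC("bfc"), DC("bfc"), AC("bfc")
```

In this toy history, commit c3 is an ancestor of the bug-fixing change but is neither a previous nor a descendant commit, which shows why the inclusion above is not an equality in general.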

5.2 A Process to Identify when and How a Bug was Introduced

This subsection describes the process used by our proposed model (Section 4) to determine when and how a bug was introduced. This process can be generalized and allows us to demonstrate how existing SZZ-based algorithms can be evaluated, which is something missing in the current literature.

This process consists of the following steps, which can be adopted by other researchers as well.

Ensure that Version Control Exists

The first step is to ensure that the selected project has a development history recorded in an SCM. Furthermore, to identify every change in the code from the beginning of the project until the bug-fixing change, we need to ensure that the SCM of the selected project holds the complete history of the project.

Identify the Bug-Fixing Change (BFC)

The second step is to identify the bug-fixing change linked to a bug report. To that end, researchers should analyze only issues labeled (manually or by developers) as bug reports.

When analyzing a bug fix, it is important to consider that a BFC may fix different bugs, and that fixing a bug might require multiple partial fixes (commits). Furthermore, a BFC can modify other parts of the source code that are not related to the bug, e.g., removing dead code or refactoring the source code (Rodríguez-Pérez et al. 2018a; Neto et al. 2018). Thus, when those cases exist, researchers should only analyze the part of the BFC's source code that fixed the bug under analysis.

Ensure the Perfect Fixing

The third step is to ensure that the perfect fixing exists. A BFC might be incomplete and spread over several commits. In such cases, there is no perfect fixing. However, researchers need to be sure of this fact when analyzing the origin of bugs and they have to identify whether a bug report was reopened or not. In the affirmative case, researchers should consider the last BFC.

Describe Whether a Bug is Present

The fourth step is to describe whether a bug was present in a certain snapshot or not. For that, researchers can use all the information available in the SCM, in the ITS, in the code review system and/or in the testing system to build the perfect test signaling a bug, as explained in Section 4.3.

Thus, in order to describe whether a certain snapshot contains the bug fixed in the bug-fixing change, researchers need to run the perfect test from the bug-fixing snapshot backward. If the test passes, the snapshot does not contain the bug but, if the test fails, the snapshot contains the bug.

Identify the First-Failing Change

The last step is to identify the first-failing change given a bug-fixing change and decide whether it is the bug-introducing change or not. To find the first-failing change, we assume linear history and need to identify the first snapshot in the continuous sequence of test failing snapshots, which finishes right before the bug-fixing change. That is, there is a continuous sequence of snapshots for which the test fails, starting in the possible first-failing change, and finishing right before the bug-fixing change. Since the test is failing –all the way– from this snapshot up to the fix, we can say that this is the first snapshot “with the bug present”, thereby we have identified the first-failing change. Furthermore, if this change introduced the bug, it is the bug-introducing change.

We use the example in Fig. 9 to illustrate how researchers can distinguish both scenarios. Figure 9 shows the timeline of Project A (PA) represented by its snapshots from the bug-fixing change backward, and the timeline of an external library (ExtL) used in PA. The following scenarios are possible when analyzing the first snapshot in the continuous sequence of test failing snapshots:

The bug is intrinsic. The LC(commit) introduced the bug because the lines were faulty. For example, Fig. 9 shows how line 2 added in the previous commit of bug-fixing change inserted the bug. This line uses an external library (numpy) in a wrong way causing the bug to appear and manifest itself for the first time in the bug-introducing change. In this case^{Footnote 11}, the documentation of numpy clearly describes that by default “arange” infers the data type from the input, thereby the line uses numpy in a wrong way causing the bug. This snapshot is the bug-introducing change.
The bug is extrinsic. The LC(commit) did not introduce the bug. For example, Fig. 10 shows how line 3, inserted in a previous commit of the bug-fixing change, did not insert the bug because this line uses ExtL, which contained a bug. In this case,^{Footnote 12} the method array.split() behaves incorrectly when the array size is bigger than MAX_INT32. This snapshot is not the bug-introducing change, but the first-failing change.

6 Operationalizing the Process

This section details how we operationalized the process described in Section 5.2. This operationalization is essential to identify the origin of bugs in real open source projects because the model (Section 5) is based on five idealized assumptions (Section 4).