1 Introduction

Software reusability is the design principle that allows developers to reuse parts of existing software to implement new features (Bieman and Zhao 1995; Soundarajan and Fridella 1998). This practice is widely recognized as one of the key assets of software development, as it brings multiple benefits, such as reduced evolution time, effort, and cost, as well as a lower risk of source code being affected by defects (Singh et al. 2010; Lange and Moher 1989; Sharma et al. 2009).

When it comes to Object-Oriented programming languages, many software reuse mechanisms have been provided over time. Design patterns (De Lucia et al. 2009; Gamma et al. 1993), third-party libraries (Zaimi et al. 2015; Salza et al. 2020), and programming abstractions (Sommerville 2011) are examples of these mechanisms. Focusing on Java, two very well-known types of programming abstractions are provided to developers: inheritance and delegation (Arnold et al. 2005). The former allows a class to take the properties and attributes of another class, establishing a hierarchical relation between them. The latter occurs when a class relies on an instance of another class to carry out operations, without performing any other type of action.

The importance of these mechanisms has been remarked several times by researchers. In the early 90s, Chidamber and Kemerer (1994) included the Depth of Inheritance Tree (DIT), i.e., a metric that measures the length of the longest path from a class to the root of its inheritance hierarchy, in their Object-Oriented metrics suite. Later on, researchers suggested more ways to measure different aspects of inheritance (Breesam 2007; Mal and Rajnish 2013; Rajnish and Bhattacherjee 2008) and delegation (Cherkaoui et al. 1998; Munro 2005; VanHilst and Fernandez 2007), along with good and bad practices on how to use reusability mechanisms (Haefliger et al. 2008; Jalender et al. 2012; Mantyla et al. 2003; Palomba et al. 2014). From the empirical standpoint, a noticeable number of investigations targeted the role of inheritance and delegation in keeping source code quality under control. For instance, researchers have been studying the relationship between these mechanisms and Object-Oriented metrics (Chhikara et al. 2011; Chawla and Nath 2013; Abreu and Melo 1996), design patterns (Ampatzoglou et al. 2015; Huston 2001), code complexity (Albalooshi and Mahmood 2014), and source code maintainability (Daly et al. 1996; Giordano et al. 2022; Prechelt et al. 2003). Perhaps more interestingly, inheritance and delegation metrics have often been employed for building software maintenance predictive models. The key example is defect prediction (Hall et al. 2011; Hosseini et al. 2017), where researchers assessed how reusability mechanisms might contribute to the prediction of future source code defects (Basili et al. 1996; Singh et al. 2010; Yu et al. 2002; Di Nucci et al. 2017; Palomba et al. 2017). Similarly, the contribution of inheritance and delegation has been experimented with for predicting maintenance effort change (Catolino et al. 2020; Nagappan and Ball 2005), code smells (Arcelli Fontana et al. 2016; Di Nucci et al. 2018), software vulnerabilities (Shin et al. 2010), and infrastructure-as-code quality (Dalla Palma et al. 2021).

Despite the availability of a large body of knowledge on how inheritance and delegation mechanisms contribute to the prediction of source code attributes, most of the prediction models defined so far made a strong assumption: developers make use of reusability principles while evolving source code.

This assumption raises two issues. First, the extent to which these mechanisms are used in practice might notably impact their contribution to prediction models. Second, it is unclear how the relationship between reusability and source code attributes varies over time and, therefore, whether inheritance and delegation mechanisms should still be considered for prediction purposes as the system evolves.

In this paper, we propose an empirical investigation to address the limitations of current research concerning the adoption of reusability practices and their evolutionary effects on two specific source code attributes, namely defect proneness and maintenance effort. We select these attributes as they represent two interesting use cases to assess reusability mechanisms. On the one hand, these mechanisms are indeed supposed to reduce fault proneness and maintenance effort (Singh et al. 2010; Lange and Moher 1989; Sharma et al. 2009). On the other hand, several prediction models targeted the early location of defects and estimation of the effort required to perform evolutionary tasks (Catolino et al. 2020; Pascarella et al. 2019; Nagappan and Ball 2005).

Our study focuses on Java projects, as Java (1) offers mechanisms that encourage the use of inheritance and delegation (Craig 2007; Tempero et al. 2013) and (2) is still among the most popular programming languages used in industry. To conduct our experiment, we first mined the Defects4J dataset to extract commit-level information on the adoption of reusability mechanisms. Then, we developed statistical models to assess the contribution of reusability mechanisms to defect proneness, as indicated by the number of defects over time, and to maintenance effort, as indicated by the code churn of commits. The main results report on the inheritance and delegation usage patterns of the 12 projects considered, highlighting that (1) developers tend to frequently use these mechanisms and (2) their adoption varies over time in a significant manner. Furthermore, we identify a statistical relation, corroborated by a fine-grained qualitative investigation, between the adoption of inheritance and delegation and both defect-proneness and maintenance effort, hence concluding that software reuse is a relevant component that affects the way source code quality evolves.

This paper extends our registered report accepted at the 38th IEEE International Conference on Software Maintenance and Evolution (Giordano et al. 2022). While in our previous work we defined the research goals of the study and the envisioned data collection and analysis methods, this submission analyzes the results achieved and discusses the implications, lessons learned, and actionable items that our work has for researchers and practitioners.

Structure of the Paper

Section 2 overviews the research literature connected to our work, pointing out the main differences that let our investigation advance the state of the art. Section 3 defines the study’s research questions, as well as the research method applied to address them. In Section 4, we discuss the study’s results, while in Section 5, we report on the implications that our findings have for researchers and practitioners. The main limitations of the study and the way we mitigated them are discussed in Section 6. Finally, Section 7 provides some final remarks.

2 Background and Related Work

In this section, we first provide background information on the two most widely used mechanisms for reusing code in Object-Oriented programming languages: inheritance and delegation. Then, we survey the related literature targeting code reusability and its impact on source code.

2.1 Background: Inheritance and Delegation Mechanisms in Java

Our study focuses on Java and, for this reason, we describe the way inheritance and delegation mechanisms can be employed in this programming language. In particular, in Java there are two forms through which it is possible to define a hierarchical dependency between two classes:

‘extends’. Given two classes A and B, A is defined as the super-class of B if B inherits variables or methods from A. In Java, to establish this super-class – sub-class relation, the sub-class must declare it through the keyword “extends”.

‘implements’. Given a class B and an interface A, we say that B inherits from A if B implements the interface A. In Java, this mechanism is provided through the keyword “implements”. In particular, when a class B inherits from an interface, it must provide a concrete implementation of the methods that the interface defines as a blueprint.
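The two forms can be sketched in a few lines of Java. The names `Named`, `Shape`, and `Circle` are hypothetical and serve only as an illustration; they do not come from the projects analyzed in this study.

```java
// Hypothetical types, used only to illustrate the two keywords.
interface Named {                     // the interface acts as a blueprint
    String name();
}

class Shape {                         // super-class
    protected double scale = 1.0;     // variable inherited by sub-classes

    double scaled(double value) {     // method inherited by sub-classes
        return value * scale;
    }
}

// Circle is a sub-class of Shape ("extends") and inherits from Named ("implements").
class Circle extends Shape implements Named {
    private final double radius;

    Circle(double radius) {
        this.radius = radius;
    }

    @Override
    public String name() {            // concrete implementation required by Named
        return "circle";
    }

    double diameter() {
        return scaled(2 * radius);    // reuses the method inherited from Shape
    }
}
```

A `Circle` instance can be assigned to a variable of type `Shape` or of type `Named`, which is the hierarchical dependency described above.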

These definitions recall the concept of reusability in terms of specification inheritance, implementation inheritance, and delegation (Bruegge and Dutoit 2009). From a practical point of view, the first one refers to the possibility of replacing an object A with an object B using a combination of two principles:

  • Strict Inheritance. When a sub-class B exposes behavior and properties of super-class A without making any changes (Bruegge and Dutoit 2009).

  • The Liskov Substitution Principle. According to Liskov and Wing (1994), given two classes A and B, B is a sub-class of A if it is possible to substitute the object A with the object B every time that the object A is expected.

Implementation inheritance occurs when a class indirectly reuses the source code of a super-class. The sub-class can wholly or partially override methods and/or properties and replace the super-class’s original behavior with its own. However, implementation inheritance violates, by definition, the encapsulation principle, because a sub-class could accidentally invoke methods or use some properties of the super-class in a wrong manner (Bruegge and Dutoit 2009). To avoid this, it is possible in some cases to replace implementation inheritance with delegation. With this mechanism, a class B does not inherit anything from another class A; instead, B declares a variable of type A and invokes the methods of A directly through it.
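The contrast between the two mechanisms can be illustrated with a minimal sketch. The stack classes below are hypothetical examples, not code taken from the studied systems: the first reuses `ArrayList` through implementation inheritance, the second through delegation.

```java
import java.util.ArrayList;
import java.util.List;

// Implementation inheritance: the whole ArrayList API becomes part of the stack,
// so a client can bypass the stack discipline, e.g., by calling add(0, x).
class InheritedStack extends ArrayList<Integer> {
    void push(int value) {
        add(value);
    }

    int pop() {
        return remove(size() - 1);    // remove(int index) inherited from ArrayList
    }
}

// Delegation: the stack declares a variable of type List and forwards calls to it.
// Only push, pop, and size are visible to clients.
class DelegatingStack {
    private final List<Integer> items = new ArrayList<>();

    void push(int value) {
        items.add(value);
    }

    int pop() {
        return items.remove(items.size() - 1);
    }

    int size() {
        return items.size();
    }
}
```

With `InheritedStack`, nothing prevents a caller from inserting an element at the bottom of the stack; with `DelegatingStack`, the same call does not compile because the list is encapsulated.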

2.2 Related Work: The Impact of Inheritance and Delegation Mechanisms on Source Code Quality

Source code reusability has been the subject of several research efforts in the last decades. These touched various angles of the problem, by introducing novel metrics to capture inheritance relations (Chidamber and Kemerer 1994; Breesam 2007; Mal and Rajnish 2013; Rajnish and Bhattacherjee 2008) and delegation (Cherkaoui et al. 1998; Munro 2005; VanHilst and Fernandez 2007), defining best design practices to exploit the benefits of reusability (Haefliger et al. 2008; Jalender et al. 2012), or identifying a number of source code quality issues that reusability can cause, e.g., code smells (Mantyla et al. 2003; Palomba et al. 2014; Fowler 2018). While the scope of our work targets inheritance and delegation mechanisms, it is worth mentioning the existence of close research areas such as the analysis of design patterns (Fontana et al. 2013; Zhang and Budgen 2013) and third-party libraries (Zhan et al. 2021). These are additional perspectives that we plan to investigate as part of our future research agenda, but that we leave out of the scope of this paper.

Reusability and Code Quality

As for the themes of our study, Albalooshi and Mahmood (2014) conducted an empirical analysis on the implementation inheritance by considering three programming languages like C++, Python, and Java. As a result, the authors found that the mechanisms of Java to define inheritance tend to degrade source code quality. Goel and Bhatia (2013) obtained similar results by analyzing the impact of multilevel inheritance on reusability considering three C++ projects. They found a negative correlation between the use of inheritance and the quality of source code in terms of maintainability. Other research efforts targeted the effect of inheritance and delegation on various aspects of source code quality. Chhikara et al. (2011) conducted a case study on one small-scale software project, reporting on the correlation between inheritance metrics and other metrics belonging to the Chidamber and Kemerer suite. Chawla and Nath (2013) took a closer look at how inheritance and delegation metrics may impact software coupling, concluding that these metrics can be useful to assess code quality. Similar findings were reported by Abreu and Melo (1996). Additional experiments were conducted to assess the relation between reusability and design patterns (Ampatzoglou et al. 2015; Huston 2001) and code complexity (Albalooshi and Mahmood 2014): all these studies converged toward the relevance of inheritance and delegation. More recently, we carried out a study to investigate the evolution of inheritance and delegation and their impact on the severity of code smells (Giordano et al. 2022). The results revealed that inheritance and delegation tend to increase over time, but not in a statistically significant manner. However, increasing the adoption of these mechanisms tends to decrease code smells’ severity.

The potential benefits of reusability have led researchers to use inheritance and delegation metrics within prediction models. In this respect, most of the defect prediction models include reusability as a feature (Hall et al. 2011). Perhaps more importantly, these metrics have sometimes been shown to significantly contribute to the predictions of those models: for instance, Jureczko and Madeyski (2010) showed that the Depth of Inheritance Tree metric is among the best predictors of source code defectiveness. These results were later confirmed by other software maintenance and evolution researchers (Singh and Chug 2017; Jureczko and Spinellis 2010).

Reusability and Maintenance Effort

From an empirical side, Prechelt et al. (2003) carried out two experiments to investigate the relation between inheritance metrics and maintenance effort estimation. Their results revealed that maintaining a low level of inheritance depth positively impacts the (decrease of) developer’s effort to maintain source code. Similarly, Daly et al. (1996) showed that as the inheritance depth level increases, so does the effort of developers to maintain code.

In terms of maintenance effort estimation, researchers have been mainly looking at process-level information (e.g., team data and measurements of the development activities), attempting to provide indications in terms of direct and indirect estimations of entire projects under maintenance (Wu et al. 2016). Besides that, researchers have also been working on effort prediction of maintenance activities, which revolves around the prediction of the effort spent in performing specific activities such as code review (Mishra and Sureka 2014) and bug fixing time (Anbalagan and Vouk 2009; Bougie et al. 2010). The contribution provided by reusability metrics to those models is, however, unclear. Recently, Nagappan and Ball (2005) and Liu et al. (2017) proposed the use of code churn, i.e., the number of lines of code modified within commits, as an alternative metric of maintenance effort which better aligns with the actual effort spent by developers while performing evolutionary tasks.

Our work

With respect to the papers discussed above, ours has multiple differences. First, most of the previous work analyzed reusability by relying on the computation of metrics, e.g., Depth of Inheritance Tree (DIT); as further elaborated in Section 3, we operationalize reusability by means of specification inheritance, implementation inheritance, and delegation, being able to better map the adoption of reuse mechanisms over time. Second, we conduct a fine-grained analysis where the evolution and impact of reusability are investigated at commit-level. Furthermore, we address a key limitation of most previous works proposing prediction models: the contribution of code reuse to their capabilities indeed assumes that developers make use of reusability mechanisms. As such, our study provides more detailed insights into the potential benefits brought by inheritance and delegation to state-of-the-art prediction models.

3 Research Questions and Methods

The goal of the study was to (1) investigate the adoption of reusability mechanisms over time and (2) assess their impact on defect-proneness and maintenance effort. The purpose was to understand whether those mechanisms can provide developers with an indication of source code quality variation—considering the defect-proneness and effort to fix faults of a project. The quality focus was on the reusability in terms of implementation inheritance, specification inheritance, and delegation and their evolution within software projects. The perspective was that of practitioners and researchers: the former are interested in understanding whether the reusability mechanisms can be suitable for monitoring the quality of a system, while the latter are interested in improving their knowledge on how inheritance and delegation mechanisms can vary over time and impact source code quality. The context of our investigation was composed of publicly available Java projects, as detailed in Section 3.1.

Based on the goal of our study, we formulated three main research questions. The first aimed at understanding the use of source code reusability mechanisms by developers during software evolution. Specifically, we asked:

[Figure a: statement of RQ\(_1\)]

The goal of RQ\(_1\) was that of providing insights on the evolution of reuse mechanisms that might later be exploited to better interpret the findings of RQ\(_2\) and RQ\(_3\). In other terms, the patterns observed in the context of this research question will also be useful to understand the effects of inheritance and delegation on defect-proneness and maintenance effort, e.g., should we identify an exponential growth in the adoption of delegation, this would potentially make this mechanism more relevant for software evolution, hence influencing more the amount of effort required to apply modifications.

Since we analyzed three mechanisms for reusability, i.e., specification inheritance, implementation inheritance, and delegation (Bruegge and Dutoit 2009), that can impact software evolution differently, we considered three sub-research questions:

RQ\(_{1.1}\). How does the use of implementation inheritance vary during software evolution?

RQ\(_{1.2}\). How does the use of the specification inheritance vary during software evolution?

RQ\(_{1.3}\). How does the use of delegation vary during software evolution?

Once the evolution of reusability mechanisms was analyzed, we investigated how the evolution might affect code quality, initially measuring it in terms of fault-proneness. Hence, we asked our second research question:

[Figure b: statement of RQ\(_2\)]

Finally, we assessed the impact of reusability mechanisms on the maintenance effort required to fix faults. Among the various direct and indirect metrics available in literature (Wu et al. 2016), we operationalize maintenance effort through code churn, that is, the amount of lines of code modified within a commit. This is an indirect metric that can proxy the actual effort spent by developers when maintaining source code (McIntosh et al. 2011; Munson and Elbaum 1998; Wu et al. 2016). In particular, we asked:

[Figure c: statement of RQ\(_3\)]

Figure 1 overviews the research process applied to address our research questions. After a first phase of data extraction, where we collected data about inheritance, delegation, and other code quality indicators, we integrated the various pieces of information for further analysis. In this way, the research questions were addressed by employing statistical tests and models (see details in Section 3.3). To design and report the empirical study, we followed the guidelines proposed by Wohlin et al. (2012) and the ACM/SIGSOFT Empirical Standards. We made all the experimental materials (e.g., datasets, scripts) publicly available in an online appendix (Giordano et al. 2022).

Fig. 1 Overview of the research process applied in the study

3.1 Context of the Study

The context of the study was composed of Java projects available within the Defects4J dataset, which collects information on over 800 real bugs of open-source systems. According to the official documentation, each bug collected into the dataset is characterized by the following properties:

  1. It is reported in the issue tracker of the project, has an associated commit message for resolution, and it is fixed in a single commit, i.e., the defect resolution never refers to more than one commit;

  2. It is associated with a triggering test case that allows its reproduction;

  3. It is minimized, meaning that the Defects4J maintainers manually removed commits that would have induced noise, namely commits that did not actually provide information about the introduction of defects or fixing activity (e.g., commits where refactoring activities were done);

  4. The fixing activities modified the source code. This means that the defect introduction can be caused by several factors, e.g., wrong parameters in configuration files and problems in the production class. However, the corresponding fixing only concerns changes within the source code.

By design, the dataset does not include all the defects reported in the issue trackers of the considered projects, but only those matching the inclusion criteria reported above. In this respect, there are some considerations to make. First, these criteria led to the definition of a set of defects having two key properties: (1) All the defects were true positives, verifiable, and traceable, meaning that there exists at least one test case letting the defective behavior of the code emerge, other than precise indications on the inducing-fix commit pairs reported by the developers, which were instrumental for our analysis, as further discussed in the following sections; (2) The dataset avoided, by design, possible bias due to the presence of uncontrolled conditions, e.g., tangled changes (Herzig et al. 2016), that might have notably affected the validity of the conclusions reported by our study, e.g., refactoring actions targeting inheritance and delegation which were not related to defect fixing operations.

As a consequence of these two properties, the choice of Defects4J enabled us to investigate the impact of reuse mechanisms in a noise-free environment, in which we could provide more precise insights into the actual role played by inheritance and delegation. In any case, we are aware that the dataset contains a subset of the defects included in the issue trackers of the considered projects and that the missing analysis of some defects might potentially bias our conclusions. In response to this potential threat to validity, we (i) further analyzed the anatomy of the dataset to better characterize our sample, as discussed in the remainder of this section; and (ii) conducted additional analyses aimed at assessing the types of defects that were not included in our analysis, as reported in Section 6.

In addition to the discussion on the use of Defects4J, it is worth remarking that, despite being carefully selected, the defects are of different types and natures, hence representing the variety of defects affecting real-world software systems (Sobreira et al. 2018). Last but not least, Defects4J has been widely used in literature (e.g., Martinez et al. 2017; Durieux et al. 2015), hence representing a valuable asset that enables us to build additional knowledge on a state-of-the-art dataset. This would also be useful for other researchers interested in building on top of our work.

Table 1 Characteristics of the projects considered in the study

As mentioned in Section 2, little has been done to analyze code reuse mechanisms over time and how those may contribute to explaining fault-proneness and maintenance effort during software evolution. For this reason, our analysis focused on code reuse mechanisms from a fine-grained perspective, i.e., at the level of commits. We analyzed over 44,900 commits. With respect to our initial plan (Giordano et al. 2022), we had to discard five of the systems available in the dataset. This was mainly due to repository inconsistencies caused by developers’ removal of defective commits. Table 1 reports statistics of the projects included in the Defects4J dataset. For each project, the table provides (i) the number of defects; (ii) process metrics such as number of commits, number of pull requests, and number of contributors; (iii) its minimum and maximum LOC; and (iv) whether the project could be analyzed. More particularly, we exploited the latest version of Defects4J (v2.0.0). The defects contained in this version were identified by the original authors using Java 1.8, which is the Java version used by all the projects considered in the study. The reliance on Java 1.8 had some implications on the number of defects reported in the dataset. More particularly, some behavioral changes introduced under Java 8 made 29 of the defects reported in previous versions of Defects4J no longer verifiable. As such, these 29 defects were considered deprecated and no longer relevant in Defects4J 2.0.0. In the light of this consideration, we excluded them from our study. These defects indeed violated the first property mentioned above: on the one hand, they were not verifiable; on the other hand, they were not necessarily true positives, as they were re-labeled by the original authors as non-defective when verifying them through the most appropriate Java version, namely the one employed within the corresponding systems.

3.2 Data Extraction Procedure

To answer our research questions, we quantified the reusability mechanisms employed within the considered software projects. To this aim, we operationalized three metrics capturing the reusability mechanisms, namely implementation inheritance, specification inheritance, and delegation. We did not rely on existing metrics, like the Depth of Inheritance Tree (DIT) or the Number of Children (NoC) (Chidamber and Kemerer 1994), since we aimed at computing metrics that could directly express the adoption of reusability mechanisms. Indeed, our metrics have a finer granularity and can indicate the exact constructs added by developers during a change/commit, e.g., the inclusion of a new method that delegates its operations or a change in the inheritance structure. This would not be possible using existing metrics, as they just provide the result of the actions done by developers, e.g., the increase of the depth of inheritance tree, without indications of how that was obtained. To compute the implementation inheritance, specification inheritance, and delegation metrics, we used a tool already validated in our previous work (Giordano et al. 2022). It was originally developed by the first author of this paper and computes the metrics following these patterns:

Specification Inheritance. Given a class B, the tool considers the specification inheritance as the arithmetical sum of each interface used by B. For instance, suppose that B inherits methods from two interfaces A and C, and C in turn inherits methods from another interface D. In this case, the specification inheritance for B is 3.

Implementation Inheritance. Suppose that B is a sub-class of A; the tool considers the implementation inheritance as the arithmetical sum of each method in A called by some method in B. For example, suppose that B is a class with N methods, and A a class with just one method called bar(). To increase the implementation inheritance count by one, one of the methods in B must invoke bar().

Delegation. Given a class A, the tool considers the delegation metric as the arithmetical sum of each non-primitive variable (i.e., variables different from int, double, String, and so on) or variables that do not have a binding type provided by external libraries (e.g., JCheckBox offered by the javax.swing framework). For each variable, the tool verifies if it is only used to invoke external objects.
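As a worked illustration of the specification inheritance example above (B inherits from interfaces A and C, and C inherits from D), the count of 3 can be reproduced with a small sketch. Note the assumptions: this sketch uses Java reflection on compiled classes rather than the static source analysis performed by the tool, it ignores interfaces reached through super-classes, and the class name `SpecificationInheritance` is hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// The hierarchy from the example: B uses A and C, and C in turn extends D.
interface D { }
interface A { }
interface C extends D { }
class B implements A, C { }

final class SpecificationInheritance {
    // Counts every distinct interface reachable from the given type.
    static int count(Class<?> type) {
        Set<Class<?>> found = new HashSet<>();
        collect(type, found);
        return found.size();
    }

    // Class.getInterfaces() returns only the directly declared interfaces,
    // so the traversal recurses to reach inherited ones (here, D through C).
    private static void collect(Class<?> type, Set<Class<?>> found) {
        for (Class<?> iface : type.getInterfaces()) {
            if (found.add(iface)) {
                collect(iface, found);
            }
        }
    }
}
```

Under these assumptions, `SpecificationInheritance.count(B.class)` evaluates to 3 (A, C, and D), matching the example in the definition.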

The metrics were computed over all the commits of the considered systems and were used to address RQ\(_1\). Specifically, for each commit, we computed the sum of (i) specification and implementation inheritance uses and (ii) delegation uses by statically analyzing the files involved in the commit. As for RQ\(_2\) and RQ\(_3\), we collected information on defects and code churn. To this aim, we mainly relied on the information made available by the Defects4J dataset. In particular, for each project of the dataset, Defects4J assigns a unique ID to each defect and stores an inducing-fixing commit pair, i.e., a pair of commits reporting when the defect was introduced and fixed, respectively, over the history of the project. Starting from these inducing-fixing commit pairs, we could reconstruct the defect history of each project by overlaying them on the full set of commits of the project and considering as defective all the commits between the inducing and the fixing commit of each pair. The code churn was collected with PyDriller, a Python framework that can analyze Git repositories to extract information about commits, developers, modifications, diffs, and source code. In our case, we ran PyDriller over the commits of the considered systems and extracted the number of modifications performed by developers, i.e., the code churn.
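The overlay of inducing-fixing pairs on the commit history can be sketched as follows. This is a minimal illustration, not the scripts used in the study: commits are identified by their position in the chronologically ordered history, the `Pair` type is hypothetical, and we assume that the inducing commit counts as defective while the fixing commit does not.

```java
import java.util.List;

final class DefectHistory {
    // Hypothetical representation of one Defects4J entry: positions of the
    // inducing and fixing commits in the ordered commit history.
    static final class Pair {
        final int inducing;
        final int fixing;

        Pair(int inducing, int fixing) {
            this.inducing = inducing;
            this.fixing = fixing;
        }
    }

    // Marks as defective every commit from the inducing commit (inclusive)
    // up to the fixing commit (exclusive); this boundary choice is an assumption.
    static boolean[] defectiveCommits(int commitCount, List<Pair> pairs) {
        boolean[] defective = new boolean[commitCount];
        for (Pair pair : pairs) {
            for (int i = pair.inducing; i < pair.fixing; i++) {
                defective[i] = true;
            }
        }
        return defective;
    }
}
```

For a history of six commits with pairs (1, 3) and (2, 5), commits 1 to 4 are marked as defective, while commits 0 and 5 are not.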

The data extraction process described above was curated by the first two authors of the paper. More specifically, the first author was involved in the mining of the change history of the projects, while the second author had the responsibility to write the scripts for mining Defects4J.

3.3 Data Analysis Procedure

The collected data were further analyzed as follows:

  1. RQ\(_1\) - Analysis of the evolution of reusability mechanisms over time. To address this research question we analyzed how reusability metrics (implementation inheritance, specification inheritance, and delegation) vary over the evolution of the software systems considered. In particular, we employed basic statistical analysis and visualized results using plots.

  2. RQ\(_2\) - Analysis of the impact on defect-proneness of reusability mechanisms over time. In this respect, we built a statistical model to verify how reusability metrics impact the variability of defects in the source code.

  3. RQ\(_3\) - Analysis of the impact on maintenance effort of reusability mechanisms over time. Similarly to RQ\(_2\), we built a statistical model to verify how reusability metrics impact the maintenance effort to fix a bug.

Specifically, the statistical models were devised as reported in the following.

Independent Variables. According to our previous considerations, we used the reusability metrics, i.e., implementation inheritance, specification inheritance, and delegation, as independent variables.

Response Variable. In the context of RQ\(_2\) we were interested in understanding how the reusability metrics impact the defect-proneness of software systems over time. Starting from the defect history built by exploiting Defects4J, we modeled our response variable as follows. Let \(C_i\) be a generic commit of the change history of the project P. The number of defects affecting P at the time of \(C_i\) was computed through the \(\#defects(C_i)\) function, which relies on the following system of equations:

$$\begin{aligned} \left\{ \begin{array}{ll} \#defects(C_i) = \#defects(D4J_{C_i}) -\#fixedDefects(D4J_{C_i}),\;\text {if}\; i=1; \\ \#defects(C_i)\! =\! \#defects(C_{i-1}) \!+\! (\#defects(D4J_{C_i}) \!-\!\#fixedDefects(D4J_{C_i})),\;\text {if}\; i\!>\!1; \end{array}\right. \end{aligned}$$
(1)

where \(\#defects(D4J_{C_i})\) indicates the number of defects in Defects4J having as inducing commit \(C_i\), \(\#fixedDefects(D4J_{C_i})\) indicates the number of defects fixed in the commit \(C_i\), computed as the amount of defects fixed according to Defects4J in \(C_i\), and \(\#defects(C_{i-1})\) indicates the number of defects affecting P at commit \(C_{i-1}\). As shown, we had to distinguish the case of the first commit (i=1) from the rest (i>1). When considering the first commit, there cannot indeed be previous fixing operations that influenced the number of defects and, as such, the number of defects at the first commit is only due to the difference between the number of defects pointed out by Defects4J and the number of defects fixed in the same commit. When considering the other commits, instead, the number of defects at the time of the generic commit \(C_i\) is given by the total number of defects at time \(C_{i-1}\) plus the operations performed within \(C_i\), both in terms of defects introduced and fixed. After computing the number of defects affecting the considered systems at each commit, we analyzed how this number varied over time.
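Equation (1) amounts to a running sum over the commit history, which the following sketch makes explicit. The array-based inputs and the name `DefectCounter` are assumptions made for illustration: `introduced[i]` stands for \(\#defects(D4J_{C_i})\) and `fixed[i]` for \(\#fixedDefects(D4J_{C_i})\), with index 0 corresponding to the first commit.

```java
final class DefectCounter {
    // Returns, for each commit, the number of defects affecting the project
    // at that commit, following the recurrence of Eq. (1).
    static int[] defectsPerCommit(int[] introduced, int[] fixed) {
        if (introduced.length != fixed.length) {
            throw new IllegalArgumentException("inputs must cover the same commits");
        }
        int[] defects = new int[introduced.length];
        for (int i = 0; i < introduced.length; i++) {
            int delta = introduced[i] - fixed[i];
            // First commit: only the local difference; later commits: previous total plus it.
            defects[i] = (i == 0) ? delta : defects[i - 1] + delta;
        }
        return defects;
    }
}
```

For instance, with `introduced = {1, 0, 2, 0, 0}` and `fixed = {0, 0, 0, 1, 2}`, the resulting series is `{1, 1, 3, 2, 0}`.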

Let \(C_i\) and \(C_{i+1}\) be two subsequent commits of the change history of the project P; we labeled the commit pair \((C_i, C_{i+1})\) as stable, increased, or decreased using the \(label(C_i, C_{i+1})\) function described in the following:

$$\begin{aligned} label(C_i, C_{i+1}) = \left\{ \begin{array}{ll} `Stable'\;\;\;\;\;\;\;\;\;\text {if}\; \#defects(C_i) = \#defects(C_{i+1});\\ `Increased'\;\;\;\text {if}\;\#defects(C_i) < \#defects(C_{i+1});\\ `Decreased'\;\;\text {if}\;\#defects(C_i) > \#defects(C_{i+1}). \end{array}\right. \end{aligned}$$
(2)

In other terms, we exploited the information previously collected on the number of defects at each commit of the change history of the project P to describe how the amount of defects varied over time.
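The labeling in Eq. 2 can be sketched as follows. This is an illustrative sketch only: the function names and the sample defect counts are assumptions, not part of the original scripts.

```python
def label(defects_before, defects_after):
    """Label a pair of subsequent commits according to Eq. 2."""
    if defects_before < defects_after:
        return "Increased"
    if defects_before > defects_after:
        return "Decreased"
    return "Stable"


def label_history(defect_counts):
    """Label every pair of subsequent commits (C_i, C_{i+1}) in a history."""
    return [label(a, b) for a, b in zip(defect_counts, defect_counts[1:])]


# Made-up defect counts per commit, for illustration only.
print(label_history([2, 2, 0, 3]))  # ['Stable', 'Decreased', 'Increased']
```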

In RQ\(_3\), instead, we were interested in assessing the effect of reusability metrics on the effort required to fix defects, as measured by code churn. Starting from the defect history of each project, we considered, as relevant for the research question, the commits marked as fixing commits. Afterwards, we computed our response variable as the sum of the code churn of the files involved in those commits.
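A minimal sketch of this response variable is given below. It assumes that the code churn of a file is the sum of the lines added and deleted in a commit, which is a common operationalization but an assumption here; the commit identifiers, file names, and line counts are made up.

```python
def commit_churn(file_changes):
    """Sum added plus deleted lines over the files touched by one commit.

    file_changes maps a file path to an (added, deleted) pair of line counts.
    """
    return sum(added + deleted for added, deleted in file_changes.values())


# Hypothetical defect-fixing commits, for illustration only.
fixing_commits = {
    "a1b2c3": {"src/Parser.java": (10, 4), "src/Lexer.java": (3, 0)},
    "d4e5f6": {"src/Option.java": (1, 1)},
}

churn_per_fix = {sha: commit_churn(files) for sha, files in fixing_commits.items()}
print(churn_per_fix)  # {'a1b2c3': 17, 'd4e5f6': 2}
```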

Control Variables. We computed a number of control variables. This step was required because the impact on the response variables of the statistical models might be due to various additional factors other than the independent variables. As such, we first computed the Chidamber and Kemerer (CK) metrics (Chidamber and Kemerer 1994), namely DIT (Depth of Inheritance Tree), NOC (Number Of Children), LOC (Lines of Code), LCOM (Lack of Cohesion of Methods), WMC (Weighted Methods per Class), RFC (Response for a Class), and CBO (Coupling Between Objects).

In RQ\(_2\), we also considered the code churn as control variable as suggested by previous findings in the literature (Nagappan and Ball 2005), i.e., we verified whether the variation of the number of defects was due to the amount of changes performed by developers within commits. This metric was not considered in RQ\(_3\), as it was directly connected to the response variable and could, therefore, bias the conclusions.

With respect to the control variables considered in the study, it is important to discuss the role of NOC and DIT. These two metrics are by definition connected to code reusability and measure indeed two aspects related to how developers reuse existing source code through inheritance. We included them with the intent of comparing their statistical power to the reusability metrics considered as independent variables. In other terms, the inclusion of NOC and DIT allowed us to assess the extent to which the reusability metrics we computed represent relevant factors for the response variables when compared to state-of-the-art metrics.

Before building the statistical models, we assessed the presence of possible multi-collinearity concerns. These arise when two or more variables are excessively correlated, possibly biasing the statistical model and the subsequent interpretation of the results (O’Brien 2007). In this respect, we followed well-established guidelines (Allison 2012; Lieberman and Morris 2014). For each pair of variables, we computed Spearman’s correlation coefficient (Taylor 1990). If the coefficient was higher than 0.7, we removed the variable having the most complex definition to favor explainability; for instance, we preferred keeping the LOC metric rather than WMC to make the interpretation of the results easier. The scripts used to compute the dependent and control variables were developed by the second author of the paper, while the independent variables were computed through the tool originally developed by the first author.
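The pairwise filtering step can be sketched as follows, using only the Python standard library. This is an illustrative sketch under stated assumptions rather than the procedure actually scripted for the study: the `keep_priority` ordering (which encodes which variable is considered simpler and is therefore kept), the use of the absolute value of the coefficient, and the sample data are all assumptions.

```python
from itertools import combinations
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions start..end
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; constant inputs are treated as uncorrelated."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def drop_collinear(columns, keep_priority, threshold=0.7):
    """Drop the less preferred variable of every pair correlated above threshold.

    keep_priority lists variable names from most to least preferred to keep.
    """
    kept = list(keep_priority)
    for a, b in combinations(keep_priority, 2):
        # a always precedes b in keep_priority, so b is the one to drop.
        if a in kept and b in kept and abs(spearman(columns[a], columns[b])) > threshold:
            kept.remove(b)
    return kept


# Made-up metric values for five classes, for illustration only.
columns = {
    "LOC": [10, 20, 30, 40, 50],
    "WMC": [1, 2, 3, 5, 4],
    "DIT": [3, 1, 2, 1, 3],
}
print(drop_collinear(columns, ["LOC", "WMC", "DIT"]))  # ['LOC', 'DIT']
```

In this toy example LOC and WMC have a rank correlation of 0.9, so WMC is dropped and LOC is kept, mirroring the preference described above.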

Choosing the Statistical Model. To address RQ\(_2\) we built a Multinomial Log-Linear Model (Theil 1969). This model generalizes logistic regression to multi-class problems, matching our need to have a model able to handle our response variable composed of three values (“stable”, “increased”, “decreased”). As done in our previous work (Giordano et al. 2022), we used R for running the analysis using the function multinom available in the package nnet.
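For reference, a multinomial log-linear model with “stable” as reference category is commonly written in the form below. The notation is introduced here for illustration only: \(\textbf{x}\) denotes the vector of independent and control variables observed for a commit pair, while \(\beta_{k,0}\) and \(\varvec{\beta}_k\) denote the intercept and the coefficients estimated for the category k.

$$\begin{aligned} \log \frac{P(label = k \mid \textbf{x})}{P(label = `Stable' \mid \textbf{x})} = \beta_{k,0} + \varvec{\beta}_k^{\top }\textbf{x},\;\;\; k \in \{`Increased', `Decreased'\} \end{aligned}$$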

In RQ\(_3\) we had to build a different model because of the nature of the response variable, i.e., code churn. In particular, we built a Generalized Linear Model (Faraway 2016) using the glm function available in R.

The first two authors of the paper were involved in the development of the statistical models. In addition, the interpretation of the results involved all the authors of the paper: these were involved through open discussions and regular meetings with the first two authors.

3.4 Public Availability of Data

To guarantee the replicability of our work and enable other researchers to build on top of our analyses, we made all data and scripts publicly available in our online appendix (Giordano et al. 2022).

Fig. 2 RQ\(_1\). Adoption of reusability mechanisms over time

4 Analysis of the Results

In the following sections, we report and discuss the results addressing the research questions of the empirical study. For the sake of comprehensibility, we split the discussion by RQ.

4.1 RQ \(_1\) - On the Variation of Reusability Mechanisms in Source Code

Figure 2 shows how the three reusability mechanisms considered in our study, i.e., implementation inheritance, specification inheritance, and delegation, evolve over time in the considered software projects. Each row of the figure reports the evolution of the metrics for two projects separately. To facilitate the interpretation of the results and enable a more seamless comparison of evolutionary trends across diverse projects, we normalized the reusability metrics by lines of code—in other terms, the figure shows the amount of implementation inheritance, specification inheritance, and delegation mechanisms applied per line of code over the evolution history of the considered projects. These trends were used to interpret the results and address the specific sub-research questions defined in the context of RQ\(_1\).
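The normalization by lines of code can be sketched as follows. This is an illustrative sketch: the snapshot structure, the metric names used as dictionary keys, and the sample values are hypothetical.

```python
def normalize_by_loc(snapshots):
    """Divide each raw metric count by the lines of code of the same commit.

    snapshots is a list of dicts, one per commit, each holding raw metric
    counts plus a 'loc' entry with the size of the system at that commit.
    """
    normalized = []
    for snap in snapshots:
        loc = snap["loc"]
        normalized.append({
            name: (value / loc if loc else 0.0)  # guard against empty snapshots
            for name, value in snap.items()
            if name != "loc"
        })
    return normalized


# Made-up snapshots of a project at two commits, for illustration only.
snapshots = [
    {"loc": 1000, "impl_inheritance": 20, "spec_inheritance": 10, "delegation": 50},
    {"loc": 2000, "impl_inheritance": 30, "spec_inheritance": 10, "delegation": 80},
]
for row in normalize_by_loc(snapshots):
    print(row)
```

In this toy example the raw amount of implementation inheritance grows from 20 to 30, yet its density per line of code decreases, which is the kind of trend the normalized plots are meant to expose.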

Fig. 3 Increasing - Decreasing Pattern

4.1.1 RQ\(_{1.1}\) - Variation of Implementation Inheritance Over Time

As for the implementation inheritance, the trends in Fig. 2 do not always follow a common tendency among the projects.

Increasing - Decreasing Pattern

As shown in Fig. 3, we discovered an initial increasing trend in adopting implementation inheritance in seven projects, i.e., Closure-Compiler, Commons-Cli, Commons-CSV, GSON, Jackson-Databind, Jackson-Dataformat-XML, and Joda-Time, followed by a decreasing usage.

While the shape of the curves varies from case to case, we can still see a common pattern. When we look more closely at these cases, we can identify a similar behavior among the developers of those systems. In all the cases, the adoption of implementation inheritance quickly increased during the first commits, suggesting that developers approached the design of the systems to take reusability into account. Nonetheless, the trend quickly decreased after a while, leading implementation inheritance to be used less and less over time.

This trend leads us to formulate two observations. Firstly, the decline in adoption following a peak could be indicative of a phenomenon known as “design erosion” in the literature (Van Gurp and Bosch 2002). Regardless of the intentions of developers and designers, software design tends to degrade over time due to ongoing changes and increasing complexity, as highlighted by Lehman’s laws (Lehman 1996). This erosion can also be attributed to inadequate utilization of software quality measures, as emphasized in previous research (Do LNQ et al. 2020; Vassallo et al. 2018, 2020). Our findings seem to suggest implementation inheritance is not exempt from this trend, and its adoption is likely to decrease over time.

In the second place, the “increasing-decreasing” trend might have clear implications on how reuse mechanisms should be considered within prediction approaches, e.g., defect prediction. Indeed, the employment of implementation inheritance should be carefully considered, and perhaps the usage trend might even lead to the definition of novel feature selection procedures that monitor the way developers are using certain programming constructs to inform the model of the most promising features to consider in that evolution moment.

Steady-Increasing Pattern

Looking at Fig. 2, we can identify three less common usage patterns. In particular, two projects, namely Commons-Collections (3rd row) and Commons-JxPath (4th row), appear to exhibit a “steady-increasing” trend. The nature of these projects seems to offer a natural explanation for this trend. The former project provides a framework to use efficient data structures in Java, while the latter implements an interpreter of the XPath expression language. Both projects are structured so that most of the source code relies on a core set of classes. For instance, in the Commons-Collections project, classes within the list package establish the foundation for creating various advanced element lists. This seems to encourage developers to employ reuse mechanisms like implementation inheritance.

Stable Pattern

Two other projects, namely Commons-Codec (1st row) and Jackson-Core (1st row) of Fig. 2, follow mostly a “stable” trend. In both cases, the amount of implementation inheritance uses remains constant throughout the evolution. We analyzed the repositories of those projects deeper to better understand this trend. While we could not identify any specific tool or verification procedure conducted by developers to keep reusability under control, we could observe that most of the commits performed over the last years were peripheral (Amrit and Van Hillegersberg 2010), namely, they pertained to packages of the systems other than core. This may explain the observed trend: developers did not modify any central part of those systems, leaving the original design stable and avoiding an excessive effect of design erosion.

Decreasing - Increasing Pattern

Finally, the Commons-Compress project (5th row in Fig. 2) exhibited an anomalous trend which we coined “decreasing-increasing”. After a greater adoption of implementation inheritance, the trend steadily decreased before increasing again, but at a lower rate. Also, in this case, we manually dived into the repository in search of possible explanations. We discovered that after the release of the second version of the project in 2010 (release commons-compress-1.1), the release engineering process of the system changed, passing from annual to monthly releases. This switch caused a substantial rework of the original architecture, replacing existing code with third-party libraries. Consequently, the overall amount of implementation inheritance uses suddenly decreased in favor of other code reuse mechanisms. Afterward, the developers of the system kept the implementation inheritance under control, leading to an increasing usage trend.

4.1.2 RQ\(_{1.2}\) - Variation of Specification Inheritance Over Time

When considering the specification inheritance, the usage patterns identified in RQ\(_{1.1}\) still hold. In particular, we observed the same “increasing-decreasing” trend in Commons-Cli, while in Commons-Codec a “stable” trend. These findings seem to suggest the existence of a possible strict (cor)relation between implementation and specification inheritance throughout the evolution of software systems, which might depend on the willingness of developers to take (or not) code reusability into account when evolving source code. Part of our future research agenda will consider the effects of this co-evolution of metrics on software quality.

4.1.3 RQ\(_{1.3}\) - Variation of Delegation Over Time

Regarding the delegation, we could observe usage patterns similar to those discussed above. Nonetheless, we could also discover situations where the evolution of delegation followed an opposite trend with respect to those of implementation and specification inheritance. This is, for instance, the case of Commons-Collections. Indeed, starting from a high adoption during the first development phases, the amount of delegation used kept decreasing until reaching a stable level. This result was, however, somewhat expected, as inheritance and delegation are alternatives to each other (Bruegge and Dutoit 2009) and, therefore, an increasing use of one may lead to a decreasing use of the other. Similar results were observed when analyzing other projects, e.g., Closure-Compiler, Jackson-Core, and Commons-Compress.

The apparent synergy between inheritance and delegation could offer an opportunity for source code quality predictive models. These models could decide which metrics to focus on at different stages of development. In this way, the models could rely on metrics that can best represent the current state of the system under analysis, potentially improving their predictive capabilities.

Table 2 RQ\(_2\). Variables removed because of multi-collinearity

4.2 RQ \(_2\) - The Impact of Reusability Metrics on Defect-Proneness

In this sub-section, we report the results obtained when studying the impact of reusability metrics on the defect-proneness of source code.

Multi-Collinearity Analysis

Before discussing the results of the statistical model, it is worth reporting the outcome of the multi-collinearity analysis, which was performed to make sure that no correlated variables were employed within the statistical model and could bias the interpretation of the results (see Section 3). Table 2 lists the variables removed after the application of the correlation analysis. In the first place, we found that RFC was the metric most often removed: in all the cases, it was correlated with LOC and, therefore, we preferred keeping LOC because of its higher degree of interpretability. Secondly, in three projects, i.e., Commons-Collections, Jackson-Core, and Joda-Time, the WMC metric was removed, again for its correlation with LOC. We also discovered correlations between DIT and NOC in two projects, namely Commons-Codec and Commons-Cli: we kept NOC, namely the metric reporting the number of immediate subclasses of a class. In the cases of Jackson-Dataformat-XML and Joda-Time, we found a correlation between DIT and specification inheritance: as the latter was one of the independent variables, we preferred keeping it. Finally, we identified correlations between specification and implementation inheritance in the projects Commons-Cli and Jackson-Core. These correlations could already be hypothesized by looking at the trends observed in the context of RQ\(_1\): in these two cases, we were obliged to remove one of the independent variables and decided to opt for implementation inheritance.

Table 3 RQ\(_2\). Results of the statistical model

Statistical Model Explanation

Table 3 shows the results of the statistical models built in RQ\(_2\). The independent variables and control variables are reported on the rows, while the various considered systems are reported on the columns. Empty cells indicate that a certain variable was removed from the analysis of a specific system as a consequence of the multi-collinearity analysis, while the number of observations (the commits analyzed) for each project is reported in the header of each column. The statistical codes report the p-value for each variable and each project and were used to interpret the results obtained. According to the description reported in the last row of Table 3, a higher number of ‘*’ implies a higher statistical relevance of a variable with respect to the decrease (\(\downarrow \)) or increase (\(\uparrow \)) of the defect-proneness of source code.

Statistical Model Analysis

Looking at the table, various considerations can be drawn. First and foremost, in 10 out of the total 12 projects we found at least one of the inheritance metrics to be a statistically significant factor to explain the defect-proneness of the considered systems. The NOC metric, in particular, is the one that is relevant in the most systems: in 8 projects the metric was observed to explain both the increase and decrease of defect-proneness.

To understand how the metric affects the phenomenon of interest, we analyzed the sign of the coefficients. Specifically, the coefficients of a Multinomial Log-Linear model relate to a reference category and indicate how the variables change the chances of the dependent variable being affected with respect to the reference category—which was set to “stable” in our case. As for the columns “\(\downarrow \)” of Table 3, this means that a negative coefficient for a variable X suggests that for one unit increase of X, the chances that the defect-proneness of source code varies toward a decrease are estimated in the amount indicated by the coefficient, i.e., the higher the coefficient the higher the chance that the variable contributes to decrease the defect-proneness of source code. On the contrary, a positive coefficient implies that for one unit increase of X, the chances that the defect-proneness of source code varies toward the stability are estimated in the amount indicated by the coefficient, i.e., the higher the coefficient the higher the chance of defect-proneness being stable over time. Similarly, in the case of the columns “\(\uparrow \)”, a negative coefficient for X implies that the chances that the defect-proneness of source code varies toward the stability are estimated in the amount indicated by the coefficient, i.e., the higher the coefficient the higher the chance of defect-proneness being stable over time. A positive coefficient would instead indicate that the chances of defect-proneness increasing are estimated in the amount indicated by the coefficient, i.e., the higher the coefficient the higher the defect-proneness of source code.

According to this interpretation, the signs of the coefficients for NOC over the various projects did not report a common pattern. For example, in Commons-Compress we observed a positive coefficient of the variable for “\(\downarrow \)” and a negative coefficient for “\(\uparrow \)”, meaning that the variable statistically influences the stability of defect-proneness over time. On the contrary, in the Closure-Compiler project the coefficients are positive for both “\(\downarrow \)” and “\(\uparrow \)”, meaning that the variable tends to influence the increase of defect-proneness, overall. As such, we could not delineate a common behavior for NOC. Likely, its impact depends on the peculiarities of the development process in place in the different projects rather than on more general aspects.

As for the independent variables considered in our study, namely inheritance and delegation, the discussion is similar. On the one hand, the impact of these metrics is limited to a few projects, suggesting that the defect-proneness of source code is only partially dependent on reusability metrics. On the other hand, the coefficients of the metrics vary without a common pattern. As an example, the coefficient for specification inheritance was positive for “\(\uparrow \)” in Commons-Cli and negative in Joda-Time. Along the same lines, implementation inheritance had a slightly positive coefficient for “\(\uparrow \)” in Jackson-Databind, but a negative coefficient in JxPath. As for the delegation, this turned out to be statistically relevant in just two projects, i.e., Jackson-Databind and JxPath, without a consistent sign. Hence, we could conclude that the reusability metrics themselves have a limited connection to defect-proneness. Other indicators, like the structure of the hierarchies computed by NOC, seem to have more statistical power. As such, it is not the amount of reusability mechanisms used by developers that influences the defect-proneness of source code, but rather the way these mechanisms are used in the specific cases. This result has two main implications. First, we could not identify a drawback in the use of inheritance and delegation with respect to software reliability: hence, the application of reusability mechanisms is not per se something to avoid. Second, this result represents a call to researchers in software quality, who are required to devise novel quality checkers and/or empirical investigations to monitor the way code reuse is implemented and how it may negatively affect the defect-proneness of source code.

Another valuable consideration can be drawn when considering the control variables. According to our results, none of them seems to be statistically impactful on defect-proneness. We believe this is a relevant result for the software maintenance and evolution research community as a whole. Code quality metrics have indeed often been used to estimate and/or predict defects: our results indicate the lack of statistical significance and possibly imply that the set of metrics considered within defect prediction models should be reconsidered. In this sense, we corroborate previous findings on the limited value of the Chidamber-Kemerer metric suite for defect prediction (He et al. 2015; Jureczko 2011; Radjenović et al. 2013) as well as further stimulate the research on alternative predictors (Bird et al. 2011; Di Nucci et al. 2017; Palomba et al. 2017; Posnett et al. 2013).

Table 4 RQ\(_3\). Results of the statistical model

4.3 RQ \(_3\). On the Impact of Reusability Mechanisms in Code Churns

Table 4 reports the statistical results obtained when building a Generalized Linear model on the data collected for RQ\(_3\). Differently from RQ\(_2\), the dependent variable was the code churn, namely a numerical variable.

Statistical Model Explanation

The statistical model outputs a single coefficient for each independent variable: this coefficient corresponds to the impact of a one-unit increase of the variable on the amount of code churn. Also in this case, the statistically significant coefficients are highlighted with a ‘*’ symbol: a higher number of ‘*’ implies a higher statistical relevance of a variable with respect to the code churn computed on a defect-fixing commit i. The variables discarded through the multi-collinearity analysis are the same as in RQ\(_2\).

Statistical Model Analysis

Looking at the table, we can draw various conclusions. As expected, the LOC metric was found to be statistically significant in 9 systems out of 12. The coefficients are also relatively high in all cases, meaning that larger classes are typically harder to maintain; in this respect, we could corroborate previous findings in the literature (Hayes et al. 2004; Sjøberg et al. 2012). The CBO metric, which computes the coupling between objects, was also statistically significant in nine projects, confirming that developers spend more effort in fixing defects pertaining to highly-coupled classes (Leach 1990). Other code quality metrics were not statistically significant. So, to conclude this first point of discussion, we could report that, besides LOC and CBO, the role of code metrics in estimating the maintenance effort seems to be limited. Once again, this finding is of interest to the software maintenance and evolution research community, which might be called to define novel metrics and/or instruments to monitor maintenance effort over time.

Turning the focus to our independent variables, we could reach conclusions similar to those of RQ\(_2\) when considering inheritance. Both specification and implementation inheritance were indeed mostly not statistically significant, with some exceptions. The former was relevant for the projects Commons-Cli, Jackson-Databind, and Joda-Time. However, the sign of the coefficients revealed that the metric was statistically related to the increase of code churn only in the case of Commons-Cli. By analyzing this case further and relating the statistical result with the trend analysis conducted in RQ\(_1\), we could better understand the reason behind this correlation. Most of the defects available for Commons-Cli were introduced and fixed after the design erosion discussed in RQ\(_1\). It is therefore reasonable to believe that it was the lack or the decrease in the use of inheritance mechanisms which caused a higher maintenance effort when fixing defects. This interpretation is in line with what was observed on the other systems, i.e., Jackson-Databind and Joda-Time, where the specification inheritance was negatively correlated to maintenance effort, meaning that this was a significant factor to reduce the code churn required to fix defects.

Implementation inheritance was found to be statistically relevant in just two cases, i.e., on Jackson-Databind and JxPath. While in the former case the coefficient was close to zero, indicating little to no correlation to the dependent variable, it was -15.482 in the second case. Hence, also in this case we could conclude that this metric was negatively correlated to the maintenance effort. Enlarging the discussion to the other inheritance metrics subject of the study, namely NOC and DIT, we could discover results similar to those of RQ\(_2\). Both NOC and DIT were positively correlated to the dependent variable and the coefficients were relatively large in all cases: these results imply that the structure of hierarchies might strongly influence the maintenance effort to fix defects, hence corroborating the results obtained in our previous research question, as well as the results of empirical studies reporting how NOC and DIT could worsen software maintainability (Daly et al. 1995, 1996; Prechelt et al. 2003).

As for delegation, the coefficients were mostly negative, even if relatively small. Hence, we could conclude that there exists a small negative correlation between the metric and maintenance effort, which implies that the use of delegation may decrease the overall amount of code churn required to fix defects.


5 Discussion and Implications

The results of our study revealed a number of insights which are worth discussing further. This section elaborates on the analyses conducted and discusses the key implications of our findings for researchers and practitioners.

5.1 Further Discussion and Analyses

In this respect, there are three main points to discuss.

Relation to Existing Literature

In the first place, it is worth discussing the way our findings relate to previous research on the matter. As discussed already in Section 2, various empirical studies have linked implementation and specification inheritance to source code quality. Some of them, like the works by Albalooshi and Mahmood (2014) and Goel and Bhatia (2013), discovered negative correlations between the use of those reuse instruments and source code quality. Our results could not corroborate those observations: according to our analyses, indeed, implementation and specification inheritance are mostly correlated with positive improvements of source code. As such, we could instead confirm the “common wisdom” for which a higher degree of reusability leads to a higher maintainability of source code (Bruegge and Dutoit 2009). At the same time, we could extend the set of observations conducted on implementation and specification inheritance with respect to our previous work (Giordano et al. 2022): not only do those mechanisms tend to decrease the severity of code smells over time, but they also relate favorably to other software maintenance properties, like defect-proneness and the effort to fix defects. Last but not least, the statistical results provide additional insights to the body of knowledge on software evolution and maintenance effort estimation. In the former case, our commit-level analysis could provide finer-grained information on how the adoption of the three considered code reuse mechanisms evolves over time. In the latter case, instead, the results of our RQ\(_3\) unveiled the actual relation between code reuse and corrective maintenance; this represents a novel contribution of our study.

Making Sense of the Statistical Data

By definition, our empirical study had a statistical connotation and aimed at analyzing patterns and correlations extracted through the mining of software repositories. As such, the relation between code reuse and defect-proneness has been observed quantitatively. The nature of such an analysis naturally brings some considerations about the reliability of the conclusions provided. In particular, the independent variables in our statistical exercise were computed by means of metrics accounting for their adoption and were assessed against defect-proneness through statistical correlations. The relations unveiled might therefore be due to spurious correlations among metrics rather than being the result of causal inference. To account for this potential threat to validity and strengthen the conclusions of the study, we conducted an additional qualitative analysis aimed at assessing the relation between code reuse and defects. In particular, starting from the dataset considered in the study, we (1) computed the number of cases in which defect-inducing and defect-fixing commits involved the variation of inheritance and delegation metrics and (2) manually analyzed those cases to better understand the way these metrics can affect defect-proneness of source code. Such an analysis allowed us to verify more closely which kind of modifications have been applied by developers in terms of inheritance and delegation and how these led to the variations of defect-proneness. The analysis was led by the first author of the paper, who selected the relevant commits and analyzed the diffs between these and their predecessors. To support the manual investigation, the inspector employed automated static analysis tools such as RefactoringMiner (Tsantalis et al. 2020) and SonarQube (Lenarduzzi et al. 2022); these tools were used for the sole purpose of extracting additional information on the code changes applied within the commits.

Such an additional analysis first revealed that in a non-negligible amount of cases, i.e., in about 50% of the defect-fixing commits, the changes applied by developers included modifications that impacted inheritance and delegation metrics. Perhaps more importantly, those modifications were instrumental to accommodate the defect-fixing activities. For instance, let us consider the commit 40689aa of the project JxPath. This commit addressed a defect concerning the evaluation of strings as boolean expressions. To fix it, the developer moved methods from the subclasses CoreOperationEqual and CoreOperationNotEqual to the abstract superclass CoreOperationCompare, and added a parameter in the super method of the subclass CoreOperationNotEqual. These operations had the effect of modifying the implementation inheritance relations of the CoreOperation hierarchy. This example shows well how code reuse is employed in practice to reduce the overall complexity of the system and possibly reduce defect-proneness. Indeed, the developer exploited code reuse to propagate the fix to all subclasses that would have possibly been affected by the string evaluation defect, hence reducing defect-proneness while improving software maintainability. We observed similar cases in the dataset, particularly in 75% of the commits where inheritance and delegation metrics varied as a consequence of defect-fixing activities. For the sake of completeness, we report the details of this qualitative investigation in our online appendix (Giordano et al. 2022).

On Metrics and Their Relation to Defect-Proneness

The last point to further discuss is concerned with the role of the considered metrics with respect to their relation to defect-proneness. In this respect, two observations should be made. In the first place, we discovered that our inheritance and delegation metrics, coming from the operationalization of the reusability mechanisms used by developers, have a relatively low impact on defect-proneness. In the second place, we found out that the control variables of our statistical analysis, namely the metrics pertaining to the Chidamber and Kemerer (1994) metric suite, also have a limited connection to defect-proneness. Both findings are somewhat surprising: these metrics were indeed experimented with in plenty of studies on source code quality and researchers have often been analyzing the extent to which they can support the monitoring and prediction of defect-proneness of source code (Basili et al. 1996; Gyimóthy et al. 2005).

To provide further, more actionable insights into our findings and better understand the extent to which our statistical analysis would be actually corroborated when considering the impact of code quality metrics on defect prediction, we conducted an additional analysis where we (i) built a defect prediction model and (ii) assessed whether the findings obtained in the context of RQ\(_2\) might have been confirmed.

More specifically, given that our analysis granularity level was the commit and that we needed to account for the time relations between commits, we focused on the so-called just-in-time defect prediction (Kamei et al. 2012), that is, the creation of defect prediction models able to assess the defectiveness of individual code commits based on the data collected through the analysis of previous commits.

To make our analysis as precise and sound as possible, we conducted a partial replication of the work by Pascarella et al. (2019), who experimented with a large set of features composed of 24 process, product, and developer-oriented metrics to capture the defectiveness of code commits. As product metrics, the original authors used the metrics also employed within our study. Through this replication, we could therefore assess the contribution of these metrics to defect prediction and compare such a contribution with that of additional metrics typically used in defect prediction, hence enlarging our overview on the value of the considered metrics. While Pascarella et al. (2019) mainly focused on a variant of the problem of just-in-time defect prediction aiming at predicting defective files within commits rather than defective commits, they also compared against a standard just-in-time defect prediction model, hence enabling an analysis at commit-level. The reason for relying on this work was threefold. In the first place, Pascarella et al. (2019) released an online appendix with all the scripts used in their study and documentation that enables the exact replication of their work: as such, we avoided possible bias due to the re-implementation of the defect prediction model. Second, one of the authors of the work by Pascarella et al. (2019) is also a co-author of this submission: as a consequence, we could exploit his knowledge in case of replication issues. Third, Pascarella et al. (2019) took into account a large number of metrics of different nature coming from previous literature on defect prediction (Kamei et al. 2012; Rahman and Devanbu 2013): as such, we could conduct a larger and sounder experimentation of how quality metrics affect the performance of just-in-time defect prediction. To conduct our analysis, we performed the following steps:

  • For each project considered in our study, we mined all the commits to compute the 24 process, product, and developer-oriented metrics. Since the metrics were computed on the files modified within the considered commits, we aggregated them to have a unique commit-level value for each metric. This was done using the “group by” operation, considering the commit hash as the primary key, and applying the mean and median over all the metrics;

  • We merged the information collected with the one available in our dataset: for each project and for each commit, we combined the 24 process, product, and developer-oriented metrics with the inheritance and delegation metrics;

  • We trained and tested a Random Forest classifier, i.e., the best classifier identified in the work by Pascarella et al. (2019), by applying a Time Series Split validation. This is a time-aware variant of cross-fold validation that (i) divides the dataset into K (in our case, K = 10) folds and (ii) in the k\(^{th}\) split, returns the first k folds as the training set and the (k+1)\(^{th}\) fold as the test set (Footnote 6). This validation can be applied when the time order may impact the results, and it avoids training the model on future commits to predict the defectiveness of past commits. The performance of the model was assessed through multiple evaluation metrics such as precision, recall, F-Measure, and AUC-ROC.
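As a purely illustrative sketch of the aggregation and validation steps listed above (not the replication scripts used in the study), the following Python code groups file-level metric rows by commit hash and generates time-ordered splits in which the first k folds train the model and fold k+1 tests it. The row layout, the metric name `wmc`, and the choice of letting the last fold absorb leftover commits are assumptions made only for this example.

```python
from collections import defaultdict
from statistics import mean, median


def aggregate_by_commit(file_rows, metric_names):
    """Collapse file-level metric rows into one row per commit (mean and median)."""
    grouped = defaultdict(list)
    for row in file_rows:
        grouped[row["commit"]].append(row)  # the commit hash is the grouping key

    aggregated = {}
    for commit, rows in grouped.items():
        aggregated[commit] = {}
        for name in metric_names:
            values = [r[name] for r in rows]
            aggregated[commit][name + "_mean"] = mean(values)
            aggregated[commit][name + "_median"] = median(values)
    return aggregated


def time_series_splits(n_commits, k=10):
    """Yield (train_indices, test_indices) over chronologically ordered commits.

    In the split number s (1 <= s < k), folds 1..s are used for training and
    fold s+1 for testing, so a model never sees commits newer than its test set.
    """
    if n_commits < k:
        raise ValueError("need at least one commit per fold")
    fold_size = n_commits // k
    # Fold boundaries. Assumption: the last fold absorbs any remainder.
    bounds = [i * fold_size for i in range(k)] + [n_commits]
    for split in range(1, k):
        train = list(range(bounds[split]))
        test = list(range(bounds[split], bounds[split + 1]))
        yield train, test
```

With K = 10 folds this scheme produces nine train/test pairs, and in every pair all training indices precede all test indices.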

We investigated two predictive configurations. In the first one, we devised a binary defect prediction model that predicts a commit as defective or not, i.e., the standard defect prediction scenario. In the second configuration, we devised a multi-class defect prediction model able to assess how the source code defectiveness varies over the evolution of the project, i.e., a defect prediction scenario where the task is to foresee the defectiveness trend in terms of increase, decrease, or stability of the number of defects within a software project. This latter scenario is closer to the research methods employed in our study and was set up with the aim of embedding additional evolutionary considerations within the defect prediction model and investigating the contribution of code quality metrics to assess the overall defectiveness of a software project. From a more technical perspective, the model was devised to assign a commit to a categorical variable within the set {‘Increased’, ‘Decreased’, ‘Stable’}, namely the same variables used within the Multinomial Log-Linear statistical model built to address RQ\(_2\).
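To make the multi-class target more concrete, the sketch below derives the three labels from a sequence of defect counts. It assumes, only for illustration, that each commit is labelled by comparing the number of defects before and after it; this section does not restate how the counts are obtained, so the function names and the input format are hypothetical.

```python
def defect_trend(defects_before, defects_after):
    """Map a change in the number of defects to one of the three class labels."""
    if defects_after > defects_before:
        return "Increased"
    if defects_after < defects_before:
        return "Decreased"
    return "Stable"


def label_history(defect_counts):
    """Label every commit after the first by comparing consecutive defect counts."""
    return [
        defect_trend(previous, current)
        for previous, current in zip(defect_counts, defect_counts[1:])
    ]
```

For example, a history with defect counts 3, 5, 5, 2 would yield the labels Increased, Stable, Decreased for the second, third, and fourth commit, respectively.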

For both predictive scenarios, we ran the model twice: the first time relying on all the metrics and the second time relying on all metrics but those concerned with inheritance and delegation. This was done in an effort to more closely monitor the impact of the main variables of our work, i.e., inheritance and delegation metrics, by quantifying the accuracy gain/drift achieved when considering them as features of the defect prediction models. In addition, we also computed the feature importance to verify which metrics were most relevant for the experimented models.
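The with/without comparison described above can be sketched as a small ablation helper. The code is a minimal illustration rather than the study's implementation: the feature names in `REUSE_FEATURES` are hypothetical, and `evaluate` stands for any user-supplied function that trains and validates a model on a given feature list and returns a score such as the F-Measure.

```python
# Hypothetical feature names; the actual column names of the dataset may differ.
REUSE_FEATURES = {
    "specification_inheritance",
    "implementation_inheritance",
    "delegation",
}


def f_measure(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def ablation_gain(evaluate, features):
    """Score the model with all features and without the reuse-related ones."""
    reduced = [f for f in features if f not in REUSE_FEATURES]
    score_all = evaluate(features)
    score_reduced = evaluate(reduced)
    return score_all, score_reduced, score_all - score_reduced


def rank_of(feature, importances):
    """1-based position of a feature when sorted by decreasing importance."""
    ordered = sorted(importances, key=importances.get, reverse=True)
    return ordered.index(feature) + 1
```

A gain close to zero indicates that removing the reuse-related features leaves the score essentially unchanged, while `rank_of` supports statements such as "the feature is among the top-15".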

In terms of results, we could draw multiple considerations. When considering the binary defect prediction scenario, the performance achieved was close to 94% in terms of F-Measure both when considering the models with and without inheritance and delegation metrics. On the one hand, this result seems to indicate that the overall defect prediction capabilities cannot be improved through the use of reusability metrics, hence confirming the results of RQ\(_2\), i.e., inheritance and delegation metrics have a limited connection to defect proneness. On the other hand, it is worth observing that improving over an F-Measure of 94% is always particularly tough: in this sense, the contribution given by inheritance and delegation metrics may be somehow “hidden” by the high performance of the defect prediction model. As a consequence, a more reasonable way to assess the contribution of reusability metrics was to assess the feature importance of the metrics considered by the model relying on inheritance and delegation indicators. Through this analysis, we discovered that (1) the Random Forest classifier never selects specification and implementation inheritance among the top-20 features to use for predicting defective commits in the considered projects; (2) the amount of delegations was in the top-15 features employed by the model in all the projects; (3) the specification and implementation inheritance metrics had limited predictive power, with other inheritance metrics such as NOC and DIT having a slightly higher impact on the predictions. These findings were perfectly in line with the observations reported in RQ\(_2\): we could indeed further corroborate that the defect-proneness of source code is only partially dependent on reusability metrics and that, instead, the way developers structure hierarchies might impact defects more than the specific reusability mechanisms employed.

In addition, our RQ\(_2\) revealed that the control variables used in our statistical analysis, i.e., the Chidamber-Kemerer metrics, were not statistically impactful on defect proneness. The defect prediction investigation confirmed these findings as well. Indeed, the feature importance analysis consistently reported process metrics such as the entropy of changes (Hassan 2009), the scattering of code changes (Di Nucci et al. 2017), and commit date (Rahman and Devanbu 2013) as the most impactful features. In the first place, our findings corroborate previous research showing that process metrics can predict defects better than traditional code quality attributes (Rahman and Devanbu 2013) and, as a consequence, provide additional support to the research field involved in the definition of process and developer-oriented metrics for defect prediction. Secondly, our research outlines that the use of code quality metrics, including the inheritance and delegation ones, to assess the defectiveness of source code may result in suboptimal recommendations for developers and, for this reason, these metrics should be used for different purposes and/or for different use cases: for instance, our previous work (Giordano et al. 2022) revealed that quality, inheritance, and delegation metrics can positively contribute to the evolutionary analysis of code smells.

A similar discussion can be made when considering the multi-class prediction model. Also in this case, we found that the models with and without reusability metrics had similar performance in terms of F-Measure (94%), with inheritance and delegation metrics being selected by the Random Forest classifier for all projects. While they had a lower predictive power than NOC and DIT, we found that both inheritance and delegation metrics were more impactful than cohesion, coupling, and complexity metrics, e.g., LCOM, CBO, WMC. As such, we could further corroborate that quality, inheritance, and delegation metrics have a limited connection to defect proneness. Similarly to the previous experiment, the entropy of changes (Hassan 2009), the scattering of code changes (Di Nucci et al. 2017), and commit date (Rahman and Devanbu 2013) were the most important characteristics to predict defective commits, hence suggesting that evolutionary considerations on the defect proneness of source code should be made through the analysis of historical information coming from the complexity of the development process.

All in all, our findings corroborated the negative results obtained by previous researchers who experimented with code quality metrics in defect prediction (He et al. 2015; Jureczko 2011; Radjenović et al. 2013). While this is already worrisome for the entire software maintenance and evolution research community, our findings should be considered as even more worrisome because of the granularity of the analysis conducted. We indeed elaborated on the change history information of software projects, analyzing how code quality metrics were related to defect-proneness throughout the evolution of the considered projects, discovering that none of them was statistically correlated to the variation of defect-proneness. As such, our results represent an additional alarm signal for the research community. Our future research agenda includes experimentations aiming at elaborating on code quality metrics and their actual relation to software maintenance. For the sake of completeness, we report the details of this analysis in our online appendix (Giordano et al. 2022).

Fig. 4 Use case scenario in which the monitoring of reusability metrics might be exploited

5.2 Implications of the Study

On the basis of the results achieved and the additional discussion points elaborated in the previous section, we identified a number of implications for researchers and practitioners.

Monitoring Usage Trends to Improve Software Quality

The usage trends elicited in the context of RQ\(_1\) revealed various forms in which code reusability mechanisms are employed throughout software evolution, while the results obtained in RQ\(_2\) and RQ\(_3\) - and the additional qualitative analysis discussed in Section 5.1 - pointed out the benefits reusability may have to reduce both risks connected to poor software reliability and effort required for corrective maintenance activities.

Altogether, these findings seem to suggest that advanced knowledge on how to improve software quality might be obtained by exploiting valuable pieces of information coming from the analysis of the change history of software projects. For instance, we envision the definition of monitoring techniques that, by exploiting the way developers adopt code reusability mechanisms, may recommend the most appropriate actions to conduct while performing corrective maintenance. Similarly, we can envision the definition of novel approaches based on nudge-theory (Brown 2019) to stimulate developers toward the more frequent or most appropriate adoption of code reuse to reduce the overall defect-proneness of source code. To make our conjectures more tangible, let us consider the scenario depicted in Fig. 4, which represents the way we envision a monitoring system may support developers during software maintenance and evolution. More specifically, suppose that a system ‘S’ contains a module ‘A’ having (1) multiple submodules, i.e., ‘B’, ‘C’, ‘D’, ‘E’, ‘F’, ‘G’, ‘H’, ‘I’ and ‘L’ in Fig. 4, each either directly or indirectly inheriting from ‘A’; (2) some operations through which the submodules delegate operations to ‘A’. In such a scenario, a regular monitoring of reusability metrics or the prediction of usage trends may allow the developer to observe or predict the way the inheritance and delegation relations vary over time, possibly detecting or even preventing the increasing complexity affecting ‘A’ and its submodules, as well as the presence of suboptimal design decisions that would require some refactoring actions.

For instance, suppose that in the scenario proposed in Fig. 4 a monitoring system realizes that the number of functionalities provided by ‘A’ is steadily increasing, with the frequency of ‘A’ being reused decreasing in the submodules: this case may indicate that the system is in the descending path of an ‘increasing-decreasing’ implementation inheritance pattern identified in RQ\(_1\). This may indicate a suboptimal use of inheritance and delegation: ‘A’ offers more services, but the submodules inheriting from it do not fully exploit them, suggesting that they are not properly exploiting the inheritance mechanism. Note that a similar scenario has been associated with multiple risks for software reliability, including an increasing change- and defect-proneness (Palomba et al. 2018) and a higher likelihood of the system being maliciously attacked because of the suboptimal visibility granted to fields and operations (Spooner 1988). By monitoring reusability metrics, multiple insights may be provided. On the one hand, developers may be informed of the evolution of reusability metrics and exploit such information to schedule quality assurance sessions aiming at reducing quality and security concerns, e.g., code review targeting ‘A’ and the way the submodules interact with it. On the other hand, automated instruments might exploit reusability metrics to recommend refactoring actions aiming at simplifying the hierarchy: for instance, the situation described above, i.e., submodules not fully exploiting the features of ‘A’, may suggest the presence of a Refused Bequest smell (Fowler 2018), whose refactoring may consist of either defining a new superclass only containing the fields and operations that are actually needed by the submodules, i.e., Extract Superclass refactoring, or replacing the inheritance mechanism with delegations, i.e., Replace Inheritance with Delegation refactoring.
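The kind of check that such a monitoring system might run can be sketched as follows. This is a hypothetical illustration of the envisioned scenario, not an existing tool: it assumes that, for each snapshot of the project, we know how many operations the superclass ‘A’ provides and how many of them its submodules actually reuse, and it raises a flag when the former keeps growing while the latter keeps shrinking over a chosen window of snapshots.

```python
def strictly_increasing(values):
    """True if every value is larger than the one before it."""
    return all(b > a for a, b in zip(values, values[1:]))


def strictly_decreasing(values):
    """True if every value is smaller than the one before it."""
    return all(b < a for a, b in zip(values, values[1:]))


def flag_underused_superclass(provided_ops, reused_ops, window=3):
    """Flag a superclass that keeps growing while its reuse keeps shrinking.

    provided_ops: operations offered by the superclass, one value per snapshot.
    reused_ops:   operations actually reused by submodules, one value per snapshot.
    window:       number of most recent snapshots to inspect (hypothetical default).
    """
    if len(provided_ops) != len(reused_ops):
        raise ValueError("both series must cover the same snapshots")
    if window < 2 or len(provided_ops) < window:
        return False  # not enough history to talk about a trend
    recent_provided = provided_ops[-window:]
    recent_reused = reused_ops[-window:]
    return strictly_increasing(recent_provided) and strictly_decreasing(recent_reused)
```

A raised flag would only be a hint for a code review or a candidate Refused Bequest inspection; the window size and the strictness of the trend test are design choices that would need empirical tuning.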

On the basis of the considerations above, the multifaceted ways in which our findings can be exploited represent a call for researchers in the field of software quality and software maintenance and evolution.

Code Reuse and Its Adoption: Two Sides of the Same Coin

Our empirical investigation (RQ\(_2\) and RQ\(_3\)) revealed a dichotomy between the concept of code reusability and its actual application. In particular, we found that while reusability itself is a useful instrument to improve software quality and reduce maintenance effort, an inappropriate adoption of these mechanisms may have negative implications. This is indeed the case observed with DIT and NOC in our statistical exercise, two well-known metrics that measure the extent of the hierarchical relations among classes. We found that increases in terms of hierarchical relations lead to negative variations of the defect-proneness of software artifacts. As such, we argue the need for further research, especially in terms of software refactoring optimization. Researchers are indeed called to better investigate the reasons behind the misuse of inheritance and delegation mechanisms and when and why these can deteriorate software quality. These investigations would be instrumental to the definition of novel refactoring techniques that may support developers while optimizing hierarchies of classes.

At the same time, our findings provide two key implications for practitioners. On the one hand, an improved knowledge of the usage patterns might be beneficial to understand the way code reusability evolves in their own projects: practitioners could therefore put in place monitoring instruments to verify the evolution of inheritance and delegation uses and assess how the usage trends co-evolve with software quality. On the other hand, our results might be exploited by practitioners to reason on the use and misuse of inheritance and delegation mechanisms, as well as on how the creation of complex hierarchies might worsen source code quality and increase corrective maintenance effort.

Prediction of Code Quality Properties: The Road Ahead

Another aspect to consider is the one concerned with the prediction of code quality properties. In this respect, the findings coming from our research questions altogether contribute to increasing the research community's awareness of the need for novel code quality prediction techniques and tools. First, the traditional code quality metrics employed in prediction models have little to no correlation to defect-proneness. Second, code reusability mechanisms might potentially boost code quality analysis and possibly be used within predictive modeling techniques. In addition, the usage trends can be exploited to recommend which of the features would be most worth using in specific moments of the evolution. All these aspects, which emerged from our analyses, represent future perspectives that our research community would like to further investigate. We envision multiple experimentations aiming at revisiting previous findings obtained in literature to account for the evolutionary nature of software: the research method employed in our study, which took the change history information into account, may indeed be generalized to understand how different code quality metrics evolve over time and how they impact software quality. In our opinion, analyses of this type could potentially revolutionize code quality as we know it, revealing insights driven by the actual adoption of code metrics by developers.

At the same time, we envision novel techniques that, by analyzing the evolutionary development context, may feed predictive models with the most relevant metrics to predict source code quality. Also in this case, we believe that an evolution- and context-aware view of predictive software maintenance might substantially boost the support that we, as researchers, may provide to practitioners.

These observations represent the road ahead of software quality prediction models and are part of our future research agenda on the matter.

On the Teaching of Reusability Mechanisms

From an educational perspective, our findings provided multiple insights that may be useful to guide or tune the teaching of reusability mechanisms. In the first place, the findings coming from RQ\(_1\) reported that inheritance and delegation instruments typically follow four well-defined adoption patterns, each of them having implications on source code quality and being motivated by contextual development factors. For instance, we observed that a “decreasing-increasing” pattern in terms of inheritance adoption might be motivated by the substantial rework required to include third-party libraries or adapt the architecture of the system being developed and may naturally favor these complex modifications. As a consequence, teaching the contextual circumstances making these patterns instrumental for software maintenance and evolution tasks may potentially increase the awareness of the next generation of software engineers toward the adoption of reusability mechanisms, other than increasing their willingness to actually employ them in practice. In other terms, rather than teaching reusability on its own, our findings suggest that an improved way of teaching those principles might involve more complex scenarios where students are exposed to contextual situations requiring them to understand the benefits and drawbacks of reusability, other than the impact that reusability may have on other evolutionary tasks.

Also, RQ\(_2\) showed that the defect-proneness of source code is not influenced by the reliance on inheritance and delegation mechanisms, but rather by the specific adoption of these mechanisms. In our opinion, this is a key finding from the educational perspective: we argue that case-based learning (Eshach and Bitterman 2003) might be a notable advance to let students reason on the effects that reusability may have in specific use cases, hence having a tangible and concrete understanding of the implications of reusability for software quality. In this sense, the use of gamification (Caponetto et al. 2014) might further stimulate the capabilities of students to distinguish when and why reusability may represent a valuable tool to improve software quality and reduce risks to software reliability. On a similar note, the results of RQ\(_3\) indicated that the adoption of inheritance and delegation may reduce the effort required to fix defects. Also in this case, the use of case-based learning and gamification may allow students to work on specific, ad-hoc use cases where they are required to fix defects through the use of reusability mechanisms and assess the impact of their actions on software quality and reliability.

It is our hope that the insights of our study can be of inspiration for educators, who may partially redesign their courses to account for our findings, and software engineering education researchers, who may further investigate the way teaching reusability differently impacts the students’ abilities to use inheritance and delegation instruments in practice.

6 Threats to Validity

A number of potential threats might have biased the study. This section discusses them and reports the mitigation strategies applied.

Threats to Construct Validity

Threats in this category refer to a possible mismatch between theory and observation. In this respect, the selection of the dataset represents a crucial point for which there are various observations and remarks to make. We used Defects4J (version 2.0.0), which has been already widely used by the research community in several previous studies (e.g., Jiang et al. 2019; Perera 2020; Sobreira et al. 2018) and that reduced possible bias due to the presence of uncontrolled conditions, e.g., tangled changes (Herzig et al. 2016), allowing us to investigate the impact of reuse mechanisms on defect-proneness and maintenance effort more precisely.

As for the defects considered, the Git repositories of the considered projects may contain more issues than those reported in Defects4J. However, there are two observations to make in this respect. First, a notable number of these issues do not actually pertain to defects but to other maintenance and evolution tasks. For instance, let us consider the case of the commons-collections project, i.e., the project having the smallest number of defects in our study. According to the issue tracker (Footnote 7), the project has a total of 787 issues (filtering by Type=‘All’ and Status=‘All’): of those, only 374 pertain to defects (filtering by Type=‘Bug’ and Status=‘All’), while the remaining 413 issues refer to enhancements, implementation of new features, and other evolutionary tasks. As such, the set of candidate defects that we might have considered is much smaller than the raw data reported on the issue trackers. In the second place, a number of issues do not report reliable information. Still taking the commons-collections project as an example, we noticed that 159 of the issues marked as ‘Closed’ or ‘Resolved’ (filtering by Type=‘Bug’ and Status=‘Resolved, Closed’) report the strings “Invalid”, “Not a Bug”, “Won’t Fix”, “Cannot Reproduce”, and “Duplicate” as actual resolution, hence indicating that these defects were false positives, not taken into account by the developers, or already addressed as part of duplicated issue reports. In conclusion, we found that issue trackers contain a non-negligible amount of noise that would require substantial filtering and data quality procedures, which is indeed what Defects4J guarantees.

Still reasoning on the number of issues reported on the issue trackers of the considered systems, it is worth remarking that the candidate set of defects was limited by the types of defects and the types of fixes performed. We should distinguish multiple cases. First, some defects may not pertain to production code, e.g., test code defects, or might relate to the update of third-party libraries or configuration files. As explained in Section 3.1, these defects were not considered by Defects4J and, as a consequence, by our work. However, these defects would not have created any noise for our analysis: indeed, our work aims at understanding how reusability metrics affect the defect proneness of the production code and, for this reason, all the defects that are not related to production code cannot affect our measurements. Second, some defects might not be verifiable or not traceable, even though they relate to the production code. As for the former, they might either represent true defects that developers did not have enough time to deal with or false positives, namely defects that developers ignored and that were marked as ‘Resolved’ or left open in the issue tracker without any further action: considering these defects in our analysis would have caused some degree of uncertainty in terms of number of defects considered and, for this reason, we would have likely introduced some bias. As for the latter, these are defects that we could not trace back in the history of the considered projects and, as such, we could not technically analyze without approximation or heuristics that would have, again, introduced some degree of uncertainty. Last but not least, the candidate set of defects might have been limited by the types of fixing activities: Defects4J indeed discards defects whose fixes were performed along with other maintenance and evolution activities, e.g., tangled changes. 
Among the various cases discussed, this latter was the most critical in our case, as it refers to real defects that were not considered in the scope of the analysis and that might have biased the computation of the number of defects in the change history of the projects considered. A systematic assessment of the noise caused by these missing defects would have required the definition of dedicated data quality protocols through which we could have (i) systematically classified real defects among those not considered by Defects4J; (ii) analyzed the corresponding fixes to understand their nature; and (iii) assessed the extent to which our findings varied when considering the newly classified defects. To the best of our knowledge, the current literature does not offer any (semi-)automated instrument to perform a similar assessment nor guidelines to follow. We deemed the research investigation and methods required to perform such a systematic assessment as out of scope. Nonetheless, to partially analyze the potential noise given by those missing defects, we attempted to estimate the noise of our analysis in the case of the commons-collections project through a simple, likely suboptimal approach based on text mining and manual analysis. We first (i) mined the summary of each of the 215 defects marked as ‘Closed’ or ‘Resolved’ and having as resolution the string “Fixed”, and (ii) used a keyword-based approach to classify those issues according to their type. More specifically, we classified an issue as ‘test-related’ if the summary contained the keyword “test”, as ‘documentation-related’ if it contained keywords such as “JavaDoc” and “comment”, and as ‘configuration-related’ if it contained keywords such as “JDK”, “compil*”, “build”, and “CI”. In this way, we could estimate the number of issues whose fixes did not modify the production code, hence covering the first case described above. 
Afterwards, we manually went through the summaries of the remaining issues to assess how many of them revolved around modifications that were not verifiable, not traceable, or that performed modifications other than defect fixes, hence covering the other possible cases of noise. As a result, we discovered that 181 issues were not considered within Defects4J. Among them, 1% referred to Continuous Integration concerns, 7% to JDK compilation issues, 13% to test code defects, e.g., flaky tests, and 17% to documentation issues, e.g., unclear JavaDoc comments. Hence, 69 (38%) of the discarded defects did not concern production code. From the subsequent manual analysis, we discovered that 21 were untraceable (12%), while 84 were issues raised by specific users that the maintainers of the system solved by recommending configuration changes, hence not making any change to the system itself (46%). The remaining 7 defects were not correctly classified by the keyword-based approach and pertained to documentation or configuration issues: in these cases, the summaries reported keywords different from those used by the classifier, e.g., “typo”. Perhaps more interestingly, we found that 34 defects matched the requirements of Defects4J: yet, six were reported between November 2020 and June 2023, namely after the release of Defects4J 2.0.0 (issued on September 15, 2020), while 24 were part of the defects deprecated by Defects4J. As such, the set of defects actually analyzable was four, which is exactly the number of defects we analyzed. While such an additional analysis was not performed on all the considered systems, it let us provide some insights on the noise possibly affecting our results. While we acknowledge that our study took into account only a subset of defects having specific properties, it actually contains most of the real defects that should be taken into account. 
The noise caused by the presence of additional issues on the issue trackers is likely to be limited, as most defects and corresponding fixes are not related to production code. In conclusion, we argue that our conservative approach in terms of defect selection, i.e., that of relying on the defects pointed out by Defects4J, represents the best option to properly measure the extent to which reusability mechanisms impact the defect proneness of source code. As a side result of our additional analysis, we could also further corroborate the validity of Defects4J - which we consider as a valuable outcome for our research community.
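The keyword-based triage of issue summaries described above can be illustrated with the following minimal sketch. It is not the script used for the analysis: the rule order, the case-insensitive matching, and the treatment of “CI” as a case-sensitive whole word are assumptions introduced for this example, and only the keywords explicitly listed in the text are included.

```python
import re

# Rules are checked in order; the first matching rule decides the category.
# Assumption: matching is case-insensitive, except for "CI", which is matched
# as a case-sensitive whole word to avoid hits inside ordinary words.
RULES = [
    ("test-related", re.compile(r"test", re.IGNORECASE)),
    ("documentation-related", re.compile(r"javadoc|comment", re.IGNORECASE)),
    ("configuration-related", re.compile(r"(?i:\bJDK\b|compil\w*|build)|\bCI\b")),
]


def classify_summary(summary):
    """Return the category of an issue summary, or 'unclassified' if no rule fires."""
    for label, pattern in RULES:
        if pattern.search(summary):
            return label
    return "unclassified"
```

Summaries left as 'unclassified' are the ones that would go through the subsequent manual analysis; a summary that only mentions a word such as “typo” falls into this group, which mirrors the misclassifications reported above.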

A second threat to validity relates to the selection of the metric used to operationalize maintenance effort. We used code churn (Munson and Elbaum 1998): we are aware that this metric can only proxy the actual effort spent when maintaining source code, yet this choice was required in our case because of the unavailability of precise data regarding the maintenance effort in our dataset. Nonetheless, proxy measurements are still used and considered in the field (Shihab et al. 2013). The tools we used to extract metrics, e.g., reusability or CK metrics, represent another potential threat to validity. To mitigate it, we used tools already validated and used by the research community (Giordano et al. 2022; Spadini et al. 2018). Finally, as mentioned in Section 3.1, in Defects4J a single bug can be introduced by multiple factors, but its resolution will always occur within a Java file. Thus, to avoid possible threats to construct validity, we discarded commits that introduced defects caused by issues not involving source code. This allowed us to only focus on defects introduced and resolved through changes to the source files.

Threats to Internal Validity

These threats refer to factors that might have impacted the results of the study. In our context, these might be connected to the selection of the metrics used to build the statistical models. On the one hand, we were interested in understanding the role of reusability metrics and, for this reason, we operationalized implementation and specification inheritance, as well as delegation, following their exact definitions. On the other hand, we used control variables previously shown to be significantly correlated to source code quality (Tamburri et al. 2020; Succi et al. 2005; Chhikara et al. 2011; Daly et al. 1996). Through these actions, we could rely on a set of independent variables and control metrics that come from either our working hypotheses or the state of the art.

Threats to Conclusion Validity

Threats related to this category refer to the selection and the use of the statistical tests. When addressing RQ\(_2\) we modeled the problem using a Multinomial Logistic Linear model (Theil 1969), while we built a Generalized Linear model (Faraway 2016) in the context of RQ\(_3\). These choices come from the nature of our response variables, i.e., multiclass and continuous, respectively. Moreover, the research community used these types of models in similar contexts (Catolino et al. 2021; Giordano et al. 2022; Lambiase et al. 2022). The empirical analysis conducted in this study had a quantitative connotation and, in particular, we sought to understand the relation between code reusability and defects through statistical modeling. Nonetheless, we are aware that more qualitative investigations aiming at linking the root cause of defects with the reuse mechanisms might reveal further insights into the matter. While a more complete overview of this type is part of our future research agenda, in the context of this work we already provided some preliminary insights through the manual analysis discussed in Section 5. Such an analysis was in line with the statistical conclusions drawn when addressing RQ\(_2\) and RQ\(_3\), increasing our confidence in the results reported in the paper.
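As an illustration of how a multinomial logistic model of the kind mentioned above for RQ\(_2\) turns predictors into class probabilities, the sketch below applies a softmax to per-class linear predictors. It is only a reading aid under stated assumptions: the class labels ("low", "medium", "high"), the two predictors, and all coefficient values are hypothetical and are not estimates from the study, and the function name `multinomial_logit_probs` is introduced here for illustration.

```python
import math


def multinomial_logit_probs(features, coefficients):
    """Return class probabilities under a multinomial logistic model.

    `coefficients` maps each class label to (intercept, weights); the
    reference class is encoded with an all-zero entry.
    """
    # Linear predictor for every class.
    scores = {
        label: intercept + sum(w * x for w, x in zip(weights, features))
        for label, (intercept, weights) in coefficients.items()
    }
    # Softmax, shifted by the maximum score for numerical stability.
    top = max(scores.values())
    exps = {label: math.exp(score - top) for label, score in scores.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


# Hypothetical coefficients for two predictors (say, a delegation count
# and DIT); these numbers are made up and are NOT results from the paper.
HYPOTHETICAL_COEFFICIENTS = {
    "low": (0.0, [0.0, 0.0]),  # reference class
    "medium": (-0.5, [-0.2, 0.4]),
    "high": (-1.0, [-0.3, 0.8]),
}

probs = multinomial_logit_probs([2.0, 1.0], HYPOTHETICAL_COEFFICIENTS)
for label, p in probs.items():
    print(f"{label}: {p:.3f}")
```

In this formulation, the logarithm of the ratio between a class probability and the reference class probability equals that class's linear predictor, which is how the sign and size of a coefficient are usually read in such models.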

Threats to External Validity

As for the generalizability of the results, the main threat might be connected to the target of our work. In particular, we focused on 12 Java projects having more than 44,900 commits and coming from the Defects4J dataset. As such, our work was based on the analyses conducted on a sample: our generalization strategy can be identified within the sample-based generalization strategies proposed by Wieringa and Daneva (2015). In particular, among those strategies, “statistical learning” seems to be the most appropriate. Wieringa and Daneva (2015) reported that the “descriptions of statistical sample phenomena can be used to predict similar phenomena in new samples. [...]. The goal is not to generalize to a population, but to generalize to the next few cases”. This strategy is basically in line with the generalizing by similarity principle described by Ghaisas et al. (2013). When contextualizing those strategies in our case, it is likely that similar results might be obtained in projects having characteristics similar to those analyzed in our work (see Table 1). Therefore, we cannot claim the generalizability of our findings to projects having different properties or written in different programming languages. Replications in these contexts would still be desirable and are already part of our future research agenda.

7 Conclusion

In this paper, we empirically assessed the evolution of reusability metrics and their impact on defect-proneness and on the maintenance effort required to fix defects. To conduct our analysis, we focused on two specific reusability mechanisms, namely inheritance and delegation. Our empirical study was conducted on the projects available in Defects4J, a well-known dataset reporting a set of Java projects along with their own defects. Notably, we conducted the study using a commit-level granularity, in an effort to provide finer-grained observations into the relevance of reusability mechanisms for handling defects.

First, the results revealed five usage patterns through which specification inheritance, implementation inheritance, and delegation are used throughout software evolution. Second, we discovered that the reusability mechanisms are, overall, associated with a decrease of defect-proneness and maintenance effort. At the same time, we found that other inheritance metrics, like NOC and DIT, relate more to the dependent variables, hence suggesting that it is not the reuse itself that influences defects, but rather the way these mechanisms are used by developers to create hierarchies. These findings raised a number of implications for researchers and practitioners, especially with respect to the need for (1) novel code quality checkers that might monitor how developers adopt reuse mechanisms and how these impact source code quality; (2) revising previously proposed code quality prediction models on the basis of how code reuse evolves over time.

To sum up, this article proposed the following contributions:

  1. The first large-scale empirical study conducted at commit-level to understand how reusability mechanisms are employed by developers over time;

  2. Statistical insights into the relation between three code reuse mechanisms, i.e., implementation inheritance, specification inheritance, and delegation, and defect-proneness of source code, both considering the likelihood of code being defective and the effort required to fix defects;

  3. A publicly available replication package (Giordano et al. 2022), which releases data and scripts used to conduct this study and that can be used by fellow researchers to replicate the study and build on top of our findings.

Our future research agenda will be devoted to the replication of the analyses conducted on different datasets, including projects written in different programming languages, and considering a larger number of code reuse mechanisms, e.g., design patterns. In addition, we plan to conduct qualitative investigations to corroborate the findings of the study. Last but not least, we will work toward the definition of novel code quality monitoring systems and prediction models that exploit the results of our empirical study to improve the support provided to practitioners.