1 Introduction

Software quality is notoriously hard to measure (Kitchenham and Pfleeger 1996). The main reason is that quality is subjective and consists of multiple factors. This idea was formalized by Boehm and McCall in the 1970s (Boehm et al. 1976; McCall et al. 1977). Both introduced a layered approach in which software quality is decomposed into multiple factors. The standard ISO/IEC 9126 (2001) and its successor ISO/IEC 25010 (2011) also approach software quality in this fashion.

All these ideas contain abstract quality factors. However, the question remains which concrete measurements we can perform to evaluate the abstract factors of which software quality consists, i.e., how we measure software quality. Some software quality models recommend concrete measurements, e.g., ColumbusQM (Bakota et al. 2011) and Quamoco (Wagner et al. 2012). Defect prediction researchers also build (machine learning) models to find a function that maps measurable metrics to the number of defects in the source code. This can also be thought of as software quality evaluation that maps internal software quality, measured by code or process metrics, to external software quality, measured by defects (Fenton and Bieman 2014). The internal and external quality categories can also be mapped to the perfective and corrective maintenance categories after Swanson (1976). Perfective maintenance should increase internal quality while corrective maintenance should increase external quality. Both categories should increase the overall quality of the software. To ease readability, we adopt the perfective and corrective terms defined by Swanson for the rest of the paper when referring to the categories. For general assumptions, we adopt the internal and external quality terms. Internal quality represents what the developer sees, e.g., structure, size, and complexity, while external quality represents what the user sees, e.g., defects.

Software quality models and defect prediction models use static source code metrics as a proxy for quality (Hosseini et al. 2017). The intuition is that complex code, as measured by static source code metrics, is harder to reason about and, therefore, more prone to errors. However, recent research by Peitek et al. (2021) showed that measured code complexity is perceived very differently between developers and does not translate well to code understanding. A similar result was found by Scalabrino et al. (2021), although their work focuses on readability measured in a static way. Both studies, due to their nature, observe developers in a controlled experiment with code snippets. To supplement these results, it would be interesting to measure what developers change in their code “in the wild” to improve software quality and whether their intent matches what we can measure, e.g., whether complexity is reduced in a change that intends to improve quality.

While there are multiple publications on maintenance or change classification after Swanson (1976), e.g., Mockus (2000), Mauczka et al. (2012), Levin and Yehudai (2017) and Hönel et al. (2019), we are not aware of publications that investigate differences between multiple software metrics for corrective and perfective maintenance as well as their counterparts, i.e., non-perfective and non-corrective changes. The inclusion of these counterparts results in considerable computational effort, as we need every metric for every file in every commit. However, we are able to provide this data via the SmartSHARK ecosystem (Trautsch et al. 2017, 2020b). This additional effort allows us to infer whether categories of changes are different when regarding all changes of a software project. Most recent work focuses on certain aspects instead of a generic overview, e.g., how software metric values change when code smells are removed (Bavota et al. 2015) or refactorings are applied (Bavota et al. 2015; Alshayeb 2009; Pantiuchina et al. 2020).

However, we believe that taking a step back from focused approaches and investigating generic quality improvements is worthwhile. A generic overview has the advantage of mitigating possible problems that can occur with the narrow keywords of topically focused approaches while providing a cohesive picture. Moreover, it allows for generic statements about software quality evolution based on this information and can complement focused approaches.

In this work, we identify changes that are intended to increase quality and measure the current values, previous values, and deltas of common source code metrics used in a current version (Bakota et al. 2014) of the Columbus quality model (Bakota et al. 2011). We use the commit message contained in each change to find commits in which the intent of the developer is to improve software quality. This provides us with a view of corrective and perfective maintenance commits.

Within our study, we first manually classify the commit intent for a sample of 2,533 commits from 54 open source projects. The manual classification is performed by two researchers according to predefined guidelines. According to the overview of previous research in this area provided by AlOmar et al. (2021), our study is, to the best of our knowledge, the largest manual classification study of commits. We use this data as ground truth to fine-tune a state-of-the-art deep learning model for natural language processing that was pre-trained exclusively on software engineering data (von der Mosel et al. 2022). After we determine the performance of the model, we classify all commits, increasing our data to 125,482 commits.

We use the automatically classified data to conduct a two-part study. The first part is a confirmatory study into the expected behavior of metric values for quality increasing changes. The expected behavior, e.g., that complexity is reduced in quality increasing changes, is derived as hypotheses from existing quality models and the related literature.

In case our data matches the expected behavior from the literature, we can confirm the postulated theories and provide evidence in favor of using the measurements. Otherwise, we try to establish which metrics may be unsuitable for quality estimation, including the potential reasons. Furthermore, we determine whether metrics used in software quality models are impacted by quality increasing maintenance, thereby providing an evaluation of software quality measurement metrics.

The second part of our study is of exploratory nature. We investigate which files are the target of quality improvements by the developers. We explore whether only complex files receive perfective changes and which metric values are indicative of corrective changes. This provides practitioners and static analysis tool vendors with data on boundary values that are likely to have a positive impact on the quality of source code from the perspective of the developers.

Overall, our work provides the following contributions:

  • A large data set of manually classified commit intents with categories for improving internal and external quality.

  • A confirmatory study of changes in size and complexity metric values as well as static analysis warnings for quality improvements.

  • An exploratory study of size and complexity metric values as well as static analysis warnings of files that are the target of quality improvements.

  • A fine-tuned state-of-the-art deep learning model for automatic classification of commit intents.

The main findings of our study are the following:

  • We confirm previous work that quality increasing commits are smaller than changes unrelated to quality.

  • While perfective changes have a positive impact on most static source code metric values and static analysis warnings, corrective changes have a negative impact on size and complexity.

  • The files that are the target of perfective changes are already less complex and smaller than files which are not the target of perfective changes.

  • The files that are the target of corrective changes are more complex and larger than files which are not the target of corrective changes.

The remainder of this paper is structured as follows. In Section 2, we define our research questions and hypotheses. In Section 3, we discuss the previous work related to our study. Section 4 contains our case study design with descriptions for subject selection as well as data sources and analysis procedure. In Section 5, we present the results of our case study and discuss them in Section 6. Section 7 lists our identified threats to validity and Section 8 closes with a conclusion of our work.

2 Research Questions and Hypotheses

In our study, we answer two research questions.

  • RQ1: Does developer intent to improve internal or external quality have a positive impact on software metric values? Previous work provides us with certain indications about the impact on software metric values. This is part of our confirmatory study, and we derive two hypotheses from previous work regarding how size and software metric values should change for different types of quality improvement. We formulate our assumptions as hypotheses and test them in our case study.

    • H1: Intended quality improvements are smaller than non-perfective and non-corrective changes. Mockus (2000) found that corrective changes modify fewer lines while perfective changes delete more lines. Purushothaman and Perry (2005) also observed more deletions for perfective maintenance and an overall smaller size of perfective and corrective maintenance. Both studies provide measurements on which we base our hypothesis. While both studies use the same closed source project, we are able to see whether our assumption holds for multiple Java open source projects.

      Hönel et al. (2019) used size-based metrics as additional features for an automated approach to classify maintenance types. They found that the size-based metric values increased the classification performance. Moreover, just-in-time quality assurance (Kamei et al. 2013) builds on the assumption that changes and metrics derived from these changes can predict bug introduction, meaning there should be a difference. Therefore, we hypothesize that corrective as well as perfective maintenance consist of smaller changes. The addition of features should result in larger changes than both; therefore, we assume that the categories we are interested in, perfective and corrective, are smaller than other non-perfective and non-corrective changes.

    • H2: Intended quality improvements impact software quality metric values in a positive way. In this paper, we focus on metrics used in the Columbus Quality Model (Bakota et al. 2011, 2014). The metrics are specifically chosen for a quality model, so they should provide different measurements depending on the maintenance category. Prior research, e.g., Chávez et al. (2017) and Stroggylos and Spinellis (2007), found that refactorings, which are part of our classification, have a measurable impact on software metric values. We hypothesize that an improvement consciously applied by a developer via a perfective commit has a measurable, positive impact on software metric values. Positive means that we expect the metric value to change in a certain direction, e.g., that complexity is reduced. We note the expected direction for each metric together with a description in Table 4.

      Defect prediction research assumes a connection between software metrics and external software quality in the form of bugs. While most publications in defect prediction do not investigate the impact of single bug fixing changes, the most common datasets all contain coupling, size, and complexity metrics as independent variables, e.g., Jureczko and Madeyski (2010), NASA (2004), and D’Ambros et al. (2012); see also the systematic literature review by Hosseini et al. (2017). We hypothesize that fixing bugs via corrective commits has a measurable, positive impact on software metric values. While a bug fix may add complexity, our study compares bug fix changes with all non-corrective changes including feature additions. Therefore, we do not hypothesize that bug fixing decreases complexity in general, but that it decreases complexity in comparison to all non-corrective changes. In contrast to H1, we are not able to compare our results to concrete studies, as we are not aware of a study that investigates metric value changes of perfective and corrective changes and compares them against all other non-perfective and non-corrective changes. Instead, we try to validate the assumption that quality improvements should have a positive impact on software quality metrics, as these metrics were found to improve the detection of defects (Gyimothy et al. 2005).

Our second research question is exploratory in nature.

  • RQ2: What kind of files are the target of internal or external quality improvements? The first part of our study provides us with information about metric value changes for quality increasing commits. In this part, we explore which files are the target of quality increasing commits. We are interested in how complex, e.g., in terms of cyclomatic complexity, a file that receives perfective maintenance is on average. Moreover, on the external quality side, we are interested in which files receive corrective changes. Due to the exploratory nature of this research question, we do not derive hypotheses.

3 Related Work

We separate the discussion of the related work into publications on the classification of changes, publications on the relation between quality improvements and software metrics, and publications with a focus on the commit message.

Most prior work that follows a similar approach to ours is concerned with specific types of quality improving changes, e.g., refactoring and removal of code smells. We note that some code smell detection is based on internal software quality metrics, which we use in our study.

We first present previous research related to the first phase of our study, i.e., classification of changes with respect to maintenance types. Mockus (2000) studied changes in a large system and identified reasons for changes. They find that a textual description of the change can be used to identify the type of change with a keyword based approach, which they validated with a developer survey. The authors classified changes into Swanson's maintenance types. They find that corrective and perfective changes are smaller and that perfective changes delete more lines than other changes. Mauczka et al. (2012) present an automatic keyword based approach for classification into Swanson's maintenance types. They evaluate their approach and provide a keyword list for each maintenance type together with a weight.

Fu et al. (2015) present an approach for change classification that uses latent Dirichlet allocation. They study five open source projects and classify changes into Swanson's maintenance types together with a “not sure” type. The keyword list of their study is based on Mauczka et al. (2012).

Mauczka et al. (2015) collect developer classifications for three different classification schemes. Their data contains 967 commits from six open source projects. While the developers themselves are the best source of information, we believe that within the guidelines of our approach our classifications are similar to those of the developers. We evaluate this assumption in Section 4.2.

Yan et al. (2016) use discriminative topic modeling also based on the keyword list by Mauczka et al. (2012). They focus on changes with multiple categories. Levin and Yehudai (2017) improve maintenance type classification by utilizing source code in addition to keywords. This is an indication that metric values which are computed from source code are impacted by different maintenance types.

Hönel et al. (2019) use size metrics as additional features for automated classification of changes. In our study, we first classify the change and then look at how this impacts size and spread of the change. However, the differences we found in our study support the assumption that size-based features can be used to distinguish change categories.

More recently, Wang et al. (2021) also analyze developer intents from the commit messages. They focus on large review effort code changes instead of quality changes or maintenance types. They also use a keyword based heuristic for the classification. They do not, however, include a perfective maintenance classification.

Ghadhab et al. (2021) also use a deep learning model to classify commits. They use word embeddings from the deep learning model in combination with fine-grained code changes to classify into Swanson's maintenance categories. In contrast to Ghadhab et al., we do not include code changes in our automatic classifications and focus on the commit message.

The classification of changes for the ground truth in our study is based on manual inspection by two researchers instead of a keyword list. We specify guidelines for the classification procedure which enable other researchers to replicate our work. To accept or reject our hypotheses, we only inspect internal and external quality improvements which would correspond to the perfective and corrective maintenance types by Swanson. In contrast to the previous studies, we relate our classified changes also to a set of static software metrics.

We now present research related to the second phase of our study, the relation between intended quality improvements and software metrics. Stroggylos and Spinellis (2007) found changes where the developers intended a refactoring via the commit message. The authors then measured several source code metrics to evaluate the quality change. In contrast to the work of Stroggylos and Spinellis (2007), we do not focus on refactoring keywords. Instead, we consider refactoring as a part of our classification guidelines. Moreover, our aim is to investigate whether the metrics most commonly used as internal quality metrics (see also Al Dallal and Abdin 2018) are the ones that change if developers perform quality improving changes including refactoring.

Fakhoury et al. (2019) investigate the practical impact of software evolution with developer perceived readability improvements on existing readability models. After finding target commits via commit message filtering, they applied state-of-the-art readability models before and after the change and investigated the impact of the change on the resulting readability score.

Pantiuchina et al. (2018) analyze commit messages to extract the intent of the developer to improve certain static source code metrics related to software quality. In contrast to their work, we are not extracting the intent to improve certain static code metrics but instead focus on overall improvement to measure the delta of a multitude of metrics between the improving commit and its parents. Developers may not use the terminology Pantiuchina et al. base their keywords on, e.g., instead of writing reduce coupling or increase cohesion the developer may simply write refactoring or simplify code.

In contrast to the previous studies, we relate developer intents to improve quality, either by perfective or by corrective maintenance, to change size metrics and static source code metrics. In addition, we also look at the mean static source code metrics per file for files that are the target of quality improvements.

As the commit message is used to extract the intent of the developer in our study, we also briefly discuss related work on commit message contents. Most of the work that is not already covered in previous sections builds and evaluates a quality model for commit messages. The proposed quality models are not suitable for our study as is, as they only determine general commit message quality, while we use the message to classify the commit into one of three types. However, they still provide interesting data regarding the content of commit messages.

Santos and Hindle (2016) investigate whether unusual commit messages correlate with build failures using an n-gram language model. The authors find that their language model is able to identify unusual commit messages. However, they did not find a significant correlation between the unusualness of a commit message, as determined by the cross-entropy of their language model, and build failures.

Chahal and Saini (2018) analyze the impact of community dynamics on syntactic quality of commit messages. They define a commit message quality model and use the model to relate community dynamic metrics to commit message quality. They find that a small group of contributors active at the same time can lead to a high quality of commit messages.

Tian et al. (2022) study commit messages in five open source projects and find that an average of about 44% of messages could be improved. They propose a classification model for the quality of commit messages after manually classifying 1,600 commits. In their multi-method study, the authors also provide a taxonomy of commit messages with expression categories. They find that between 0.9% and 7.5% of commit messages contain neither what was changed nor why the change was applied.

4 Case Study Design

The goal of our case study is to gather empirical data about what changes when a developer intends to improve the quality of the code base in comparison to their counterpart, e.g., what changes in perfective commits in comparison to all other, i.e., non-perfective commits.

To achieve this, we first sample a number of commits from our selected study subjects. This sample is classified by two researchers into quality improving changes (perfective and corrective) and other changes. The classification is done only via the commit message, as it expresses the intent of the developer regarding what the change should achieve.

This data is then used to train a model that can confidently classify the rest of our commit messages. The classified commits are then used to investigate the static source code metric value changes to accept or reject our hypotheses in the confirmatory part of our study. After that, we investigate the metric values before the change is applied in the exploratory part of our study.

4.1 Data and Study Subject Selection

The data used in our study is a SmartSHARK (Trautsch et al. 2017) database taken from Trautsch et al. (2020a). We use all projects and commits in the database. However, only commits that change production code and are not empty are considered. For each change in our data, we extract a list of changed files, the number of changed lines, the number of hunks,Footnote 1 and the delta as well as the previous and current value of source code metrics for the changed files between the parent and the current commit. To create our ground truth sample, we randomly sample 2% of commits per project, rounded up, for manual classification.
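As a minimal sketch of this sampling step, the following Python snippet draws a 2% sample per project, rounded up. The file name "commits.csv" and the column names are hypothetical stand-ins; the actual data resides in a SmartSHARK database with a different schema.

```python
import math

import pandas as pd

# Hypothetical flat export of the commit table (one row per commit,
# with a 'project' column); the real SmartSHARK schema differs.
commits = pd.read_csv("commits.csv")

# Draw a random 2% sample of commits per project, rounded up, as the
# pool for the manual classification.
sample = (
    commits.groupby("project", group_keys=False)
           .apply(lambda g: g.sample(n=math.ceil(0.02 * len(g)), random_state=42))
)
sample.to_csv("ground_truth_sample.csv", index=False)
```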

The data consists of Java open source projects under the umbrella of the Apache Software Foundation.Footnote 2 All projects use an issue tracking system and were still active when the data was collected. Each project consists of at least 100 files and 1000 commits and is at least two years old. Table 1 shows every project, the number of commits, and the years of data we consider for sampling. In addition, we include the number of perfective and corrective commits for our ground truth and final classification.

Table 1 Case study subjects with time frame and distribution of commits

4.2 Change Type Classification Guidelines

As we are not relying on a keyword based approach and there is no existing guideline for this kind of classification, we created a guideline based on Herzig et al. (2013). Our ground truth consists of a sample of changes which we manually classified into perfective, corrective, and other changes. We do not consider adaptive changes as a separate category. Instead, we include them in the other changes. The reason is that we focus on internal and external quality improvements and map perfective to internal quality and corrective to external quality. Every commit message is inspected independently by two researchers with software development experience. The inspection uses a graphical front end that loads the sample and displays the commit message, which can then be assigned a label by each researcher independently. If the commit message does not provide enough information, we inspect additional linked information in the form of bug reports or the change itself. In case of a link between the commit message and the issue tracking system, we inspect the bug report and determine whether it is a bug according to the guidelines by Herzig et al. (2013). We perform this step because the reporter of a bug sometimes assigns a wrong type. We defined the guidelines listed in Table 2, which are used by both researchers for the classification of changes. The deep learning model for our final classification of intents only receives the commit messages. This is a conscious trade-off: on the one hand, we want the ground truth to be as exact as possible; on the other hand, we want to keep the automatic intent classification as simple as possible. The results of our fine-tuning evaluation (Table 3) show that the model does not need the additional data from changes and issue reports to perform well.

Table 2 Classification rules and examples, footnotes denote different commit messages from our data
Table 3 Change classification model performance comparison

Both researchers achieve a substantial inter-rater agreement (Landis and Koch 1977) with a Kappa score of 0.66 (Cohen 1960). Disagreements are discussed and resolved with a label both researchers agree upon. The disagreement front end shows both prior labels anonymized in random order.
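For reference, the agreement score can be computed with scikit-learn's implementation of Cohen's kappa; the label lists below are purely illustrative and do not reproduce the study data.

```python
from sklearn.metrics import cohen_kappa_score

# Illustrative labels of the two raters for the same four commits.
labels_researcher_a = ["perfective", "corrective", "other", "perfective"]
labels_researcher_b = ["perfective", "corrective", "other", "corrective"]

kappa = cohen_kappa_score(labels_researcher_a, labels_researcher_b)
print(f"Cohen's kappa: {kappa:.2f}")
```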

In contrast to the classification by Mauczka et al. (2015) and Hattori and Lanza (2008), we do not categorize release tagging, license or copyright corrections as perfective. Our rationale is that these changes are not related to the code quality, which is our main interest in this study.

In Mauczka et al. (2015), the researchers selected six projects and seven developers with personal commitment and provided the developers with the commit messages, which they then labeled according to different classification schemes, one of which is the Swanson classification that matches our study. Each developer labeled a sample of commit messages from their respective project. As we focus on Java, we also use the Java projects of the Mauczka et al. (2015) dataset to validate our guidelines.

Two authors of this paper re-classified the Java projects from Mauczka et al. (2015): Deltaspike, Mylyn-reviews and Tapiji. The commit messages were classified separately first. Disagreements were then resolved together in a separate session. In the first session both authors achieve a substantial inter-rater agreement (Landis and Koch 1977) with a Kappa score of 0.62 (Cohen 1960).

Aside from the classification differences regarding release tagging, license or copyright changes, we noticed further differences. Several commits contain some variation of “minor bugfixes” which are classified as perfective maintenance by the developers or both corrective and perfective, whereas we classify them as corrective. Additionally, code removal or test additions were not classified as perfective changes by the developers, but rather as corrective changes. This reveals a difference of perspective between researchers and developers. We consider pure code removal and test additions as perfective instead of corrective as we think of corrective changes as improving external quality, e.g., by fixing a customer facing bug. The data also contains clean-up and removal messages without a hint of an underlying bug which are classified as corrective by the developers. Based on the information available to us, we cannot decide if these are misclassifications by the developers, the result of differences in the classification guidelines, or misclassifications by us due to lack of in-depth knowledge about the projects.

The authors achieve a substantial inter-rater agreement (Landis and Koch 1977) with the developers yielding a Kappa score of 0.63 (Cohen 1960).

4.3 Deep Learning for Commit Intent Classification

In order to use all available data, we use a deep learning model that classifies all data which is not manually classified into perfective, corrective or other. Due to the size of state-of-the-art deep learning models and the computing requirements for training them, a current best practice is to use a pre-trained model which was trained unsupervised on a large data set. The model is then fine-tuned on labeled data for a specific task.

To achieve a high performance, we use seBERT (von der Mosel et al. 2022), a model that is pre-trained on textual software engineering data with two common Natural Language Processing (NLP) tasks: Masked Language Model (MLM) and Next Sentence Prediction (NSP), which predict randomly masked words in a sentence and the next sentence, respectively. Combined, this allows the model to learn a contextual understanding of the language. While von der Mosel et al. (2022) include a similar benchmark based on our ground truth data, it only used the perfective label, i.e., a binary classification, to demonstrate text classification for software engineering data. In our study, we measure the performance of the multi-class case with all three labels: perfective, corrective, and other. Within this study, we first use our ground truth data to evaluate the multi-class performance of the model. We perform a 10 × 10 cross-validation which splits our data into 10 parts and uses 9 for fine-tuning the model and one for evaluating the performance. The fine-tuning itself splits the data into 80% training and 20% validation. The model is then fine-tuned and evaluated on the validation data for each epoch. At the end, the best epoch is chosen to classify the test data of the fold. This is repeated 10 times for every fold, which yields 100 performance measurements.
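The following sketch illustrates only the 10 × 10 cross-validation protocol, not the actual seBERT fine-tuning: a simple TF-IDF and logistic regression pipeline stands in for the fine-tuned model, and 'messages' and 'labels' are assumed to be NumPy arrays holding the ground-truth commit messages and their labels.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import make_pipeline

# messages: array of commit messages, labels: array of
# 'perfective'/'corrective'/'other' labels (the ground-truth sample).
rskf = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=42)
scores = []
for train_idx, test_idx in rskf.split(messages, labels):
    # stand-in classifier; the study fine-tunes seBERT instead
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(messages[train_idx], labels[train_idx])
    pred = clf.predict(messages[test_idx])
    scores.append(f1_score(labels[test_idx], pred, average="macro"))

# 10 repetitions x 10 folds yield 100 performance measurements
print(f"macro F1: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```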

Our experiment shows sufficient performance, comparable to other state-of-the-art models for commit classification. We provide the final fine-tuned model as well as the fine-tuning code as part of our replication kit for other researchers. Performance-wise, our model is comparable to Ghadhab et al. (2021) and improves upon other studies, e.g., Gharbi et al. (2019) and Levin and Yehudai (2017). However, we note that we fine-tuned the model with only the labels used in our study, i.e., perfective, corrective, and other. Therefore, it cannot be used for or directly compared with models that support other commit classification labels. Since this would require the same data and labels, we can only compare the reported model performance metrics, which we do in Table 3. If we look at the overview of commit classification studies by AlOmar et al. (2021), we can see that our model outperforms the other models for comparable tasks where accuracy or F-measure is given. While this is evidence that our model can perform the required commit intent classification, a thorough comparison of different commit intent classification approaches is not within the scope of this study.

4.4 Metric Selection

The metric selection is based on the Columbus software quality model by Bakota et al. (2011). The metrics are selected from the current version of the model, which is also in use as QualityGate (Bakota et al. 2014). The current model consists of 14 static source code metrics related to size, complexity, documentation, re-usability and fault-proneness. While the quality model provides us with a selection of metrics, we do not use it directly, as it requires a baseline of projects before estimating the quality of a candidate project.

Table 4 shows the metrics utilized in this study, a short description, and the direction in which we assume they change in quality improving commits. As most of the metrics are size and complexity metrics, we expect that their values decrease in comparison to all other commits. The metrics we expect to increase in quality improving commits are commented lines of code, comment density, and API documentation, as added documentation should increase these metrics. The three bottom rows consist of static analysis warnings from PMDFootnote 3 aggregated by severity for every file. We are of the opinion that this selection strikes a good balance of size, complexity, documentation, clone, and coupling based metrics.

Table 4 Static source code metrics and static analysis warning severities used in this study including the expected direction of their values in quality increasing commits

As we are interested in static source code metrics at commit granularity, we sum the metric values for all files that are changed within a commit. In addition, we extract meta information about each change. The static source code metrics are provided by a SmartSHARK plugin using the OpenStaticAnalyzer.Footnote 4 To answer our research question, we provide the delta of the metric value changes as well as their current and previous value.
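As an illustration of this aggregation step, the following sketch assumes a flat per-file metric table with hypothetical columns 'commit', 'metric', 'value_parent', and 'value_current'; the actual SmartSHARK schema and plugin output differ.

```python
import pandas as pd

# Hypothetical flat export: one row per (commit, changed file, metric).
file_metrics = pd.read_csv("file_metrics.csv")

# Sum the metric values over all files changed within a commit ...
per_commit = (
    file_metrics.groupby(["commit", "metric"])[["value_parent", "value_current"]]
                .sum()
                .reset_index()
)

# ... and compute the delta between the current commit and its parent.
per_commit["delta"] = per_commit["value_current"] - per_commit["value_parent"]
```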

4.5 Analysis Procedure

For our confirmatory study as part of RQ1, we compare the difference between two samples. To choose a valid statistical test of whether there is a difference between both samples, we first perform the Shapiro-Wilk test (Wilk and Shapiro 1965) to test for normality of each sample. Since we found that the data is non-normal, we perform the Mann-Whitney U test (Mann and Whitney 1947) to evaluate whether the metric values of one population stochastically dominate those of the other. Since we have an expectation about the direction of metric changes, we perform a one-sided Mann-Whitney U test. The H0 hypothesis is that both samples are the same; the alternative hypothesis is that one sample contains lower or higher values, depending on our expectation. The expected direction of the metric value change is noted in the last column of Table 4.

As our data contains a large number of metrics, we cannot assume that a single statistical test with p < 0.05 is a valid rejection of a H0 hypothesis. To mitigate the problem posed by the high number of statistical tests, we apply Bonferroni correction (Abdi 2007). We choose a significance level of α = 0.05 with Bonferroni correction for 192 statistical tests. These consist of four size metrics with two groups and three statistical tests each as well as 14 source code metrics with two groups and three statistical tests each (normality tests for the two samples and the Mann-Whitney U test for the difference between the samples). The second part is repeated for RQ2. We reject the H0 hypothesis that there is no difference between samples at p < 0.00026.
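As a minimal sketch of this procedure for a single metric, the snippet below assumes two NumPy arrays, 'perfective' and 'non_perfective', holding the metric deltas of the two groups.

```python
from scipy.stats import mannwhitneyu, shapiro

# Normality check with Shapiro-Wilk; in our data both samples are non-normal.
print(shapiro(perfective).pvalue, shapiro(non_perfective).pvalue)

# One-sided Mann-Whitney U test; alternative='less' encodes the expectation
# that the metric (e.g., McCC) is lower in perfective commits. For metrics
# expected to increase (CLOC, CD, AD), alternative='greater' would be used.
result = mannwhitneyu(perfective, non_perfective, alternative="less")

# Bonferroni-corrected significance level for 192 tests.
alpha = 0.05 / 192  # ~0.00026
print(result.pvalue, result.pvalue < alpha)
```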

To calculate the effect size of the Mann-Whitney U test, we use Cliff’s d (Cliff 1993) as a non-parametric effect size measure. We follow a common interpretation of d values (Grissom and Kim 2005): d < 0.10 is negligible, 0.10 ≤ d < 0.33 is small, 0.33 ≤ d < 0.474 is medium, and d ≥ 0.474 is large. We provide the effect size for every difference that is statistically significant.
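A straightforward (if not the most efficient) implementation of Cliff's d and of the interpretation thresholds could look as follows; it is a sketch, not the exact code from our replication kit.

```python
import numpy as np

def cliffs_d(x, y):
    """Cliff's d: P(x > y) - P(x < y) over all pairs of the two samples."""
    x, y = np.asarray(x), np.asarray(y)
    greater = sum(int((xi > y).sum()) for xi in x)
    less = sum(int((xi < y).sum()) for xi in x)
    return (greater - less) / (len(x) * len(y))

def interpret(d):
    d = abs(d)
    if d < 0.10:
        return "negligible"
    if d < 0.33:
        return "small"
    if d < 0.474:
        return "medium"
    return "large"
```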

We report the results visually with box plots. The box plots show three groups: all, perfective, and corrective. This allows us to show the values for each metric for each group and serves to highlight the differences. Additionally, we report the differences between each group and its counterpart, e.g., perfective and non-perfective, in the tables where we report the statistical differences.

A more detailed description of the procedure for each hypothesis follows. For H1, we compare the structure of quality improving changes with every non-perfective and non-corrective change. We compare the size (changed lines) and diffusion (number of hunks, number of changed files) to evaluate the hypothesis. We visualize the results with box plots and report results for statistical tests to determine if the difference in samples is statistically significant.

For H2, we also visualize the results via box plots. As most of the differences hover around zero, we transform the data before plotting via \(\mathrm{sign}(x)\cdot \log (|x| + 1)\). As we are interested in the differences between changes of metric values, we also require \(x \neq 0\ \forall x \in X\), where X is the complete, non-transformed data set for the visualizations. Due to the differences in change size, we provide our data size corrected, e.g., the delta of McCC is divided by the number of modified lines. Additionally, we report the percentage of data that is non-zero to indicate how often the measurements change in our data. In addition to the visualization, we provide a table with differences between the samples and statistical test results.
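The size correction and the visualization transform can be sketched as follows, assuming NumPy arrays 'delta' (metric delta per commit) and 'modified_lines' (changed lines per commit) for a single metric; the variable names are illustrative.

```python
import numpy as np

# Size correction: e.g., delta of McCC divided by the number of modified lines.
size_corrected = delta / modified_lines

# Keep only non-zero values (x != 0 for all x in X) before transforming.
x = size_corrected[size_corrected != 0]

# Symmetric log transform used for the box plots.
plotted = np.sign(x) * np.log(np.abs(x) + 1)
```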

As part of our exploratory study for answering RQ2, we also provide box plots of our metric values. Instead of transformed delta values, we provide the raw averages per file in a change before the change was applied. In addition, we provide the median values of all of our metrics before the change was applied. In this part, we apply a two-sided Mann-Whitney U test, as we have no expectation regarding the direction in which the metrics change for the categories. To complement the visualization, we also provide density plots for both categories. They show the overlap between the perfective and corrective changes.
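A sketch of this per-file analysis for a single metric follows, assuming a pandas DataFrame 'changes' with hypothetical columns 'category', 'n_files', and 'LLOC_before' (the summed LLOC of the changed files before the change); the column names are assumptions, not the actual schema.

```python
import seaborn as sns
from scipy.stats import mannwhitneyu

# Average metric value per file before the change is applied.
changes["lloc_per_file"] = changes["LLOC_before"] / changes["n_files"]

# Two-sided test: perfective vs. non-perfective changes.
perfective = changes.loc[changes["category"] == "perfective", "lloc_per_file"]
rest = changes.loc[changes["category"] != "perfective", "lloc_per_file"]
print(mannwhitneyu(perfective, rest, alternative="two-sided").pvalue)

# Density comparison between the perfective and corrective categories (cf. Fig. 4).
subset = changes[changes["category"].isin(["perfective", "corrective"])]
sns.kdeplot(data=subset, x="lloc_per_file", hue="category")
```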

4.6 Replication Kit

All data and source code can be found in our replication kit (Trautsch et al. 2021). In addition, we provide a small website for this publication that contains all information and where the fine-tuned model can be tested live.Footnote 5

5 Results

In this section, we first present the results for evaluating our hypotheses of our first research question. After that, we describe the results of the exploratory part of our study for our second research question.

5.1 Confirmatory Study

We first present the results of our confirmatory study and evaluate our hypotheses. These results answer our first research question: Does developer intent to improve internal or external quality have a positive impact on software metric values?

5.1.1 Results H1: Intended Quality Improvements are Smaller than Non-perfective and Non-corrective Changes

Figure 1 shows the distribution of sizes between perfective, corrective, and all commits. Table 5 shows the statistical test results for the differences between perfective and non-perfective as well as corrective and non-corrective commits. We can see that perfective commits tend to add fewer lines but remove more lines than the non-perfective commits. When we calculate a median delta between all commits and perfective commits, we find a difference of 28 for added lines and -2 for deleted lines. While the effect sizes are negligible to small, we can see this difference also in Fig. 1. The diffusion of the change over files is also different; however, for the number of modified files the difference is not significant for perfective commits.

Fig. 1 Commit size distribution over all projects for all, perfective and corrective commits. Fliers are omitted

Table 5 Statistical test results for perfective and corrective commits, Mann-Whitney U test p-values (p-val) and effect size (d) with category, n is negligible, s is small

Corrective commits also tend to add less code; while they do not delete as much as perfective commits, the differences in added and deleted lines are also statistically significant. While the effect size is small, we can see the difference in Fig. 1. For corrective commits, we can also see a difference in the number of files changed and the number of hunks modified. This diffusion of the change via the number of files and hunks is also statistically significant, although, again, with a small effect size.

We can conclude that perfective commits tend to remove more lines and generally add fewer lines to the repository. Corrective commits delete fewer lines and add fewer lines than non-corrective commits. Corrective commits are also distributed over fewer hunks and fewer files than non-corrective commits.


5.1.2 Results H2: Intended Quality Improvements Impact Software Quality Metric Values in a Positive Way

We first note that not every metric value changes in every commit of our data. This can be seen in Table 6, which shows the percentage of commits in which each metric value changes for perfective, corrective, and all changes. We can see some differences between metrics, e.g., critical PMD warnings only change in about 7% of commits while LLOC changes in about 75%. There are also differences between categories, e.g., McCC changes in 31% of perfective changes and in 57% of corrective changes.

Table 6 Percentage of commits where the metric value does change on all commits (%NZ), perfective commits (%NZ P) and corrective commits (%NZ C)

To evaluate H2, we present the differences in all changes visually as box plots in Fig. 2, which shows the metric values for all commits, only perfective and only corrective.

Fig. 2 Static source code metric value changes in all, perfective and corrective commits divided by changed lines. Fliers are omitted

In addition, we provide Table 7 which shows the Mann-Whitney U test (Mann and Whitney 1947) p-values, and effect sizes for differences between the types of commits. The differences that are compared in Table 7 are between perfective and non-perfective as well as corrective and non-corrective. We can see that most metric values are different depending on whether they are measured in perfective, corrective, or non-perfective and non-corrective commits. In the following, we discuss the differences for each measured metric value. A description for each metric and the expected direction of metric value change is shown in Table 4.

Table 7 Statistical test results for perfective and corrective commits, Mann-Whitney U test p-values (p-val) and effect size (d) with category, n is negligible, s is small, m is medium

McCC: the cyclomatic complexity of perfective changes is smaller than that of non-perfective changes as well as of all changes combined, even when we do not account for the size of the change. This is expected, as some perfective commits mention simplification of code. For perfective commits, the effect size is medium. Corrective commits, however, have a higher McCC than all commits. This can be seen in Fig. 2: the median of corrective commits is higher than for all commits. Our assumption that McCC is lower in all quality improving commits is not met in this case. While it makes sense that corrective commits add complexity, Table 7 provides a comparison of stochastic dominance between corrective and non-corrective commits, not whether corrective commits remove or add McCC. Thus, this means that changes in corrective commits are more complex than those of non-corrective changes.

LLOC: the difference in LLOC is the most pronounced in our data, even when we do not correct for the size of the change. While manually classifying the commits, we found that code is often removed because it was marked as deprecated before or was no longer needed for other reasons. The effect size for perfective commits is medium. For corrective commits, we can see the same result as for McCC. While we assumed that bug fixes usually add code, we did not expect them to dominate all non-corrective commits including feature additions.

NLE: the nesting level (if-else) is smaller in perfective commits. We expect this is due to simplification and removal of complex code. The box plot in Fig. 2 shows a noticeable difference. This means simplification is a high priority when improving code quality in perfective commits. For corrective commits, we can see the same effect as previously seen for McCC and LLOC: the NLE is not lower but higher for corrective commits. This is more evidence that bug fixes add more complex code. There may be a timing factor involved, e.g., if bug fixes are quick fixes, they would add more complex code without a more extensive refactoring that would decrease the complexity again.

NUMPAR: the number of parameters in a method is also different for perfective commits. This may be a hint at the type of perfective maintenance performed most often in perfective commits. The manual classification showed a lot of commit messages that claimed a simplification of the changed code. This metric would also be impacted by simplification or refactoring operations. Corrective commits also show fewer additions in this metric; while the effect size is negligible, the difference is still statistically significant. Fixing bugs seems to include some code reduction or at least less addition of parameters for methods.

CC: the clone coverage is not different for perfective commits. We would have expected that it is decreasing in perfective commits. However, it seems that clone removal is not a big part of perfective maintenance in our study subjects, which contradicts our expectation. Corrective commits contain a lower clone coverage, however. This could either be because corrective commits introduce fewer new clones than non-corrective commits or because they remove more. A possible reason for clone removal may be the correction of copy and paste related bugs.

CLOC: the comment lines of code show a difference for perfective commits and corrective commits. While we expected CLOC to increase in both types of quality improving commits, the effect size is higher in perfective commits. It seems that bug fixing operations do not add enough comment lines to show a larger difference for corrective commits.

CD: the comment density of perfective commits is not statistically significantly different from non-perfective commits. We would have expected a difference here because perfective maintenance should include additional comments on new or previously uncommented code. We can see a difference for corrective commits here. This shows that the density of comments is also improving in bug fixing operations probably due to clarifications for parts of the code that were fixed.

AD: the API documentation metric does change in perfective and corrective commits compared to non-perfective and non-corrective commits. A reason could be that perfective commits add enough API documentation to make the difference significant. Corrective changes that introduce code in our study subjects seem to almost always include API documentation, therefore we can see a difference here. However, the effect size is negligible in both cases.

NOA: the number of ancestors is lower in perfective commits, as expected. This metric would be affected by simplification and clean up maintenance operations. For corrective commits we can also see a lower value; this hints at some clean up operations happening during bug fixing.

CBO: the coupling between objects is lower after perfective commits. This is expected due to class removal and subsequent decoupling of classes. For corrective commits we can also see a difference. While the effect size is negligible, there is some code clean up happening during bug fixes, e.g., NOA and CC are also lower in corrective than in non-corrective commits.

NII: the number of incoming invocations is lower in both perfective and corrective commits. However, the effect size is small in perfective and negligible in corrective commits. It seems reasonable to see a difference in this metric, because in the case of perfective commits, we have lots of source code removal. However, there are also maintenance activities which are decoupling classes which would also impact this metric. Corrective maintenance seems to involve only limited decoupling operations, also seen in CBO.

Minor: The PMD warnings of minor severity are different in both types of changes. However, we can see that the effect size is larger for perfective changes which makes sense as those warnings can be part of perfective maintenance.

Major: The PMD warnings of major severity are also different in both types of changes. We can see the difference in effect size again and we expect the reason is the same as for Minor.

Critical: The PMD warnings of critical severity are different for both types of changes. Here, the effect size is negligible for both types. However, as they are only changed in about 7% of our commits, they are not changing often regardless of commit type.


5.2 Summary RQ1

In summary, for RQ1 we find that quality increasing commits are smaller than non-perfective and non-corrective changes, that perfective changes have a positive impact on most static source code metric values and static analysis warnings, and that corrective changes have a negative impact on size and complexity metric values.


5.3 Exploratory Study

To answer RQ2 (What kind of files are the target of internal or external quality improvements?), we conduct an exploratory study. We present results on which files are changed in which change category with respect to their metric values. The extracted metrics are considered on a per-change basis, i.e., we divide the metrics by the number of changed files to obtain an average metric value per file. We depict the average metric value per file before the change is applied in Fig. 3 as a box plot. The median for each metric per file is listed in Table 8. This provides a view of the average metric values per file before a perfective or corrective change is applied.

Fig. 3 Static source code metrics divided by the number of changed files before the change is applied. Fliers are omitted

Table 8 Median metric values per file before the change is applied

In addition to the per file metric values, we include a kernel density estimation of the metric values before the change is applied in Fig. 4, where the metric values are depicted per change. This provides an additional view on the differences in densities for metric values before a perfective or corrective change is applied. Figure 3 shows box plots for the metric values of files before the change is applied. We can see that perfective changes are not necessarily applied to complex files. If we compare the median values in Table 8, we can see that perfective changes are applied to smaller, simpler files than the average or corrective change. McCC, LLOC, NLE, NUMPAR and CBO are lower for the files which receive perfective changes, while CLOC, CD, and AD are higher. This means that less complex and well documented files are often the target of perfective changes. If we look at corrective changes, we see that they are applied to more complex and usually larger files. McCC, LLOC, NLE, NUMPAR, CBO, NII as well as Minor, Major and Critical are higher than for all changes or perfective changes. As we consider the metric values before the change is applied, they can be considered pre-bugfix. However, when we consider our results for RQ1, the corrective changes usually increase the complexity even further.

Fig. 4 Kernel density estimation plot of metric values for perfective and corrective categories before the change

Table 9 shows the results of our statistical tests. Analogous to RQ1, we compare the difference between perfective and non-perfective as well as corrective and non-corrective changes. While most metric differences are statistically significant, we observe only small effect sizes for the comment related metrics, while the rest are negligible.

Table 9 Statistical test results for perfective and corrective commits regarding their average metrics before the change, Mann-Whitney U test p-values (p-val) and effect size (d) with category, n is negligible, s is small, m is medium

Figure 4 shows another perspective on our data in the form of a direct comparison of the density between perfective and corrective changes. We can see that McCC, NLE, LLOC, NUMPAR, CD, CBO, NII, and Minor have a lower density for perfective than for corrective changes. While the differences are small, they are noticeable.


6 Discussion

Our results for H1 show that the size differs for both types of commits. The size difference between all commits and perfective as well as corrective commits shows that both tend to be smaller than non-perfective and non-corrective commits. In the case of perfective commits, code is deleted statistically significantly more often.

The differences in change size as well as the increased number of deletions for perfective commits that we found for H1 confirm previous research. The studies by Mockus (2000), Purushothaman and Perry (2005) and Alali et al. (2008) found that perfective maintenance activities are usually smaller. Mockus (2000) as well as Purushothaman and Perry (2005) found that corrective maintenance is also smaller and that perfective maintenance deletes more code. Another indication that size differs between maintenance types can be seen in the work by Hönel et al. (2019), which used size based metrics as predictors for maintenance types and showed that they improved the performance of classification models.

Our results for H2 show statistically significant differences in metric measurements between perfective commits and non-perfective commits. This result indicates a confirmation of the measurements used by quality models, as the majority of metrics change as expected when developers actively improve the internal code quality. This empirical confirmation of the connection between quality metrics and developer intent is one of our main contributions and was, to the best of our knowledge, not part of any prior study. However, there are several examples of prior work that assumed this relationship.

The publications by McCabe (1976) and Chidamber and Kemerer (1994) assume that reducing complexity and coupling metrics increases software quality, which is in line with our developer intents. While all metrics are included in a current ColumbusQM version (Bakota et al. 2014) because we used it as a basis, CBO, McCC, LLOC, and NOA are also part of the SQUALE model (Mordal-Manet et al. 2009), and AD, NLE, McCC, and PMD warnings are also part of Quamoco (Wagner et al. 2012). It seems that developers and the Columbus quality model agree in their view on software quality. We find that most of the metrics used in the quality model change when developers perceive their change as quality increasing. This is also true for most of the metrics shared with the SQUALE model and with the Quamoco quality model. However, the implementation of the metrics may differ between the models. Our work establishes that all these quality models are directly related to intended improvements of the internal code quality by the developers.

Surprisingly, we found only a few statistically significant and non-negligible differences for corrective commits. Not all software metric values change in the expected direction for corrective commits. For example, we can see that McCC, LLOC and NLE are increasing in corrective changes compared to non-corrective commits. While we do not expect them to decrease for every corrective commit, we assumed that in comparison to all non-corrective commits they would be decreasing. Even when considering software aging (Parnas 2001), we would expect the aging to impact all kinds of changes, not just corrective changes. When we look at popular data sets used in the defect prediction domain, we often find coupling, size and complexity software metrics (Herbold et al. 2022). For example, the popular (as per the literature review by Hosseini et al. (2017)) data set by Jureczko and Madeyski (2010) uses such features, but they are also common in more recent data sets, e.g., by Ferenc et al. (2020) or Yatish et al. (2019).

That the most significant difference is in the size of changes could explain various recent findings from the literature, in which size was found to be a very good indicator both for release level defect prediction (Zhou et al. 2018) and just-in-time defect prediction (Huang et al. 2017). This could also be an explanation for possible ceiling effects (Menzies et al. 2008) when such criteria are used, as the differences to non-corrective changes are relatively small. We believe that these aspects should be further considered by the defect prediction community and that more research is required to establish causal relationships between features and defectiveness.

While the work by Peitek et al. (2021) indicates that cyclomatic complexity may not be as indicative of code understandability as expected, we show within our work that it often changes in quality increasing commits. It seems that developers associate overall complexity, as measured by McCC, NLE, and NUMPAR, with code that needs quality improvement. However, as we can see in the exploratory part of our study, the most complex files are usually not targeted by quality increasing changes.

Our exploratory study to answer RQ2 about files that are the target of quality increasing commits reveals additional interesting data. We show that perfective maintenance does not necessarily target files that are in need of it due to high complexity in comparison to non-perfective changes. In fact, low complexity files, as measured by McCC and NLE, are more often part of additional quality increasing work by the developers. This may hint at problems regarding the prioritization of quality improvements in the source code. Maybe errors could have been avoided if perfective changes had targeted more complex files. There could also be effects of different developers or a bias of perfective changes towards simpler code; this warrants future investigation. Corrective changes, in contrast to perfective changes, are applied to files which are large and complex. This was expected; however, combined with the results of RQ1, this means that bugs are fixed in complex and large files and then the files get, on average, even more complex and even larger.

Future work could investigate boundary values according to our data. When we compare the median values of our measurements in Table 8 with current boundary values from PMD, the PMD warning threshold of 80 McCC per file may be too high. A PMD warning triggered at 34 McCC per file would have warned about at least 50% of the files that were in need of a bug fix. However, lowering the boundary will also result in more warnings for files that were not the target of corrective changes.
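To illustrate how such a boundary could be explored, the following sketch compares candidate McCC thresholds against per-file measurements taken before each change. The input file and column names are hypothetical; the sketch only assumes one row per changed file with its McCC value before the change and a flag indicating whether the change was corrective.

```python
# Minimal sketch (hypothetical file and column names): explore McCC warning
# thresholds on per-file measurements taken before each change.
import pandas as pd

# One row per changed file: McCC before the change and a corrective flag.
df = pd.read_csv("file_metrics_before_change.csv")  # columns: mccc, is_corrective

corrective = df[df["is_corrective"]]
non_corrective = df[~df["is_corrective"]]

# The median McCC of bug-fixed files is one candidate warning boundary.
candidate = corrective["mccc"].median()
print(f"candidate threshold (median McCC of bug-fixed files): {candidate:.0f}")

for threshold in (candidate, 80):  # 80 is PMD's default per-class report level
    recall = (corrective["mccc"] >= threshold).mean()
    false_alarms = (non_corrective["mccc"] >= threshold).mean()
    print(f"McCC >= {threshold:.0f}: flags {recall:.0%} of bug-fixed files, "
          f"{false_alarms:.0%} of other files")
```

The trade-off mentioned above is directly visible in the two printed rates: lowering the threshold raises the share of flagged bug-fixed files but also the share of warnings on files that never needed a corrective change.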

6.1 Implications for Researchers

Our results for H1 strengthen previous research by confirming its findings in our study on a larger data set of different projects. Our confirmation that quality increasing changes are smaller than non-perfective and non-corrective changes shows that researchers developing a change classification approach can benefit from including size-based metrics.

Our results for H2 show that perfective changes reduce size and complexity metrics in comparison to non-perfective changes. Previous studies investigating refactorings also found an impact on size and complexity metrics. We are able to generalize this finding by providing results for a superset of refactoring operations, namely perfective changes. This indicates that perfective changes generally reduce size and complexity metrics. It also indicates that software quality models that use the affected metrics in their code quality estimations agree with the developers on what impacts code quality.

Increasing the external quality by fixing bugs, i.e., corrective changes, decreases the internal quality, i.e., it increases complexity metric values. Defect prediction models may assign a higher risk to parts of the code that contained a bug before, as there is an assumption of latent bugs still existing (Kim et al. 2007; Rahman et al. 2011). Our study provides a fine-grained perspective with empirical data which shows that the code quality as measured by static source code metrics is actually decreasing.

This also has implications for researchers developing and deploying defect prediction models in practice. The fact that fixing a bug increases the risk of the file can lead to problems regarding the acceptance of the model by practitioners, as they have no way of reducing the risk (Lewis et al. 2013). The results of our study could help to explain the reasons to developers. We can empirically show that fixing a bug is a complex operation that introduces even more complexity than non-corrective changes, even feature additions. According to our results, the main driver of complexity in a project is bug fixes, and the only way to combat the rising complexity is perfective maintenance, which should especially target large and complex files.

In our results for RQ2, we see a difference between files before corrective changes are applied and before non-corrective changes are applied. This difference is one of the sources of the predictive power of defect prediction models. However, the difference is smaller than expected. Incorporating metrics that have a larger difference in our data, e.g., comment density and API documentation, into defect prediction models may increase their prediction performance.

6.2 Implications for Practitioners

Our results for H2 suggest that, for the most part, software quality models match the expectations of the developers. If practitioners select a software quality model which uses static source code metrics that show a difference in our data, they can expect that the model matches their intuition.

In combination with RQ2, our results indicate that bug fixing is the main driver of complexity in a software project and perfective changes are the main reducer of complexity. This has implications for developers. If more complex files had been targeted for perfective maintenance, bugs could possibly have been prevented. As fixing bugs does not decrease complexity, perfective maintenance is the best way to reduce it and combat the rising complexity of the project as a whole. However, given the results for RQ2, we see that large and complex files are not the main target of perfective maintenance. This is an opportunity for improvement by shifting priorities for perfective maintenance to large and complex files. Moreover, our results indicate that a bug fix should be treated similarly to technical debt regarding its negative impact on complexity metrics. To mitigate this, practitioners should be aware that it would be beneficial to clean up and simplify the code that is introduced as part of the bug fix.

7 Threats to Validity

In this section, we discuss the threats to validity we identified for our work. We discuss four basic types of validity separately as suggested by Wohlin et al. (2000) and include reliability due to our manual classification approach.

7.1 Reliability

We classify changes to a software system retroactively and without the developers. This may introduce a researcher bias into the data and subsequently the results. However, this is a necessity given the size of the data and the unrestricted time frame for the sample and full data, because it would not be feasible to ask developers about commits from years ago. To mitigate this threat, we perform the classification according to labeling guidelines, and every change is independently classified by two researchers. We also compare our labels against a sample of changes classified by the developers themselves from Mauczka et al. (2015) and confirm that we agree on most changes. In addition, we measure the inter-rater agreement between the researchers and find that it is substantial.
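As a brief illustration of how such agreement can be quantified, the sketch below computes Cohen's kappa for two raters. The labels are purely illustrative and not taken from our data; on the commonly used Landis and Koch scale, values between 0.61 and 0.80 are read as substantial agreement.

```python
# Minimal sketch, assuming two label lists (one per rater) over the same commits.
from sklearn.metrics import cohen_kappa_score

rater_a = ["perfective", "corrective", "other", "corrective"]  # illustrative labels
rater_b = ["perfective", "corrective", "other", "perfective"]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 0.61-0.80 is commonly read as substantial
```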

7.2 Construct Validity

Our definition of quality improving may be too broad. We aggregate different types of quality improvements, e.g., improving error messages, the structure of the code, or readability. This may influence the changes we observe within our metric values. While these differences should be studied as well, we believe that a broad overview of generic quality improvements independent of their type has advantages. We avoid the risk of focusing only on structural improvements, e.g., due to the use of generics or new Java features, without missing bigger changes due to the simplification of method code.

7.3 Conclusion Validity

We report differences in metric value changes between perfective and corrective changes over the software development history of our study subjects. We find a difference for perfective commits and only some non-negligible, statistically significant differences for corrective commits. This could be an effect of the sample used as ground truth; however, we chose to draw randomly from a list of commits in our study subjects, so our sample should be representative.

We use a deep learning model to classify all of our commits based on the ground truth we provide. This can introduce a bias or errors into the classification. We note, however, that the non-negligible effect sizes for our results do not change. The quality metric evaluation of only the ground truth data is included in the Appendix and shows similar results. We note that for the small effect sizes we observe, a large number of observations is needed to show a significant difference, as demonstrated by the results in this article when compared to the ground truth.
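The kind of comparison underlying these statements can be sketched as follows. The Mann-Whitney U test and Cliff's delta are common choices for such non-parametric comparisons; the metric deltas in the sketch are illustrative values, not our data, and the exact procedure in the study may differ.

```python
# Minimal sketch: compare per-commit metric deltas (e.g., change in McCC)
# between corrective and non-corrective commits.
import numpy as np
from scipy.stats import mannwhitneyu

corrective_delta = np.array([1.0, 0.0, 2.0, 3.0, 1.0])      # illustrative values
non_corrective_delta = np.array([0.0, 1.0, 0.0, 2.0, 0.0])  # illustrative values

stat, p_value = mannwhitneyu(corrective_delta, non_corrective_delta,
                             alternative="two-sided")

# Cliff's delta as a non-parametric effect size: fraction of pairs where the
# first group is larger minus the fraction where it is smaller.
diffs = corrective_delta[:, None] - non_corrective_delta[None, :]
cliffs_delta = ((diffs > 0).sum() - (diffs < 0).sum()) / diffs.size

print(f"p = {p_value:.3f}, Cliff's delta = {cliffs_delta:.2f}")
# |delta| < 0.147 is commonly interpreted as a negligible effect.
```

With a small effect size, the p-value only drops below common significance levels once the number of observations is large, which mirrors the difference between the full data and the ground truth sample noted above.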

7.4 Internal Validity

A possible threat could be tangled commits which improve quality and at the same time add a feature. We mitigate this in our ground truth by manually inspecting the commit message of every change considered. We excluded tangled commits if this could be determined from the commit message. As no automatic untangling approach is available to us, and existing approaches to label tangled commits already rely on the commit message, we consider tangled commits which are not identifiable from the commit message a minor threat.

Another threat could be a low number of feature additions in our study subjects. Feature additions may happen too infrequently to influence the results, in which case corrective commits would appear to add more complex code than non-corrective commits. While we include some projects that have been in development for a long period of time, we believe this threat is mitigated by the unrestricted time frame of our study.

Bots which commit code (Dey et al. 2020) could be a possible threat to our study. We mitigate this threat by matching our author data against the bot data set provided by Dey et al. (2020). We did not find matches for bots in our data. We were able to detect a Jenkins bot only when dropping the restriction of our case study data that a commit has to change non-test code. We also implemented the detection mechanism by Dey et al. (2020), which uses the username and email of the commit author, as used by Dey et al. to create their bot data set. This also yielded no bots in our data. Manual inspection of the author data yielded two bot-like accounts, which turned out to be a remnant of a previous cvs2svn conversion and the asf-sync-process account, which allows user patches without an account. However, the content of the changes by these accounts was created by developers. We determine that the threat of bots in our data is low.
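A matching step of this kind can be sketched as below. The file and column names are hypothetical; the sketch only assumes a list of commit authors and a list of known bot accounts, each with a name and an email address, and matches them in the same spirit as the detection described above.

```python
# Minimal sketch (hypothetical file and column names): match commit authors
# against a list of known bot accounts by normalized name and email.
import pandas as pd

authors = pd.read_csv("commit_authors.csv")  # columns: name, email
bots = pd.read_csv("bot_accounts.csv")       # columns: name, email

def normalize(series):
    # Lowercase and strip whitespace so trivial formatting differences do not
    # prevent a match.
    return series.str.strip().str.lower()

authors["key"] = normalize(authors["name"]) + "|" + normalize(authors["email"])
bots["key"] = normalize(bots["name"]) + "|" + normalize(bots["email"])

matches = authors[authors["key"].isin(set(bots["key"]))]
print(f"{len(matches)} potential bot authors found")
```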

Missing information in a commit message could impact our results. Commits in our other category could still be perfective or corrective without this being apparent from the commit message. The study conducted by Tian et al. (2022) found that between 0.9% and 7.5% of commits contain neither why a change was made nor what was changed. This cannot be mapped to our study completely because we do not distinguish between why and what. Moreover, some of the intents we found could map to both, e.g., simplify or clean up. We are not able to mitigate this threat as we extract the intent of the developers only from the commit message.

7.5 External Validity

We focus on a convenience sample of data consisting of Java Open Source projects under the umbrella of the Apache Software Foundation. We consider this a minor threat to external validity. The reason is that although we are limited to one organization, we still have a wide variety of different types of software in our data. We believe that this mitigates the threat posed by the limited variety of project patronage.

Furthermore, we only include Java projects. However, Java is used in a wide variety of projects and remains a popular language. Its age provides us with a long history of data we can utilize in this study. Nevertheless, we note that this study may not generalize to all Java projects, much less to all software projects in other languages.

8 Conclusion

Numerous quality measurements exist, and numerous software quality models try to connect concrete quality metrics with abstract quality factors and sub-factors. Although it seems clear that some static source code metrics influence software quality factors, the question of which and by how much remains. Instead of relying on necessarily limited developer and expert evaluations of source code or changes, we extract metrics from past changes in which developers intended to increase quality, as determined from the commit message.

Within this work, we performed a manual classification of developer intents on a sample of 2,533 commits from 54 Java open source projects by two researchers, independently and guided by classification guidelines. We classify the commits into three categories: perfective maintenance, corrective maintenance, or neither. We further evaluate our classification guidelines by re-classifying a developer-labeled sample. We use the manually labeled data as ground truth to evaluate and then fine-tune a state-of-the-art deep learning model for text classification. The fine-tuned model is then used to classify all available commits into our categories, increasing our data size to 125,482 commits. We extract static source code metrics and static analysis warnings for all 125,482 commits, which allows us to investigate the impact of changes and the distribution of metric values before the changes are applied. Based on the literature, we hypothesize that certain metric values change in a certain direction, e.g., perfective changes reduce complexity. We find that perfective commits more often remove code and generally add fewer lines. Regarding the metric measurements, we find that most metric value changes of perfective commits are significantly different from those of non-perfective commits and have a positive, non-negligible impact on the majority of metric values.
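For readers who want to reproduce this kind of pipeline, the following sketch shows one way to fine-tune a pre-trained transformer for commit message classification. The checkpoint, example messages, and training settings are placeholders and do not reflect the exact model or configuration used in this study.

```python
# Minimal sketch: fine-tune a generic transformer checkpoint on labeled
# commit messages (placeholder data and settings, not the study's setup).
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import Dataset

# Category ids: 0 = perfective, 1 = corrective, 2 = neither.
data = Dataset.from_dict({
    "text": ["refactor parser for readability",  # illustrative messages only
             "fix NPE in scheduler",
             "add CLI flag for output format"],
    "label": [0, 1, 2],
})

checkpoint = "bert-base-uncased"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

tokenized = data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="commit-intent-model", num_train_epochs=3),
    train_dataset=tokenized,
)
trainer.train()
```

The fine-tuned model can then be applied to every remaining commit message to scale the manual labels to the full commit history.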

Surprisingly, we found that corrective changes are more complex and larger than non-corrective changes. It seems that fixing a bug increases not only the size but also the complexity measured via McCC and NLE. As we compare against all non-corrective changes, we expected corrective changes to add less complexity than, e.g., feature additions. We conclude that the process of performing a bug fix tends to add more complex code than non-corrective changes.

We find that complex files are not necessarily the primary target for quality increasing work by developers, including refactoring. On the contrary, we find that perfective quality changes are applied to files that are already less complex than files changed in non-perfective or corrective commits. Files contained in corrective changes, on the other hand, are more complex and usually larger than files contained in either perfective or non-corrective changes. In combination with our first result, this shows that corrective changes are applied to files which are already complex and get even more complex after the change is applied.

While we explored a limited number of metrics and commits, we think that this approach can be used to evaluate further metrics connected with software quality in a meaningful way and to provide practitioners and researchers with additional empirical data.