1 Introduction

Software quality is notoriously hard to measure (Kitchenham and Pfleeger 1996). The main reason is that quality is subjective and consists of multiple factors. This idea was formalized by Boehm and McCall in the 1970s (Boehm et al. 1976; McCall et al. 1977). Both introduced a layered approach in which software quality is decomposed into multiple factors. The standard ISO/IEC 9126 (2001) and its successor ISO/IEC 25010 (2011) also approach software quality in this fashion.

All these ideas contain abstract quality factors. However, the question remains which concrete measurements we can perform to evaluate the abstract factors of which software quality consists, i.e., how we measure software quality. Some software quality models recommend concrete measurements, e.g., ColumbusQM (Bakota et al. 2011) and Quamoco (Wagner et al. 2012). Defect prediction researchers also build (machine learning) models to find a function that maps measurable metrics to the number of defects in the source code. This can also be thought of as software quality evaluation that maps internal software quality, measured by code or process metrics, to external software quality, measured by defects (Fenton and Bieman 2014). The internal and external quality categories can also be mapped to the perfective and corrective maintenance categories after Swanson (1976). Perfective maintenance should increase internal quality while corrective maintenance should increase external quality. Both categories should increase the overall quality of the software. To ease readability, we adopt the perfective and corrective terms defined by Swanson for the rest of the paper when referring to the categories. For general assumptions, we adopt the internal and external quality terms. Internal quality represents what the developer sees, e.g., structure, size, and complexity, while external quality represents what the user sees, e.g., defects.

Software quality models and defect prediction models use static source code metrics as a proxy for quality (Hosseini et al. 2017). The intuition is that complex code, as measured by static source code metrics, is harder to reason about and, therefore, more prone to errors. However, recent research by Peitek et al. (2021) showed that measured code complexity is perceived very differently between developers and does not translate well to code understanding. A similar result was found by Scalabrino et al. (2021), although their work focuses on readability measured in a static way. Both studies, due to their nature, observe developers in a controlled experiment with code snippets. To supplement these results, it would be interesting to measure what developers change in their code “in the wild” to improve software quality and whether their intent matches what we can measure, e.g., whether complexity is reduced in a change that intends to improve quality.

While there are multiple publications on maintenance or change classification after Swanson (1976), e.g., Mockus (2000), Mauczka et al. (2012), Levin and Yehudai (2017) and Hönel et al. (2019), we are not aware of publications that investigate differences between multiple software metrics for corrective and perfective maintenance as well as their counterparts, i.e., non-perfective and non-corrective changes. The inclusion of these counterparts results in considerable computational effort, as we need every metric for every file in every commit. However, we are able to provide this data via the SmartSHARK ecosystem (Trautsch et al. 2017, 2020b). This additional effort allows us to infer whether categories of changes are different when regarding all changes of a software project. Most recent work focuses on certain aspects instead of a generic overview, e.g., how software metric values change when code smells are removed (Bavota et al. 2015) or refactorings are applied (Bavota et al. 2015; Alshayeb 2009; Pantiuchina et al. 2020).

However, we believe that taking a step back from focused approaches and investigating generic quality improvements is worthwhile. A generic overview has the advantage of mitigating possible problems that can occur with the narrow keywords of topically focused approaches while providing a cohesive picture. Moreover, it allows for generic statements about software quality evolution based on this information and can complement focused approaches.

In this work, we identify changes that are intended to increase quality and measure the current values, previous values, and deltas of common source code metrics used in a current version (Bakota et al. 2014) of the Columbus quality model (Bakota et al. 2011). We use the commit message contained in each change to find commits in which the intent of the developer is to improve software quality. This provides us with a view of corrective and perfective maintenance commits.

Within our study, we first manually classify the commit intent for a sample of 2,533 commits from 54 open source projects. The manual classification is performed by two researchers according to predefined guidelines. According to the overview of previous research in this area provided by AlOmar et al. (2021), our study is, to the best of our knowledge, the largest manual classification study of commits. We use this data as ground truth to fine-tune a state-of-the-art deep learning model for natural language processing that was pre-trained exclusively on software engineering data (von der Mosel et al. 2022). After we determine the performance of the model, we classify all commits, increasing our data to 125,482 commits.

We use the automatically classified data to conduct a two-part study. The first part is a confirmatory study into the expected behavior of metric values for quality increasing changes. The expected behavior, e.g., that complexity is reduced in quality increasing changes, is derived as hypotheses from existing quality models and the related literature.

In case our data matches the expected behavior from the literature, we can confirm the postulated theories and provide evidence in favor of using the measurements. Otherwise, we try to establish which metrics may be unsuitable for quality estimation, including the potential reasons. Furthermore, we determine whether metrics used in software quality models are impacted by quality increasing maintenance, thereby providing an evaluation of software quality measurement metrics.

The second part of our study is of exploratory nature. We investigate which files are the target of quality improvements by the developers. We explore whether only complex files receive perfective changes and which metric values are indicative of corrective changes. This provides practitioners and static analysis tool vendors with data on boundary values that are likely to have a positive impact on the quality of source code from the perspective of the developers.

Overall, our work provides the following contributions:

  • A large data set of manually classified commit intents with categories for improving internal and external quality.

  • A confirmatory study of changes in size and complexity metric values as well as static analysis warnings for quality improvements.

  • An exploratory study of size and complexity metric values as well as static analysis warnings of files that are the target of quality improvements.

  • A fine-tuned state-of-the-art deep learning model for automatic classification of commit intents.

The main findings of our study are the following:

  • We confirm previous work that quality increasing commits are smaller than changes unrelated to quality.

  • While perfective changes have a positive impact on most static source code metric values and static analysis warnings, corrective changes have a negative impact on size and complexity.

  • The files that are the target of perfective changes are already less complex and smaller than files which are not the target of perfective changes.

  • The files that are the target of corrective changes are more complex and larger than files which are not the target of corrective changes.

The remainder of this paper is structured as follows. In Section 2, we define our research questions and hypotheses. In Section 3, we discuss the previous work related to our study. Section 4 contains our case study design with descriptions for subject selection as well as data sources and analysis procedure. In Section 5, we present the results of our case study and discuss them in Section 6. Section 7 lists our identified threats to validity and Section 8 closes with a conclusion of our work.

2 Research Questions and Hypotheses

In our study, we answer two research questions.

  • RQ1: Does developer intent to improve internal or external quality have a positive impact on software metric values? Previous work provides us with certain indications about the impact on software metric values. This is part of our confirmatory study, and we derive two hypotheses from previous work regarding how size and software metric values should change for different types of quality improvement. We formulate our assumptions as hypotheses and test them in our case study.

    • H1: Intended quality improvements are smaller than non-perfective and non-corrective changes. Mockus (2000) found that corrective changes modify fewer lines while perfective changes delete more lines. Purushothaman and Perry (2005) also observed more deletions for perfective maintenance and an overall smaller size of perfective and corrective maintenance. Both studies provide measurements on which we base our hypothesis. While both studies use the same closed source project, we are able to see whether our assumption holds for multiple Java open source projects.

      Hönel et al. (2019) used size-based metrics as additional features for an automated approach to classify maintenance types. They found that the size-based metric values increased the classification performance. Moreover, just-in-time quality assurance (Kamei et al. 2013) builds on the assumption that changes and metrics derived from these changes can predict bug introduction, meaning there should be a difference. Therefore, we hypothesize that corrective as well as perfective maintenance consist of smaller changes. The addition of features should result in larger changes than both; therefore, we assume that the categories we are interested in, perfective and corrective, are smaller than other non-perfective and non-corrective changes.

    • H2: Intended quality improvements impact software quality metric values in a positive way. In this paper, we focus on metrics used in the Columbus Quality Model (Bakota et al. 2011, 2014). The metrics are specifically chosen for a quality model, so they should provide different measurements depending on the maintenance category. Prior research, e.g., Chávez et al. (2017) and Stroggylos and Spinellis (2007), found that refactorings, which are part of our classification, have a measurable impact on software metric values. We hypothesize that an improvement consciously applied by a developer via a perfective commit has a measurable, positive impact on software metric values. Positive means that we expect the metric value to change in a certain direction, e.g., that complexity is reduced. We note the expected direction for each metric together with a description in Table 4.

      Defect prediction research assumes a connection between software metrics and external software quality in the form of bugs. While most publications in defect prediction do not investigate the impact of single bug fixing changes, the most common datasets all contain coupling, size, and complexity metrics as independent variables, e.g., Jureczko and Madeyski (2010), NASA (2004), and D’Ambros et al. (2012); see also the systematic literature review by Hosseini et al. (2017). We hypothesize that fixing bugs via corrective commits has a measurable, positive impact on software metric values. While a bug fix may add complexity, our study compares bug fix changes with all non-corrective changes including feature additions. Therefore, we do not hypothesize that bug fixing decreases complexity in general, but that it decreases complexity in comparison to all non-corrective changes. In contrast to H1, we are not able to compare our results to concrete studies, as we are not aware of a study that investigates metric value changes of perfective and corrective changes and compares them against all other non-perfective and non-corrective changes. Instead, we try to validate the assumption that quality improvements should have a positive impact on software quality metrics, as these metrics were found to improve the detection of defects (Gyimothy et al. 2005).

Our second research question is exploratory in nature.

  • RQ2: What kind of files are the target of internal or external quality improvements? The first part of our study provides us with information about metric value changes for quality increasing commits. In this part, we explore which files are the target of quality increasing commits. We are interested in how complex, e.g., in terms of cyclomatic complexity, a file that receives perfective maintenance is on average. Moreover, on the external quality side, we are interested in which files receive corrective changes. Due to the exploratory nature of this research question, we do not derive hypotheses.

3 Related Work

We separate the discussion of the related work into publications on the classification of changes, publications on the relation between quality improvements and software metrics, and publications with a focus on the commit message.

Most prior work that follows a similar approach to ours is concerned with specific types of quality improving changes, e.g., refactoring and removal of code smells. We note that some code smell detection is based on internal software quality metrics, which we use in our study.

We first present previous research related to the first phase of our study, i.e., classification of changes with respect to maintenance types. Mockus (2000) studied changes in a large system and identified reasons for changes. They find that a textual description of the change can be used to identify the type of change with a keyword based approach, which they validated with a developer survey. The authors classified changes into Swanson's maintenance types. They find that corrective and perfective changes are smaller and that perfective changes delete more lines than other changes. Mauczka et al. (2012) present an automatic keyword based approach for classification into Swanson's maintenance types. They evaluate their approach and provide a keyword list for each maintenance type together with a weight.

Fu et al. (2015) present an approach for change classification that uses latent Dirichlet allocation. They study five open source projects and classify changes into Swanson's maintenance types together with a “not sure” type. The keyword list of their study is based on Mauczka et al. (2012).

Mauczka et al. (2015) collect developer classifications for three different classification schemes. Their data contains 967 commits from six open source projects. While the developers themselves are the best source of information, we believe that within the guidelines of our approach our classifications are similar to those of the developers. We evaluate this assumption in Section 4.2.

Yan et al. (2016) use discriminative topic modeling also based on the keyword list by Mauczka et al. (2012). They focus on changes with multiple categories. Levin and Yehudai (2017) improve maintenance type classification by utilizing source code in addition to keywords. This is an indication that metric values which are computed from source code are impacted by different maintenance types.

Hönel et al. (2019) use size metrics as additional features for automated classification of changes. In our study, we first classify the change and then look at how this impacts size and spread of the change. However, the differences we found in our study support the assumption that size-based features can be used to distinguish change categories.

More recently, Wang et al. (2021) also analyze developer intents from the commit messages. They focus on large review effort code changes instead of quality changes or maintenance types. They also use a keyword based heuristic for the classification. They do not, however, include a perfective maintenance classification.

Ghadhab et al. (2021) also use a deep learning model to classify commits. They use word embeddings from the deep learning model in combination with fine-grained code changes to classify into Swanson's maintenance categories. In contrast to Ghadhab et al., we do not include code changes in our automatic classifications and focus on the commit message.

The classification of changes for the ground truth in our study is based on manual inspection by two researchers instead of a keyword list. We specify guidelines for the classification procedure which enable other researchers to replicate our work. To accept or reject our hypotheses, we only inspect internal and external quality improvements which would correspond to the perfective and corrective maintenance types by Swanson. In contrast to the previous studies, we relate our classified changes also to a set of static software metrics.

We now present research related to the second phase of our study, the relation between intended quality improvements and software metrics. Stroggylos and Spinellis (2007) found changes where the developers intended a refactoring via the commit message. The authors then measured several source code metrics to evaluate the quality change. In contrast to the work of Stroggylos and Spinellis (2007), we do not focus on refactoring keywords. Instead, we consider refactoring as a part of our classification guidelines. Moreover, our aim is to investigate whether the metrics most commonly used as internal quality metrics (see also Al Dallal and Abdin 2018) are the ones that change if developers perform quality improving changes including refactoring.

Fakhoury et al. (2019) investigate the practical impact of software evolution with developer perceived readability improvements on existing readability models. After finding target commits via commit message filtering, they applied state-of-the-art readability models before and after the change and investigated the impact of the change on the resulting readability score.

Pantiuchina et al. (2018) analyze commit messages to extract the intent of the developer to improve certain static source code metrics related to software quality. In contrast to their work, we are not extracting the intent to improve certain static code metrics but instead focus on overall improvement to measure the delta of a multitude of metrics between the improving commit and its parents. Developers may not use the terminology Pantiuchina et al. base their keywords on, e.g., instead of writing reduce coupling or increase cohesion the developer may simply write refactoring or simplify code.

In contrast to the previous studies, we relate developer intents to improve quality, either by perfective or by corrective maintenance, to change size metrics and static source code metrics. In addition, we also look at the mean static source code metrics per file for files that are the target of quality improvements.

As the commit message is used to extract the intent of the developer in our study, we also briefly discuss related work on commit message contents. Most of the work that is not already covered in previous sections builds and evaluates a quality model for commit messages. The proposed quality models are not suitable for our study as is, as they only determine general commit message quality, while we use the message to classify the commit into one of three types. However, they still provide interesting data regarding the content of commit messages.

Santos and Hindle (2016) investigate whether unusual commit messages correlate with build failures using an n-gram language model. The authors find that their language model is able to identify unusual commit messages. However, they did not find a significant correlation between the unusualness of a commit message, as determined by the cross-entropy of their language model, and build failures.

Chahal and Saini (2018) analyze the impact of community dynamics on syntactic quality of commit messages. They define a commit message quality model and use the model to relate community dynamic metrics to commit message quality. They find that a small group of contributors active at the same time can lead to a high quality of commit messages.

Tian et al. (2022) study commit messages in five open source projects and find that an average of about 44% of messages could be improved. They propose a classification model for the quality of commit messages after manually classifying 1,600 commits. In their multi-method study, the authors also provide a taxonomy of commit messages with expression categories. They find that between 0.9% and 7.5% of commit messages contain neither what was changed nor why the change was applied.

4 Case Study Design

The goal of our case study is to gather empirical data about what changes when a developer intends to improve the quality of the code base in comparison to their counterpart, e.g., what changes in perfective commits in comparison to all other, i.e., non-perfective commits.

To achieve this, we first sample a number of commits from our selected study subjects. This sample is classified by two researchers into quality improving changes (perfective and corrective) and other changes. The classification is done only via the commit message, as it expresses the intent of the developer regarding what the change should achieve.

This data is then used to train a model that can confidently classify the rest of our commit messages. The classified commits are then used to investigate the static source code metric value changes to accept or reject our hypotheses in the confirmatory part of our study. After that, we investigate the metric values before the change is applied in the exploratory part of our study.

4.1 Data and Study Subject Selection

The data used in our study is a SmartSHARK (Trautsch et al. 2017) database taken from Trautsch et al. (2020a). We use all projects and commits in the database. However, only commits that change production code and are not empty are considered. For each change in our data, we extract a list of changed files, the number of changed lines, the number of hunks,Footnote 1 and the delta as well as the previous and current value of source code metrics for the changed files between the parent and the current commit. To create our ground truth sample, we randomly sample 2% of commits per project, rounded up, for manual classification.
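As a minimal sketch of this sampling step, the following Python snippet draws a 2% sample per project, rounded up. The file name "commits.csv" and the column names are hypothetical stand-ins; the actual data resides in a SmartSHARK database with a different schema.

```python
import math

import pandas as pd

# Hypothetical flat export of the commit table (one row per commit,
# with a 'project' column); the real SmartSHARK schema differs.
commits = pd.read_csv("commits.csv")

# Draw a random 2% sample of commits per project, rounded up, as the
# pool for the manual classification.
sample = (
    commits.groupby("project", group_keys=False)
           .apply(lambda g: g.sample(n=math.ceil(0.02 * len(g)), random_state=42))
)
sample.to_csv("ground_truth_sample.csv", index=False)
```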

The data consists of Java open source projects under the umbrella of the Apache Software Foundation.Footnote 2 All projects use an issue tracking system and were still active when the data was collected. Each project consists of at least 100 files and 1000 commits and is at least two years old. Table 1 shows every project, the number of commits, and the years of data we consider for sampling. In addition, we include the number of perfective and corrective commits for our ground truth and final classification.

Table 1 Case study subjects with time frame and distribution of commits

4.2 Change Type Classification Guidelines

As we are not relying on a keyword based approach and there is no existing guideline for this kind of classification, we created a guideline based on Herzig et al. (2013). Our ground truth consists of a sample of changes which we manually classified into perfective, corrective, and other changes. We do not consider adaptive changes as a separate category. Instead, we include them in the other changes. The reason is that we focus on internal and external quality improvements and map perfective to internal quality and corrective to external quality. Every commit message is inspected independently by two researchers with software development experience. The inspection uses a graphical front end that loads the sample and displays the commit message, which can then be assigned a label by each researcher independently. If the commit message does not provide enough information, we inspect additional linked information in the form of bug reports or the change itself. In case of a link between the commit message and the issue tracking system, we inspect the bug report and determine whether it is a bug according to the guidelines by Herzig et al. (2013). We perform this step because the reporter of a bug sometimes assigns a wrong type. We defined the guidelines listed in Table 2, which are used by both researchers for the classification of changes. The deep learning model for our final classification of intents only receives the commit messages. This is a conscious trade-off: on the one hand, we want the ground truth to be as exact as possible; on the other hand, we want to keep the automatic intent classification as simple as possible. The results of our fine-tuning evaluation (Table 3) show that the model does not need the additional data from changes and issue reports to perform well.

Table 2 Classification rules and examples, footnotes denote different commit messages from our data
Table 3 Change classification model performance comparison

Both researchers achieve a substantial inter-rater agreement (Landis and Koch 1977) with a Kappa score of 0.66 (Cohen 1960). Disagreements are discussed and resolved with a label both researchers agree upon. The disagreement front end shows both prior labels anonymized in random order.
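For reference, the agreement score can be computed with scikit-learn's implementation of Cohen's kappa; the label lists below are purely illustrative and do not reproduce the study data.

```python
from sklearn.metrics import cohen_kappa_score

# Illustrative labels of the two raters for the same four commits.
labels_researcher_a = ["perfective", "corrective", "other", "perfective"]
labels_researcher_b = ["perfective", "corrective", "other", "corrective"]

kappa = cohen_kappa_score(labels_researcher_a, labels_researcher_b)
print(f"Cohen's kappa: {kappa:.2f}")
```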

In contrast to the classification by Mauczka et al. (2015) and Hattori and Lanza (2008), we do not categorize release tagging, license or copyright corrections as perfective. Our rationale is that these changes are not related to the code quality, which is our main interest in this study.

In Mauczka et al. (2015), the researchers selected six projects and seven developers with personal commitment and provided the developers with the commit messages, which they then labeled according to different classification schemes, one of which is the Swanson classification that matches our study. Each developer labeled a sample of commit messages from their respective project. As we focus on Java, we also use the Java projects of the Mauczka et al. (2015) dataset to validate our guidelines.

Two authors of this paper re-classified the Java projects from Mauczka et al. (2015): Deltaspike, Mylyn-reviews and Tapiji. The commit messages were classified separately first. Disagreements were then resolved together in a separate session. In the first session both authors achieve a substantial inter-rater agreement (Landis and Koch 1977) with a Kappa score of 0.62 (Cohen 1960).

Aside from the classification differences regarding release tagging, license or copyright changes, we noticed further differences. Several commits contain some variation of “minor bugfixes” which are classified as perfective maintenance by the developers or both corrective and perfective, whereas we classify them as corrective. Additionally, code removal or test additions were not classified as perfective changes by the developers, but rather as corrective changes. This reveals a difference of perspective between researchers and developers. We consider pure code removal and test additions as perfective instead of corrective as we think of corrective changes as improving external quality, e.g., by fixing a customer facing bug. The data also contains clean-up and removal messages without a hint of an underlying bug which are classified as corrective by the developers. Based on the information available to us, we cannot decide if these are misclassifications by the developers, the result of differences in the classification guidelines, or misclassifications by us due to lack of in-depth knowledge about the projects.

The authors achieve a substantial inter-rater agreement (Landis and Koch 1977) with the developers yielding a Kappa score of 0.63 (Cohen 1960).

4.3 Deep Learning for Commit Intent Classification

In order to use all available data, we use a deep learning model that classifies all data which is not manually classified into perfective, corrective or other. Due to the size of state-of-the-art deep learning models and the computing requirements for training them, a current best practice is to use a pre-trained model which was trained unsupervised on a large data set. The model is then fine-tuned on labeled data for a specific task.

To achieve a high performance, we use seBERT (von der Mosel et al. 2022), a model that is pre-trained on textual software engineering data with two common Natural Language Processing (NLP) tasks: Masked Language Model (MLM) and Next Sentence Prediction (NSP), which predict randomly masked words in a sentence and the next sentence, respectively. Combined, this allows the model to learn a contextual understanding of the language. While von der Mosel et al. (2022) include a similar benchmark based on our ground truth data, it only used the perfective label, i.e., a binary classification, to demonstrate text classification for software engineering data. In our study, we measure the performance of the multi-class case with all three labels: perfective, corrective, and other. Within this study, we first use our ground truth data to evaluate the multi-class performance of the model. We perform a 10 × 10 cross-validation which splits our data into 10 parts and uses 9 for fine-tuning the model and one for evaluating the performance. The fine-tuning itself splits the data into 80% training and 20% validation. The model is then fine-tuned and evaluated on the validation data for each epoch. At the end, the best epoch is chosen to classify the test data of the fold. This is repeated 10 times for every fold, which yields 100 performance measurements.
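The following sketch illustrates only the 10 × 10 cross-validation protocol, not the actual seBERT fine-tuning: a simple TF-IDF and logistic regression pipeline stands in for the fine-tuned model, and 'messages' and 'labels' are assumed to be NumPy arrays holding the ground-truth commit messages and their labels.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import make_pipeline

# messages: array of commit messages, labels: array of
# 'perfective'/'corrective'/'other' labels (the ground-truth sample).
rskf = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=42)
scores = []
for train_idx, test_idx in rskf.split(messages, labels):
    # stand-in classifier; the study fine-tunes seBERT instead
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(messages[train_idx], labels[train_idx])
    pred = clf.predict(messages[test_idx])
    scores.append(f1_score(labels[test_idx], pred, average="macro"))

# 10 repetitions x 10 folds yield 100 performance measurements
print(f"macro F1: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```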

Our experiment shows sufficient performance, comparable to other state-of-the-art models for commit classification. We provide the final fine-tuned model as well as the fine-tuning code as part of our replication kit for other researchers. Performance-wise, our model is comparable to Ghadhab et al. (2021) and improves upon other studies, e.g., Gharbi et al. (2019) and Levin and Yehudai (2017). However, we note that we fine-tuned the model with only the labels used in our study, i.e., perfective, corrective, and other. Therefore, it cannot be used for or directly compared with models that support other commit classification labels. Since this would require the same data and labels, we can only compare the reported model performance metrics, which we do in Table 3. If we look at the overview of commit classification studies by AlOmar et al. (2021), we can see that our model outperforms the other models for comparable tasks where accuracy or F-measure is given. While this is evidence that our model can perform the required commit intent classification, a thorough comparison of different commit intent classification approaches is not within the scope of this study.

4.4 Metric Selection

The metric selection is based on the Columbus software quality model by Bakota et al. (2011). The metrics are selected from the current version of the model, which is also in use as QualityGate (Bakota et al. 2014). The current model consists of 14 static source code metrics related to size, complexity, documentation, re-usability and fault-proneness. While the quality model provides us with a selection of metrics, we do not use it directly, as it requires a baseline of projects before estimating the quality of a candidate project.

Table 4 shows the metrics utilized in this study, a short description, and the direction in which we assume they change in quality improving commits. As most of the metrics are size and complexity metrics, we expect that their values decrease in comparison to all other commits. The metrics we expect to increase in quality improving commits are commented lines of code, comment density, and API documentation, as added documentation should increase these metrics. The three bottom rows consist of static analysis warnings from PMDFootnote 3 aggregated by severity for every file. We are of the opinion that this selection strikes a good balance of size, complexity, documentation, clone, and coupling based metrics.

Table 4 Static source code metrics and static analysis warning severities used in this study including the expected direction of their values in quality increasing commits

As we are interested in static source code metrics at commit granularity, we sum the metric values for all files that are changed within a commit. In addition, we extract meta information about each change. The static source code metrics are provided by a SmartSHARK plugin using the OpenStaticAnalyzer.Footnote 4 To answer our research question, we provide the delta of the metric value changes as well as their current and previous value.
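As an illustration of this aggregation step, the following sketch assumes a flat per-file metric table with hypothetical columns 'commit', 'metric', 'value_parent', and 'value_current'; the actual SmartSHARK schema and plugin output differ.

```python
import pandas as pd

# Hypothetical flat export: one row per (commit, changed file, metric).
file_metrics = pd.read_csv("file_metrics.csv")

# Sum the metric values over all files changed within a commit ...
per_commit = (
    file_metrics.groupby(["commit", "metric"])[["value_parent", "value_current"]]
                .sum()
                .reset_index()
)

# ... and compute the delta between the current commit and its parent.
per_commit["delta"] = per_commit["value_current"] - per_commit["value_parent"]
```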

4.5 Analysis Procedure

For our confirmatory study as part of RQ1, we compare the difference between two samples. To choose a valid statistical test of whether there is a difference between both samples, we first perform the Shapiro-Wilk test (Wilk and Shapiro 1965) to test for normality of each sample. Since we found that the data is non-normal, we perform the Mann-Whitney U test (Mann and Whitney 1947) to evaluate whether the metric values of one population stochastically dominate those of the other. Since we have an expectation about the direction of metric changes, we perform a one-sided Mann-Whitney U test. The H0 hypothesis is that both samples are the same; the alternative hypothesis is that one sample contains lower or higher values, depending on our expectation. The expected direction of the metric value change is noted in the last column of Table 4.

As our data contains a large number of metrics, we cannot assume that a single statistical test with p < 0.05 is a valid rejection of a H0 hypothesis. To mitigate the problem posed by the high number of statistical tests, we apply Bonferroni correction (Abdi 2007). We choose a significance level of α = 0.05 with Bonferroni correction for 192 statistical tests. These consist of four size metrics with two groups and three statistical tests each as well as 14 source code metrics with two groups and three statistical tests each (normality tests for the two samples and the Mann-Whitney U test for the difference between the samples). The second part is repeated for RQ2. We reject the H0 hypothesis that there is no difference between samples at p < 0.00026.
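As a minimal sketch of this procedure for a single metric, the snippet below assumes two NumPy arrays, 'perfective' and 'non_perfective', holding the metric deltas of the two groups.

```python
from scipy.stats import mannwhitneyu, shapiro

# Normality check with Shapiro-Wilk; in our data both samples are non-normal.
print(shapiro(perfective).pvalue, shapiro(non_perfective).pvalue)

# One-sided Mann-Whitney U test; alternative='less' encodes the expectation
# that the metric (e.g., McCC) is lower in perfective commits. For metrics
# expected to increase (CLOC, CD, AD), alternative='greater' would be used.
result = mannwhitneyu(perfective, non_perfective, alternative="less")

# Bonferroni-corrected significance level for 192 tests.
alpha = 0.05 / 192  # ~0.00026
print(result.pvalue, result.pvalue < alpha)
```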

To calculate the effect size of the Mann-Whitney U test, we use Cliff’s d (Cliff 1993) as a non-parametric effect size measure. We follow a common interpretation of d values (Grissom and Kim 2005): d < 0.10 is negligible, 0.10 ≤ d < 0.33 is small, 0.33 ≤ d < 0.474 is medium, and d ≥ 0.474 is large. We provide the effect size for every difference that is statistically significant.
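A straightforward (if not the most efficient) implementation of Cliff's d and of the interpretation thresholds could look as follows; it is a sketch, not the exact code from our replication kit.

```python
import numpy as np

def cliffs_d(x, y):
    """Cliff's d: P(x > y) - P(x < y) over all pairs of the two samples."""
    x, y = np.asarray(x), np.asarray(y)
    greater = sum(int((xi > y).sum()) for xi in x)
    less = sum(int((xi < y).sum()) for xi in x)
    return (greater - less) / (len(x) * len(y))

def interpret(d):
    d = abs(d)
    if d < 0.10:
        return "negligible"
    if d < 0.33:
        return "small"
    if d < 0.474:
        return "medium"
    return "large"
```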

We report the results visually with box plots. The box plots show three groups: all, perfective, and corrective. This allows us to show the values for each metric for each group and serves to highlight the differences. Additionally, we report the differences between each group and its counterpart, e.g., perfective and non-perfective, in the tables where we report the statistical differences.

A more detailed description of the procedure for each hypothesis follows. For H1, we compare the structure of quality improving changes with every non-perfective and non-corrective change. We compare the size (changed lines) and diffusion (number of hunks, number of changed files) to evaluate the hypothesis. We visualize the results with box plots and report results for statistical tests to determine if the difference in samples is statistically significant.

For H2, we also visualize the results via box plots. As most of the differences hover around zero, we transform the data before plotting via \(\mathrm{sign}(x)\cdot \log (|x| + 1)\). As we are interested in the differences between changes of metric values, we also require \(x \neq 0\ \forall x \in X\), where X is the complete, non-transformed data set for the visualizations. Due to the differences in change size, we provide our data size corrected, e.g., the delta of McCC is divided by the number of modified lines. Additionally, we report the percentage of data that is non-zero to indicate how often the measurements change in our data. In addition to the visualization, we provide a table with differences between the samples and statistical test results.
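The size correction and the visualization transform can be sketched as follows, assuming NumPy arrays 'delta' (metric delta per commit) and 'modified_lines' (changed lines per commit) for a single metric; the variable names are illustrative.

```python
import numpy as np

# Size correction: e.g., delta of McCC divided by the number of modified lines.
size_corrected = delta / modified_lines

# Keep only non-zero values (x != 0 for all x in X) before transforming.
x = size_corrected[size_corrected != 0]

# Symmetric log transform used for the box plots.
plotted = np.sign(x) * np.log(np.abs(x) + 1)
```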

As part of our exploratory study for answering RQ2, we also provide box plots of our metric values. Instead of transformed delta values, we provide the raw averages per file in a change before the change was applied. In addition, we provide the median values of all of our metrics before the change was applied. In this part, we apply a two-sided Mann-Whitney U test, as we have no expectation regarding the direction in which the metrics change for the categories. To complement the visualization, we also provide density plots for both categories. They show the overlap between the perfective and corrective changes.
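A sketch of this per-file analysis for a single metric follows, assuming a pandas DataFrame 'changes' with hypothetical columns 'category', 'n_files', and 'LLOC_before' (the summed LLOC of the changed files before the change); the column names are assumptions, not the actual schema.

```python
import seaborn as sns
from scipy.stats import mannwhitneyu

# Average metric value per file before the change is applied.
changes["lloc_per_file"] = changes["LLOC_before"] / changes["n_files"]

# Two-sided test: perfective vs. non-perfective changes.
perfective = changes.loc[changes["category"] == "perfective", "lloc_per_file"]
rest = changes.loc[changes["category"] != "perfective", "lloc_per_file"]
print(mannwhitneyu(perfective, rest, alternative="two-sided").pvalue)

# Density comparison between the perfective and corrective categories (cf. Fig. 4).
subset = changes[changes["category"].isin(["perfective", "corrective"])]
sns.kdeplot(data=subset, x="lloc_per_file", hue="category")
```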

4.6 Replication Kit

All data and source code can be found in our replication kit (Trautsch et al. 2021). In addition, we provide a small website for this publication that contains all information and where the fine-tuned model can be tested live.Footnote 5

5 Results

In this section, we first present the results for evaluating our hypotheses of our first research question. After that, we describe the results of the exploratory part of our study for our second research question.

5.1 Confirmatory Study

We first present the results of our confirmatory study and evaluate our hypotheses. These results answer our first research question: Does developer intent to improve internal or external quality have a positive impact on software metric values?

5.1.1 Results H1: Intended Quality Improvements are Smaller than Non-perfective and Non-corrective Changes

Figure 1 shows the distribution of sizes between perfective, corrective, and all commits. Table 5 shows the statistical test results for the differences between perfective and non-perfective as well as corrective and non-corrective commits. We can see that perfective commits tend to add fewer lines but remove more lines than the non-perfective commits. When we calculate a median delta between all commits and perfective commits, we find a difference of 28 for added lines and -2 for deleted lines. While the effect sizes are negligible to small, we can see this difference also in Fig. 1. The diffusion of the change over files is also different; however, for the number of modified files the difference is not significant for perfective commits.

Fig. 1 Commit size distribution over all projects for all, perfective and corrective commits. Fliers are omitted

Table 5 Statistical test results for perfective and corrective commits, Mann-Whitney U test p-values (p-val) and effect size (d) with category, n is negligible, s is small

Corrective commits also tend to add less code; while they do not delete as much as perfective commits, the differences in added and deleted lines are also statistically significant. While the effect size is small, we can see the difference in Fig. 1. For corrective commits, we can also see a difference in the number of files changed and the number of hunks modified. This diffusion of the change via the number of files and hunks is also statistically significant, although, again, with a small effect size.

We can conclude that perfective commits tend to remove more lines and generally add fewer lines to the repository. Corrective commits delete fewer lines and add fewer lines than non-corrective commits. Corrective commits are also distributed over fewer hunks and fewer files than non-corrective commits.


5.1.2 Results H2: Intended Quality Improvements Impact Software Quality Metric Values in a Positive Way

We first note that not every metric value changes in every commit of our data. This can be seen in Table 6, which shows the percentage of commits in which each metric value changes for perfective, corrective, and all changes. We can see some differences between metrics, e.g., critical PMD warnings only change in about 7% of commits while LLOC changes in about 75%. There are also differences between categories, e.g., McCC changes in 31% of perfective changes and in 57% of corrective changes.

Table 6 Percentage of commits where the metric value does change on all commits (%NZ), perfective commits (%NZ P) and corrective commits (%NZ C)

To evaluate H2, we present the differences in all changes visually as box plots in Fig. 2, which shows the metric values for all commits, only perfective and only corrective.

Fig. 2 Static source code metric value changes in all, perfective and corrective commits divided by changed lines. Fliers are omitted

In addition, we provide Table 7 which shows the Mann-Whitney U test (Mann and Whitney 1947) p-values, and effect sizes for differences between the types of commits. The differences that are compared in Table 7 are between perfective and non-perfective as well as corrective and non-corrective. We can see that most metric values are different depending on whether they are measured in perfective, corrective, or non-perfective and non-corrective commits. In the following, we discuss the differences for each measured metric value. A description for each metric and the expected direction of metric value change is shown in Table 4.

Table 7 Statistical test results for perfective and corrective commits, Mann-Whitney U test p-values (p-val) and effect size (d) with category, n is negligible, s is small, m is medium

McCC: the cyclomatic complexity of perfective changes is smaller than that of non-perfective changes as well as of all changes combined, even when we do not account for the size of the change. This is expected, as some perfective commits mention simplification of code. For perfective commits, the effect size is medium. Corrective commits, however, have a higher McCC than all commits. This can be seen in Fig. 2: the median of corrective commits is higher than for all commits. Our assumption that McCC is lower in all quality improving commits is not met in this case. While it makes sense that corrective commits add complexity, Table 7 provides a comparison of stochastic dominance between corrective and non-corrective commits, not whether corrective commits remove or add McCC. Thus, this means that changes in corrective commits are more complex than those of non-corrective changes.

LLOC: the difference in LLOC is the most pronounced in our data, even when we do not correct for the size of the change. While manually classifying the commits, we found that code is often removed because it was marked as deprecated before or was no longer needed for other reasons. The effect size for perfective commits is medium. For corrective commits, we can see the same result as for McCC. While we assumed that bug fixes usually add code, we did not expect them to dominate all non-corrective commits including feature additions.

NLE: the nesting level (if-else) is smaller in perfective commits. We expect this is due to simplification and removal of complex code. The box plot in Fig. 2 shows a noticeable difference. This means simplification is a high priority when improving code quality in perfective commits. For corrective commits, we can see the same effect as previously seen for McCC and LLOC: the NLE is not lower but higher for corrective commits. This is more evidence that bug fixes add more complex code. There may be a timing factor involved, e.g., if bug fixes are quick fixes, they would add more complex code without a more extensive refactoring that would decrease the complexity again.

NUMPAR: the number of parameters in a method is also different for perfective commits. This may be a hint at the type of perfective maintenance performed most often in perfective commits. The manual classification showed a lot of commit messages that claimed a simplification of the changed code. This metric would also be impacted by simplification or refactoring operations. Corrective commits also show fewer additions in this metric; while the effect size is negligible, the difference is still statistically significant. Fixing bugs seems to include some code reduction or at least less addition of parameters for methods.

CC: the clone coverage is not different for perfective commits. We would have expected that it is decreasing in perfective commits. However, it seems that clone removal is not a big part of perfective maintenance in our study subjects, which contradicts our expectation. Corrective commits contain a lower clone coverage, however. This could either be because corrective commits introduce fewer new clones than non-corrective commits or because they remove more. A possible reason for clone removal may be the correction of copy and paste related bugs.

CLOC: the comment lines of code show a difference for perfective commits and corrective commits. While we expected CLOC to increase in both types of quality improving commits, the effect size is higher in perfective commits. It seems that bug fixing operations do not add enough comment lines to show a larger difference for corrective commits.

CD: the comment density of perfective commits is not statistically significantly different from non-perfective commits. We would have expected a difference here because perfective maintenance should include additional comments on new or previously uncommented code. We can see a difference for corrective commits here. This shows that the density of comments is also improving in bug fixing operations probably due to clarifications for parts of the code that were fixed.

AD: the API documentation metric does change in perfective and corrective commits compared to non-perfective and non-corrective commits. A reason could be that perfective commits add enough API documentation to make the difference significant. Corrective changes that introduce code in our study subjects seem to almost always include API documentation, therefore we can see a difference here. However, the effect size is negligible in both cases.

NOA: the number of ancestors is lower in perfective commits, as expected. This metric would be affected by simplification and clean up maintenance operations. For corrective commits we can also see a lower value; this hints at some clean up operations happening during bug fixing.

CBO: the coupling between objects is lower after perfective commits. This is expected due to class removal and subsequent decoupling of classes. For corrective commits we can also see a difference. While the effect size is negligible, there is some code clean up happening during bug fixes, e.g., NOA and CC are also lower in corrective than in non-corrective commits.

NII: the number of incoming invocations is lower in both perfective and corrective commits. However, the effect size is small in perfective and negligible in corrective commits. It seems reasonable to see a difference in this metric, because in the case of perfective commits, we have lots of source code removal. However, there are also maintenance activities which are decoupling classes which would also impact this metric. Corrective maintenance seems to involve only limited decoupling operations, also seen in CBO.

Minor: The PMD warnings of minor severity are different in both types of changes. However, we can see that the effect size is larger for perfective changes which makes sense as those warnings can be part of perfective maintenance.

Major: The PMD warnings of major severity are also different in both types of changes. We can see the difference in effect size again and we expect the reason is the same as for Minor.

Critical: The PMD warnings of critical severity are different for both types of changes. Here, the effect size is negligible for both types. However, as they are only changed in about 7% of our commits, they are not changing often regardless of commit type.


5.2 Summary RQ1

In summary, for RQ1 we find that quality increasing commits are smaller than non-perfective and non-corrective changes, that perfective changes have a positive impact on most static source code metric values and static analysis warnings, and that corrective changes have a negative impact on size and complexity metric values.


5.3 Exploratory Study

To answer RQ2 (What kind of files are the target of internal or external quality improvements?), we conduct an exploratory study. We present results on which files are changed in which change category with respect to their metric values. The extracted metrics are considered on a per-change basis, i.e., we divide the metrics by the number of changed files to obtain an average metric value per file. We depict the average metric value per file before the change is applied in Fig. 3 as a box plot. The median for each metric per file is listed in Table 8. This provides a view of the average metric values per file before a perfective or corrective change is applied.

Fig. 3 Static source code metrics divided by the number of changed files before the change is applied. Fliers are omitted

Table 8 Median metric values per file before the change is applied

In addition to the per file metric values, we include a kernel density estimation of the metric values before the change is applied in Fig. 4, where the metric values are depicted per change. This provides an additional view on the differences in densities for metric values before a perfective or corrective change is applied. Figure 3 shows box plots for the metric values of files before the change is applied. We can see that perfective changes are not necessarily applied to complex files. If we compare the median values in Table 8, we can see that perfective changes are applied to smaller, simpler files than the average or corrective change. McCC, LLOC, NLE, NUMPAR and CBO are lower for the files which receive perfective changes, while CLOC, CD, and AD are higher. This means that less complex and well documented files are often the target of perfective changes. If we look at corrective changes, we see that they are applied to more complex and usually larger files. McCC, LLOC, NLE, NUMPAR, CBO, NII as well as Minor, Major and Critical are higher than for all changes or perfective changes. As we consider the metric values before the change is applied, they can be considered pre-bugfix. However, when we consider our results for RQ1, the corrective changes usually increase the complexity even further.

Fig. 4 Kernel density estimation plot of metric values for perfective and corrective categories before the change

Table 9 shows the results of our statistical tests. Analogous to RQ1, we compare the difference between perfective and non-perfective as well as corrective and non-corrective changes. While most metric differences are statistically significant, we observe only small effect sizes for the comment related metrics, while the rest are negligible.

Table 9 Statistical test results for perfective and corrective commits regarding their average metrics before the change, Mann-Whitney U test p-values (p-val) and effect size (d) with category, n is negligible, s is small, m is medium

Figure 4 shows another perspective on our data in the form of a direct comparison of the density between perfective and corrective changes. We can see that McCC, NLE, LLOC, NUMPAR, CD, CBO, NII, and Minor have a lower density for perfective than for corrective changes. While the differences are small, they are noticeable.


6 Discussion

Our results for H1 show that the size differs for both types of commits. The size difference between all commits and perfective as well as corrective commits shows that both tend to be smaller than non-perfective and non-corrective commits. In the case of perfective commits, code is deleted statistically significantly more often.

The differences in change size as well as the increased number of deletions for perfective commits that we found for H1 confirm previous research. The studies by Mockus (2000), Purushothaman and Perry (2005) and Alali et al. (2008) found that perfective maintenance activities are usually smaller. Mockus (2000) as well as Purushothaman and Perry (2005) found that corrective maintenance is also smaller and that perfective maintenance deletes more code. Another indication that size differs between maintenance types can be seen in the work by Hönel et al. (2019), which used size based metrics as predictors for maintenance types and showed that they improved the performance of classification models.

Our results for H2 show statistically significant differences in metric measurements between perfective commits and non-perfective commits. This result indicates a confirmation of the measurements used by quality models, as the majority of metrics change as expected when developers actively improve the internal code quality. This empirical confirmation of the connection between quality metrics and developer intent is one of our main contributions and was, to the best of our knowledge, not part of any prior study. However, there are several examples of prior work that assumed this relationship.

The publications by McCabe (1976) and Chidamber and Kemerer (1994) assume that reducing complexity and coupling metrics increases software quality, which is in line with our developer intents. While all metrics are included in a current ColumbusQM version (Bakota et al. 2014) because we used it as a basis, CBO, McCC, LLOC, and NOA are also part of the SQUALE model (Mordal-Manet et al. 2009), and AD, NLE, McCC, and PMD warnings are also part of Quamoco (Wagner et al. 2012). It seems that developers and the Columbus quality model agree in their view on software quality. We find that most of the metrics used in the quality model change when developers perceive their change as quality increasing. This is also true for most of the metrics shared with the SQUALE model and with the Quamoco quality model. However, the implementation of the metrics may differ between the models. Our work establishes that all these quality models are directly related to intended improvements of the internal code quality by the developers.

Surprisingly, we found only a few statistically significant and non-negligible differences for corrective commits. Not all software metric values change in the expected direction for corrective commits. For example, we can see that McCC, LLOC and NLE are increasing in corrective changes compared to non-corrective commits. While we do not expect them to decrease for every corrective commit, we assumed that in comparison to all non-corrective commits they would be decreasing. Even when considering software aging (Parnas 2001), we would expect the aging to impact all kinds of changes, not just corrective changes. When we look at popular data sets used in the defect prediction domain, we often find coupling, size and complexity software metrics (Herbold et al. 2022). For example, the popular (as per the literature review by Hosseini et al. (2017)) data set by Jureczko and Madeyski (2010) uses such features, but they are also common in more recent data sets, e.g., by Ferenc et al. (2020) or Yatish et al. (2019).

That the most significant difference is in the size of changes could explain various recent findings from the literature, in which size was found to be a very good indicator both for release level defect prediction (Zhou et al. 2018) and just-in-time defect prediction (Huang et al. 2017). This could also be an explanation for possible ceiling effects (Menzies et al. 2008) when such criteria are used, as the differences to non-corrective changes are relatively small. We believe that these aspects should be further considered by the defect prediction community and that more research is required to establish causal relationships between features and defectiveness.

While the work by Peitek et al. (2021) indicates that cyclomatic complexity may not be as indicative of code understandability as expected, we show within our work that it often changes in quality increasing commits. It seems that developers associate overall complexity, as measured by McCC, NLE, and NUMPAR, with code that needs quality improvement. However, as we can see in the exploratory part of our study, the most complex files are usually not targeted by quality increasing changes.

Our exploratory study to answer RQ2 about files that are the target of quality increasing commits reveals additional interesting data. We show that perfective maintenance does not necessarily target files that are in need of it due to high complexity in comparison to non-perfective changes. In fact, low complexity files, as measured by McCC and NLE, are more often part of additional quality increasing work by the developers. This may hint at problems regarding the prioritization of quality improvements in the source code. Maybe errors could have been avoided if perfective changes had targeted more complex files. There could also be effects of different developers or a bias of perfective changes towards simpler code; this warrants future investigation. Corrective changes, in contrast to perfective changes, are applied to files which are large and complex. This was expected; however, combined with the results of RQ1, this means that bugs are fixed in complex and large files and then the files get, on average, even more complex and even larger.

Future work could investigate boundary values according to our data. When we compare the median values of our measurements in Table 8 with current boundary values from PMD, the PMD warning threshold of 80 McCC per file may be too high. A PMD warning triggered at 34 McCC per file would have warned about at least 50% of the files that were in need of a bug fix. However, lowering the boundary will also result in more warnings for files that were not the target of corrective changes.
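To illustrate how such a boundary could be explored, the following sketch compares candidate McCC thresholds against per-file measurements taken before each change. The input file and column names are hypothetical; the sketch only assumes one row per changed file with its McCC value before the change and a flag indicating whether the change was corrective.

```python
# Minimal sketch (hypothetical file and column names): explore McCC warning
# thresholds on per-file measurements taken before each change.
import pandas as pd

# One row per changed file: McCC before the change and a corrective flag.
df = pd.read_csv("file_metrics_before_change.csv")  # columns: mccc, is_corrective

corrective = df[df["is_corrective"]]
non_corrective = df[~df["is_corrective"]]

# The median McCC of bug-fixed files is one candidate warning boundary.
candidate = corrective["mccc"].median()
print(f"candidate threshold (median McCC of bug-fixed files): {candidate:.0f}")

for threshold in (candidate, 80):  # 80 is PMD's default per-class report level
    recall = (corrective["mccc"] >= threshold).mean()
    false_alarms = (non_corrective["mccc"] >= threshold).mean()
    print(f"McCC >= {threshold:.0f}: flags {recall:.0%} of bug-fixed files, "
          f"{false_alarms:.0%} of other files")
```

The trade-off mentioned above is directly visible in the two printed rates: lowering the threshold raises the share of flagged bug-fixed files but also the share of warnings on files that never needed a corrective change.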

6.1 Implications for Researchers

Our results for H1 strengthen previous research by confirming its findings in our study on a larger data set of different projects. Our confirmation that quality increasing changes are smaller than non-perfective and non-corrective changes shows that researchers developing a change classification approach can benefit from including size-based metrics.

Our results for H2 show that perfective changes reduce size and complexity metrics in comparison to non-perfective changes. Previous studies investigating refactorings also found an impact on size and complexity metrics. We are able to generalize this finding by providing results for a superset of refactoring operations, namely perfective changes. This indicates that perfective changes generally reduce size and complexity metrics. It also indicates that software quality models that use the affected metrics in their code quality estimations agree with the developers on what impacts code quality.

Increasing the external quality by fixing bugs, i.e., corrective changes, decreases the internal quality, i.e., it increases complexity metric values. Defect prediction models may assign a higher risk to parts of the code that contained a bug before, as there is an assumption of latent bugs still existing (Kim et al. 2007; Rahman et al. 2011). Our study provides a fine-grained perspective with empirical data which shows that the code quality as measured by static source code metrics is actually decreasing.

This also has implications for researchers developing and deploying defect prediction models in practice. The fact that fixing a bug increases the risk of the file can lead to problems regarding the acceptance of the model by practitioners, as they have no way of reducing the risk (Lewis et al. 2013). The results of our study could help to explain the reasons to developers. We can empirically show that fixing a bug is a complex operation that introduces even more complexity than non-corrective changes, even feature additions. According to our results, the main driver of complexity in a project is bug fixes, and the only way to combat the rising complexity is perfective maintenance, which should especially target large and complex files.

In our results for RQ2, we see a difference between files before corrective changes are applied and before non-corrective changes are applied. This difference is one of the sources of the predictive power of defect prediction models. However, the difference is smaller than expected. Incorporating metrics that have a larger difference in our data, e.g., comment density and API documentation, into defect prediction models may increase their prediction performance.

6.2 Implications for Practitioners

Our results for H2 suggest that, for the most part, software quality models match the expectations of the developers. If practitioners select a software quality model which uses static source code metrics that show a difference in our data, they can expect that the model matches their intuition.

In combination with RQ2, our results indicate that bug fixing is the main driver of complexity in a software project and perfective changes are the main reducer of complexity. This has implications for developers. If more complex files had been targeted for perfective maintenance, bugs could possibly have been prevented. As fixing bugs does not decrease complexity, perfective maintenance is the best way to reduce it and combat the rising complexity of the project as a whole. However, given the results for RQ2, we see that large and complex files are not the main target of perfective maintenance. This is an opportunity for improvement by shifting priorities for perfective maintenance to large and complex files. Moreover, our results indicate that a bug fix should be treated similarly to technical debt regarding its negative impact on complexity metrics. To mitigate this, practitioners should be aware that it would be beneficial to clean up and simplify the code that is introduced as part of the bug fix.

7 Threats to Validity

In this section, we discuss the threats to validity we identified for our work. We discuss four basic types of validity separately as suggested by Wohlin et al. (2000) and include reliability due to our manual classification approach.

7.1 Reliability

We classify changes to a software system retroactively and without the developers. This may introduce a researcher bias into the data and subsequently the results. However, this is a necessity given the size of the data and the unrestricted time frame for the sample and full data, because it would not be feasible to ask developers about commits from years ago. To mitigate this threat, we perform the classification according to labeling guidelines, and every change is independently classified by two researchers. We also compare our labels against a sample of changes classified by the developers themselves from Mauczka et al. (2015) and confirm that we agree on most changes. In addition, we measure the inter-rater agreement between the researchers and find that it is substantial.
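As a brief illustration of how such agreement can be quantified, the sketch below computes Cohen's kappa for two raters. The labels are purely illustrative and not taken from our data; on the commonly used Landis and Koch scale, values between 0.61 and 0.80 are read as substantial agreement.

```python
# Minimal sketch, assuming two label lists (one per rater) over the same commits.
from sklearn.metrics import cohen_kappa_score

rater_a = ["perfective", "corrective", "other", "corrective"]  # illustrative labels
rater_b = ["perfective", "corrective", "other", "perfective"]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 0.61-0.80 is commonly read as substantial
```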

7.2 Construct Validity

Our definition of quality improving may be too broad. We aggregate different types of quality improvements, e.g., improving error messages, the structure of the code, or readability. This may influence the changes we observe within our metric values. While these differences should be studied as well, we believe that a broad overview of generic quality improvements independent of their type has advantages. We avoid the risk of focusing only on structural improvements, e.g., due to the use of generics or new Java features, without missing bigger changes due to the simplification of method code.

7.3 Conclusion Validity

We report differences in metric value changes between perfective and corrective changes over the software development history of our study subjects. We find a difference for perfective commits and only some non-negligible, statistically significant differences for corrective commits. This could be an effect of the sample used as ground truth; however, we chose to draw randomly from a list of commits in our study subjects, so our sample should be representative.

We use a deep learning model to classify all of our commits based on the ground truth we provide. This can introduce a bias or errors into the classification. We note, however, that the non-negligible effect sizes for our results do not change. The quality metric evaluation of only the ground truth data is included in the Appendix and shows similar results. We note that for the small effect sizes we observe, a large number of observations is needed to show a significant difference, as demonstrated by the results in this article when compared to the ground truth.
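The kind of comparison underlying these statements can be sketched as follows. The Mann-Whitney U test and Cliff's delta are common choices for such non-parametric comparisons; the metric deltas in the sketch are illustrative values, not our data, and the exact procedure in the study may differ.

```python
# Minimal sketch: compare per-commit metric deltas (e.g., change in McCC)
# between corrective and non-corrective commits.
import numpy as np
from scipy.stats import mannwhitneyu

corrective_delta = np.array([1.0, 0.0, 2.0, 3.0, 1.0])      # illustrative values
non_corrective_delta = np.array([0.0, 1.0, 0.0, 2.0, 0.0])  # illustrative values

stat, p_value = mannwhitneyu(corrective_delta, non_corrective_delta,
                             alternative="two-sided")

# Cliff's delta as a non-parametric effect size: fraction of pairs where the
# first group is larger minus the fraction where it is smaller.
diffs = corrective_delta[:, None] - non_corrective_delta[None, :]
cliffs_delta = ((diffs > 0).sum() - (diffs < 0).sum()) / diffs.size

print(f"p = {p_value:.3f}, Cliff's delta = {cliffs_delta:.2f}")
# |delta| < 0.147 is commonly interpreted as a negligible effect.
```

With a small effect size, the p-value only drops below common significance levels once the number of observations is large, which mirrors the difference between the full data and the ground truth sample noted above.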

7.4 Internal Validity

A possible threat could be tangled commits which improve quality and at the same time add a feature. We mitigate this in our ground truth by manually inspecting the commit message of every change considered. We excluded tangled commits if this could be determined from the commit message. As no automatic untangling approach is available to us, and existing approaches to label tangled commits already rely on the commit message, we consider tangled commits which are not identifiable from the commit message a minor threat.

Another threat could be a low number of feature additions in our study subjects. Feature additions may happen too infrequently to influence the results, in which case corrective commits would appear to add more complex code than non-corrective commits. While we include some projects that have been in development for a long period of time, we believe this threat is mitigated by the unrestricted time frame of our study.

Bots which commit code (Dey et al. 2020) could be a possible threat to our study. We mitigate this threat by matching our author data against the bot data set provided by Dey et al. (2020). We did not find matches for bots in our data. We were able to detect a Jenkins bot only when dropping the restriction of our case study data that a commit has to change non-test code. We also implemented the detection mechanism by Dey et al. (2020), which uses the username and email of the commit author, as used by Dey et al. to create their bot data set. This also yielded no bots in our data. Manual inspection of the author data yielded two bot-like accounts, which turned out to be a remnant of a previous cvs2svn conversion and the asf-sync-process account, which allows user patches without an account. However, the content of the changes by these accounts was created by developers. We determine that the threat of bots in our data is low.
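A matching step of this kind can be sketched as below. The file and column names are hypothetical; the sketch only assumes a list of commit authors and a list of known bot accounts, each with a name and an email address, and matches them in the same spirit as the detection described above.

```python
# Minimal sketch (hypothetical file and column names): match commit authors
# against a list of known bot accounts by normalized name and email.
import pandas as pd

authors = pd.read_csv("commit_authors.csv")  # columns: name, email
bots = pd.read_csv("bot_accounts.csv")       # columns: name, email

def normalize(series):
    # Lowercase and strip whitespace so trivial formatting differences do not
    # prevent a match.
    return series.str.strip().str.lower()

authors["key"] = normalize(authors["name"]) + "|" + normalize(authors["email"])
bots["key"] = normalize(bots["name"]) + "|" + normalize(bots["email"])

matches = authors[authors["key"].isin(set(bots["key"]))]
print(f"{len(matches)} potential bot authors found")
```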

Missing information in a commit message could impact our results. Commits in our other category could still be perfective or corrective without this being apparent from the commit message. The study conducted by Tian et al. (2022) found that between 0.9% and 7.5% of commits contain neither why a change was made nor what was changed. This cannot be mapped to our study completely because we do not distinguish between why and what. Moreover, some of the intents we found could map to both, e.g., simplify or clean up. We are not able to mitigate this threat as we extract the intent of the developers only from the commit message.

7.5 External Validity

We focus on a convenience sample of data consisting of Java Open Source projects under the umbrella of the Apache Software Foundation. We consider this a minor threat to external validity. The reason is that although we are limited to one organization, we still have a wide variety of different types of software in our data. We believe that this mitigates the threat posed by the limited variety of project patronage.

Furthermore, we only include Java projects. However, Java is used in a wide variety of projects and remains a popular language. Its age provides us with a long history of data we can utilize in this study. Nevertheless, we note that this study may not generalize to all Java projects, much less to all software projects in other languages.

8 Conclusion

Numerous quality measurements exist, and numerous software quality models try to connect concrete quality metrics with abstract quality factors and sub-factors. Although it seems clear that some static source code metrics influence software quality factors, the question of which and by how much remains. Instead of relying on necessarily limited developer and expert evaluations of source code or changes, we extract metrics from past changes in which developers intended to increase quality, as determined from the commit message.

Within this work, we performed a manual classification of developer intents on a sample of 2,533 commits from 54 Java open source projects by two researchers, independently and guided by classification guidelines. We classify the commits into three categories: perfective maintenance, corrective maintenance, or neither. We further evaluate our classification guidelines by re-classifying a developer-labeled sample. We use the manually labeled data as ground truth to evaluate and then fine-tune a state-of-the-art deep learning model for text classification. The fine-tuned model is then used to classify all available commits into our categories, increasing our data size to 125,482 commits. We extract static source code metrics and static analysis warnings for all 125,482 commits, which allows us to investigate the impact of changes and the distribution of metric values before the changes are applied. Based on the literature, we hypothesize that certain metric values change in a certain direction, e.g., perfective changes reduce complexity. We find that perfective commits more often remove code and generally add fewer lines. Regarding the metric measurements, we find that most metric value changes of perfective commits are significantly different from those of non-perfective commits and have a positive, non-negligible impact on the majority of metric values.
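For readers who want to reproduce this kind of pipeline, the following sketch shows one way to fine-tune a pre-trained transformer for commit message classification. The checkpoint, example messages, and training settings are placeholders and do not reflect the exact model or configuration used in this study.

```python
# Minimal sketch: fine-tune a generic transformer checkpoint on labeled
# commit messages (placeholder data and settings, not the study's setup).
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import Dataset

# Category ids: 0 = perfective, 1 = corrective, 2 = neither.
data = Dataset.from_dict({
    "text": ["refactor parser for readability",  # illustrative messages only
             "fix NPE in scheduler",
             "add CLI flag for output format"],
    "label": [0, 1, 2],
})

checkpoint = "bert-base-uncased"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

tokenized = data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="commit-intent-model", num_train_epochs=3),
    train_dataset=tokenized,
)
trainer.train()
```

The fine-tuned model can then be applied to every remaining commit message to scale the manual labels to the full commit history.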

Surprisingly, we found that corrective changes are more complex and larger than non-corrective changes. It seems that fixing a bug increases not only the size but also the complexity measured via McCC and NLE. As we compare against all non-corrective changes, we expected corrective changes to add less complexity than, e.g., feature additions. We conclude that the process of performing a bug fix tends to add more complex code than non-corrective changes.

We find that complex files are not necessarily the primary target for quality increasing work by developers, including refactoring. On the contrary, we find that perfective quality changes are applied to files that are already less complex than files changed in non-perfective or corrective commits. Files contained in corrective changes, on the other hand, are more complex and usually larger than files contained in either perfective or non-corrective changes. In combination with our first result, this shows that corrective changes are applied to files which are already complex and get even more complex after the change is applied.

While we explored a limited number of metrics and commits, we think that this approach can be used to evaluate further metrics connected with software quality in a meaningful way and to provide practitioners and researchers with additional empirical data.