1 Introduction

The exponential growth of software size (Deshpande and Riehle 2008) brings many challenges related to maintainability, release planning, and other software qualities. Consequently, there is a natural demand to predict external product quality factors in order to foresee the future state of software. Maintainability is related to the size, complexity, and documentation of software (Coleman et al. 1994). Size and complexity metrics are commonly used, among other metrics, to predict software maintainability (Riaz et al. 2009). Growing software size and complexity have made it increasingly difficult to select features to be implemented in the next product release and have challenged existing assumptions and approaches for release planning (Jantunen et al. 2011).

Validating software metrics has gained importance as predicting external software qualities becomes ever more important for managing future revisions of software. Researchers have proposed many validation criteria for software metrics over the last 40 years; for example, a list of 47 criteria is reported in a systematic literature study by Meneely et al. (2013), one of which is non-collinearity. Collinearity (also known as multicollinearity) exists between two independent features if they are linearly related. Since prediction models are often multivariate, i.e., use more than one independent feature or metric, it is important that there is no significant collinearity among the independent features. Collinearity results in two major problems (Meloun et al. 2002). First, it makes a model less useful because the individual effects of the independent features on a dependent feature can no longer be isolated. Second, extrapolation is likely to be highly erroneous. Thus, El Emam and Schneidewind (2000) and Dormann et al. (2013) suggested diagnosing collinearity among the independent features for a proper interpretation of regression models.

Many studies have explored correlations between various software metrics such as McCabe’s cyclomatic complexity (Landman et al. 2016; Henry and Selig 1990; Henry et al. 1981; Tashtoush et al. 2014; Jay et al. 2009; Meulen and Revilla 2007), lines of code (Landman et al. 2016; Henry and Selig 1990; Tashtoush et al. 2014; Jay et al. 2009; Meulen and Revilla 2007), Halstead’s metrics (Henry and Selig 1990; Henry et al. 1981; Tashtoush et al. 2014; Meulen and Revilla 2007), Kafura’s information flow (Henry et al. 1981), and number of comments (Meulen and Revilla 2007). Most of these studies have observed that code metrics are highly correlated. However, they do not address whether the measurement types of metrics affect their correlations, which is the primary difference between these studies and our study. Rather than taking the usual route of checking correlations of code metrics, we focus on whether the construction of code metrics (i.e., how they are measured) has an influence on their correlations. Such an investigation is fundamental toward understanding the collinearity of code metrics. A description of the measurement types used in this study is given below.

  • Cumulative: This is the traditional and most common way software code metrics are measured: as a cumulative or running sum. Here, by a revision, we mean a commit, which is a single entry of source code in a repository. For example, if the numbers of total Lines Of Code (LOC) written in a project’s first three revisions are 50, 30, and 30 respectively, the corresponding cumulative measures of ncloc for these revisions are 50, 80, and 110.

  • Density: This measures how prevalent a property is per standard unit of an artifact. Generally, a density is a ratio. In the context of code, we consider 100 LOC as the unit, so the measurement becomes a percentage. Under this convention, such a metric can take values from 0 to 100. For example, the metric comment_lines_density measures lines of comments per 100 lines of code.

  • Average: This is the mean value of a measure with respect to artifacts of a specific type. An example of such a metric is file_complexity, which measures the mean complexity per file.

  • Organic: A metric that measures an artifact from a single revision or from two consecutive revisions, without being influenced by any other revisions in the repository, is organic. We introduce the term organic because, unlike a cumulative measure, such a measure is not affected by the unbounded list of preceding revisions. An organic metric can measure purely from a single revision, e.g., new_lines measures the lines of code (that are not comments) specific to a single revision. It can be zero if no new code is added in a revision; however, it cannot be negative (like a code churn measure). An organic metric can also measure a single revision relative to its one preceding revision; since in this case it reflects a change or delta compared to the preceding revision, it can be positive, negative, or zero. (A short code sketch after this list illustrates the four measurement types.)
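To make the four measurement types concrete, the following small Python sketch computes one metric of each type for a toy three-revision project. The formulas for the derived metrics are illustrative only; the exact definitions used in this study are those listed in Table 2.

```python
# Illustrative sketch (not SonarQube's implementation) of the four
# measurement types for a toy project with three revisions.
new_lines  = [50, 30, 30]    # organic: new lines of code per revision
comments   = [10, 14, 21]    # comment lines present at each revision
files      = [5, 6, 8]       # number of files at each revision
complexity = [20, 26, 40]    # total complexity at each revision

# Cumulative: running sum of the per-revision (organic) values, e.g. ncloc.
ncloc, total = [], 0
for added in new_lines:
    total += added
    ncloc.append(total)                          # -> [50, 80, 110]

# Density: comment lines per 100 lines of code, bounded to [0, 100]
# (one plausible formula; the exact denominator follows Table 2).
comment_lines_density = [100 * c / (n + c) for c, n in zip(comments, ncloc)]

# Average: mean complexity per file at each revision.
file_complexity = [cx / f for cx, f in zip(complexity, files)]

print(ncloc, comment_lines_density, file_complexity, new_lines)
```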

The core idea of this study was developed while following up a previous study (Mamun et al. 2017) in which we focused on the domain-level correlation of metrics from four domains, namely size, complexity, documentation, and duplications. In the follow-up study, we explored correlations at the metric level and observed that the organic metrics consistently have lower correlations. Based on this observation, we initiated and designed the present study by grouping the code metrics according to how they are measured.

Due to the problems of collinearity when building predictive models, many studies have investigated how different metrics are correlated with each other. However, to our knowledge, no study has investigated the impact of measurement types on the correlations of software code metrics. This knowledge is fundamental to a better understanding of the metrics. With the goal of understanding the relationship between measurement types and correlations of software code metrics, this study addresses the following research question.

  • RQ: How do measurement types affect the correlations of code metrics?

This study selected 21 open source projects from eight organizations and analyzed the source code of a total of 11,874 revisions across all projects to extract code metrics. We mined 24 software metrics classified into four categories: cumulative, density, average, and organic.

The complete revision histories of the selected projects were analyzed using a static analysis tool to generate code metrics. The code metrics were then mined from the tool’s database for analysis. Before performing data analysis, the data was explored using various visual and theoretical statistical tools. Based on the nature of the data, we selected Kendall’s τ-B (a non-parametric correlation method) for all selected projects; the motivation for selecting Kendall’s τ-B is discussed in Section 4. Correlation coefficients are divided into different sets based on their level of strength and level of significance. Based on the results up to this point, we transformed all cumulative metrics into organic metrics and ran statistical tests to determine whether there is a significant difference in correlation between these two sets.

Results of this study indicate how the correlations of code metrics are influenced by their measurement types, i.e., the way they are measured. We can see whether intra-category correlations (between metrics of the same category) differ from inter-category correlations (between metrics of different categories). Based on the data analysis, we also report whether there is a significant difference between intra-category correlations of cumulative metrics and intra-category correlations of organic metrics. These understandings are fundamental because they can reveal whether high collinearity between code metrics is due to their measurement types. Such knowledge can help in making an informed decision when selecting code metrics as features for predictive models.

In the following sections of this paper, we first discuss the methodology, including the design of the study, data collection procedures, the nature of the collected data, and data processing. Based on the nature of the data observed in Section 3.3, Section 4 (data analysis method) presents a comparative discussion of applicable correlation methods and the pros and cons of different measures for aggregating results from the data. Section 5 shows results and implications. Based on some of these results, this study performed an additional test; retaining the actual workflow of the study, we have placed the design and execution of this test in Section 5.4. Section 5 also includes the discussion, limitations, and validity threats of this study. Finally, Section 6 summarizes the conclusions.

2 Related Work

Software code metrics are generally known to be highly correlated, as many studies have reported high correlations among various code metrics. A systematic literature review from 2016 (Landman et al. 2016) summarizes 33 articles reporting correlations between McCabe’s cyclomatic complexity (McCabe 1976) and LOC (lines of code). Henry and Selig (1990) reported correlations of five code metrics (LOC, three of Halstead’s software-science metrics (N, V, and E), and McCabe’s cyclomatic complexity). They worked with code written in Pascal and observed three notably high correlations (the values in parentheses indicate the correlation coefficients): Halstead N - Halstead V (0.989), LOC - Halstead N (0.893), and LOC - Halstead V (0.885). Henry et al. (1981) compared three complexity metrics: McCabe’s cyclomatic complexity, Halstead’s effort, and Kafura’s information flow. Taking the UNIX operating system as a subject, they found McCabe’s cyclomatic complexity and Halstead’s effort to be highly correlated, while Kafura’s information flow was found to be independent. On NASA’s open dataset, Tashtoush et al. (2014) studied cyclomatic complexity, Halstead complexity, and LOC. Similar to Henry et al. (1981), they found a strong correlation between cyclomatic complexity and Halstead’s complexity; LOC was observed to be highly correlated with both of these complexity metrics. Jay et al. (2009), in a comprehensive study, also explored the relationship between McCabe’s cyclomatic complexity and LOC. They worked with 1.2 million C, C++, and Java source files randomly selected from the SourceForge code repository and reported that cyclomatic complexity and LOC have a practically perfect linear relationship irrespective of programming language, programmer, code paradigm, and software process. Comparing four internal code metrics (McCabe’s cyclomatic complexity, Halstead volume, LOC, and number of comments), Meulen and Revilla (2007) used 59 specifications, each containing between 111 and 11,495 small (up to 40 KB file size) C/C++ programs. They observed strong correlations between LOC, Halstead volume, and cyclomatic complexity. A recent study by Landman et al. (2016) on extensive Java and C corpora (17.6 million Java methods and 6.3 million C functions) found no linear correlation between cyclomatic complexity and LOC strong enough for the metrics to be considered redundant. This finding contradicts many earlier studies, including Henry et al. (1981), Tashtoush et al. (2014), Saini et al. (2015), Jay et al. (2009), and Meulen and Revilla (2007).

The studies discussed here mostly cover McCabe’s cyclomatic complexity, Halstead’s metrics, and LOC, investigating correlations between them and reporting different results. However, they do not address whether the measurement types of the studied metrics affect the strength of correlations, which is the primary difference between these studies and ours. Rather than taking the usual route of checking correlations of code metrics, we focus on whether the construction of code metrics (i.e., how they are measured) has an influence on their correlations. Such an investigation is fundamental toward understanding the collinearity of code metrics.

Zhou et al. (2009) reported that size metrics have confounding effects on the associations between object-oriented metrics and change-proneness. In a revisited study, Gil and Lalouche (2017) reported similar results about the confounding effect of the size metric. Zhou et al. (2009) elaborately explained the confounding effect and models to identify it in areas such as health sciences and epidemiological research. Gil and Lalouche (2017) used normalization as a way to remove the confounding effect. While they mentioned lower correlation coefficients for the normalized metrics, they did not explicitly report the overall difference between correlations coming from the intra-cumulative and the intra-normalized measures, nor did they report whether there is a statistically significant difference between the two. This is understandable, as their primary focus is on the validity of metrics. Our focus, in contrast, is solely on understanding the effects of measurement types on the correlations of code metrics. We want to understand how much of the collinearity comes from the types of measures and how much of it exists naturally.

There have been studies toward understanding the distributions of software metrics. For example, Wheeldon and Counsell (2003), Concas et al. (2007), and Louridas et al. (2008) have investigated whether power-law distributions are present among software metrics. They reported that various software metrics follow different distributions with long, fat tails. Louridas et al. (2008) also reported correlations among eight software metrics, including LOC and number of methods, and found a high correlation between LOC, number of methods (NOM), and the out-degree of classes. Baxter et al. (2006) reported a similar study but, unlike Wheeldon and Counsell (2003), observed some metrics that do not follow power laws. They opined that their use of a more extensive corpus compared to Wheeldon and Counsell (2003) is the reason for the difference. In addition to looking at the distributions of metrics, Ferreira et al. (2012) attempted to establish thresholds or reference values for six object-oriented metrics. We have also looked at the statistical properties of the studied metrics, including their distributions. However, we have done this as part of our methodology to find appropriate statistical methods, and it is not the main focus of this study.

Chidamber et al. (1998) investigated six Chidamber and Kemerer (CK) metrics and reported high collinearity among three of them: coupling between objects (CBO), response for a class (RFC), and NOM. Succi et al. (2005) studied to what extent collinearity is present in the CK metrics. They suggested avoiding the RFC metric entirely as an input feature for predictive models due to its high collinearity with the other CK metrics. Given the problems of collinearity, Shihab et al. (2010) proposed an approach to select metrics that are less collinear from a set of metrics. These studies mention collinearity as a problem and report collinearity among software metrics or propose methods to select metrics with low collinearity. However, they have not investigated the perspective of measurement types influencing collinearity.

3 Methodology

We have designed this empirical study following the guidelines of Runeson and Höst (2008) on designing, planning, and conducting case studies. This study is explorative, with the intent of finding insights about the relations between code metrics with different measurement types. We designed the study to minimize bias and validity threats and to maximize the reliability of the results, which involves project selection, data extraction, data cleaning, exploring the nature of the data, selecting appropriate statistical analysis methods based on the nature of the data, and being conservative when selecting and instrumenting statistical analyses.

Data sources for this research are open source software projects, more specifically, open source Java projects on GitHub. Java is among the top three most frequently used project languages on GitHub. Since the extracted data is quantitative, the analysis methods used in this study are quantitative. We followed the third-degree data collection method described by Lethbridge et al. (2005). First, the case and the context of the study are defined, followed by data sources and criteria for data collection. Assumptions of the statistical methods are thoroughly checked, which involves exploring the nature of the data and cleaning the data as necessary. Regardless of the measurement type, the extracted data is non-normal to the extent that a meaningful transformation is not possible. Thus, we use non-parametric statistical methods to analyze the data in this study.

3.1 Project Selection

GitHub’s search functionality was used to find candidate projects. However, due to the limited capabilities of GitHub’s search functionality, it was not possible to perform a compound query that would fulfill all our criteria. Project selection was not randomized because we wanted to ensure that the selected projects meet specific criteria (e.g., minimum LOC, minimum commits) and come from well-known development organizations, so as not to raise obvious validity questions, e.g., “the project is unrepresentative because it is a classroom project by a novice programmer.” Thus, finding projects from reputed organizations was exploratory. We started by screening projects from the 14 organizations listed in GitHub’s open source organizations showcase. We then explored whether other well-known organizations that we were aware of, e.g., Apache, host their projects on GitHub but are not in the showcase. For each organization, we queried for Java projects. As we wanted to minimize blocking effects from using multiple languages, we decided to stick with a single programming language; we selected Java as it is a top-ranked programming language on GitHub.

Crawford et al. (2002) presented various methods for classifying software projects. We take a more straightforward approach to ensure that our selected projects are representative with regard to size. A study on the dataset of the International Software Benchmarking Standards Group (ISBSG) classified software projects based on “Rule’s Relative Size Scale”. Its measurements are based on IFPUG MkII and COSMIC, which are also translated into equivalent LOC. The combined distribution of all projects shows that more than 93% of the projects are between S (small) and L (large), where S is estimated as 5300 LOC and L as 150,000 LOC. We roughly followed this finding and selected projects so that project sizes are approximately uniformly distributed within the range of S to L. Our projects range from 4059 LOC to 155,260 LOC, measured on the latest revision of each project. Project sizes were determined with the cloc tool using a bash script that extracts total LOC and Java LOC; a small sketch of this step is shown below. We selected 21 GitHub projects from eight software organizations where Java is tagged as the project language.
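The original size measurement was done with a bash script around cloc; the following Python sketch is an illustrative equivalent, assuming cloc is installed and supports JSON output. The function name and the local path are hypothetical.

```python
# Minimal sketch (assuming cloc with --json output): extract total LOC and
# Java LOC for the working tree of a cloned project.
import json
import subprocess

def project_size(path):
    """Return (total LOC, Java LOC) for the source tree at `path`."""
    out = subprocess.run(["cloc", "--json", path],
                         capture_output=True, text=True, check=True)
    report = json.loads(out.stdout)
    total_loc = report["SUM"]["code"]                 # all languages combined
    java_loc = report.get("Java", {}).get("code", 0)  # Java only
    return total_loc, java_loc

print(project_size("zookeeper"))   # hypothetical local clone
```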

An overview of the selected projects is given in Table 1. In this table, ‘analyzed revisions’ indicates all the commits from which we collected data; these come exclusively from the master branch of the Git repositories. ‘Total revisions’ indicates all commits available in the Git repository, including branches. Even though the selected projects are classified as Java projects, they also contain source code in other languages; thus, ‘total code’ indicates the number of lines of code in all languages and ‘Java code’ indicates only the lines of Java source code. The table field ‘latest commit’ points to the HEAD of a Git repository at the time we downloaded it. The durations of the projects are presented in the whiskers box plot in Fig. 1, showing project durations from five months to 109 months with a median of 43 months. About 33% of the projects are within the fourth quartile, ranging from 57 to 109 months.

Fig. 1: Whiskers box plot showing durations of the selected projects

Table 1 Overview of the selected projects

3.2 Data Collection and Metrics Selection

We used SonarQube to analyze revisions of the selected projects. Kazman et al. (2015) mentioned SonarQube as the de-facto source code analysis tool for automatically calculating technical debt. It has gained popularity in recent years, and Janes et al. (2017) mentioned SonarQube as the de-facto industrial tool for software quality assurance. The tool is based on the SQALE methodology (Letouzey and Ilkiewicz 2012). We used SonarQube version 6.1 and Sonar-Scanner version 2.8.

We ran SonarQube on each revision available in the master branch of a project. Since we ignore sub-branches, the number of analyzed revisions is less than the number of total revisions reported in Table 1. Sub-branches are eventually merged into the master branch, which means we do not lose anything except the granularity of data.

Analyzing 11,874 software revisions needs to be automated. Python scripts were used to automate the process of traversing the commits (revisions) on the master branch of a project’s Git repository and running the SonarQube analysis on each commit; a simplified sketch of this automation is shown below. SonarQube provides web services covering a range of functionalities, including mining analysis results and software metrics. We observed that some metrics, such as new_lines, are visible on SonarQube’s web interface but cannot be mined through the web services. We later found that SonarQube computes some metrics only for the latest software revision and then removes them automatically.
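The actual scripts are available in the public dataset referenced later; the sketch below only illustrates the idea. It walks the master branch oldest to newest and invokes sonar-scanner on every revision; the scanner properties shown are illustrative, and server URL and authentication settings are omitted.

```python
# Simplified sketch: analyze every master-branch commit of a Git repository
# with sonar-scanner, in chronological order. Not the study's actual script.
import subprocess

def analyze_master_branch(repo_path, project_key):
    # List master-branch commits oldest-first.
    log = subprocess.run(
        ["git", "-C", repo_path, "rev-list", "--reverse", "master"],
        capture_output=True, text=True, check=True)
    for sha in log.stdout.split():
        # Check out the revision (detached HEAD) and analyze it.
        subprocess.run(["git", "-C", repo_path, "checkout", "--force", sha],
                       check=True)
        subprocess.run(
            ["sonar-scanner",
             f"-Dsonar.projectKey={project_key}",
             f"-Dsonar.projectBaseDir={repo_path}",
             "-Dsonar.sources=."],
            check=True)

analyze_master_branch("malmo", "malmo")   # hypothetical usage
```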

Since we did not find any option to stop the auto-deletion, we added triggers and additional tables to SonarQube’s SQL database to recover the deleted records. In total, 47 metrics were mined from the database, classified into six major domains: size, documentation, complexity, duplications, issues, and maintainability. The classification is based on what the metrics measure.

In our earlier study (Mamun et al. 2017), we used SonarQube’s classification of metrics and explored domain-level and metric-to-domain-level relationships. From the results of the metric-to-domain-level relationships in that study, we had an indication that metrics that measure artifacts based on individual values from each revision (i.e., organic metrics) result in a lower overall correlation. Since metrics such as new_lines (organic) are inherently different from metrics such as ncloc (cumulative) with respect to how they measure artifacts, this was understandable. However, as we started the follow-up study exploring metric-level correlations, we observed that organic metrics have much lower correlations than other types of metrics. This observation led us to rethink how the metrics should be grouped for comparison, so the grouping criterion changed from “what they measure” to “how they measure” artifacts. We looked at the metrics classified by SonarQube into four domains (size, complexity, documentation, and duplications) based on “what they measure”. Reviewing them, we identified 24 metrics of four measurement types: cumulative, density, average, and organic. Table 2 shows the selected metrics classified into these four measurement types along with a short description and value type, taken from the MySQL database of SonarQube 6.1 and the metric definitions page.

Table 2 Classification of the selected code metrics according to “How They Measure”

Table 3 shows metric data corresponding to five revisions (commits) of a project. In this table, each row represents a software revision. For the project malmo, we analyzed 295 revisions; thus, we have 295 data rows from this project with the same structure as Table 3. Data used for this study is measured at the project level. For example, for ncloc, all lines of Java code in the entire project are counted; for classes, all classes within the scope of the project are counted. In Table 3, the value of ncloc for the whole project is 9573 for revision 88. In the next revision (89), ncloc becomes 9590, an increase of 17 lines of code. However, the corresponding new_lines metric for this revision is 25, indicating the actual number of lines of code added in this revision, disregarding possible changes or deletions of code. All the metrics are calculated by SonarQube according to the descriptions in Table 2. Among the four categories, density and average metrics are derived metrics, meaning they are computed from the cumulative metrics. Equations for constructing the density and average metrics are given in Table 2.

Table 3 Metric data representing five revisions of a project

The SQL code to instrument the SonarQube database, the Python code to automatically analyze the commits of a Git project with SonarQube, the MySQL code to retrieve the desired data from the database, and the collected data used for this study are published as a public dataset.

3.3 Exploring Nature of Data

Among the various probability distributions, the normal distribution is the one most anticipated by researchers due to its relationship with the central limit theorem. Statistical methods are based on assumptions; after collecting data and before deciding on the type of statistical methods, researchers need to investigate the nature of the collected data. A crucial part of this is checking the distribution of the data. If the distribution is normal, parametric statistical tests are considered. If the data is non-normal and a meaningful transformation is not possible, non-parametric tests are considered.

There is no straightforward way of determining whether particular data is normally distributed. Sample size plays a significant role in statistical tests for normality. There are graphical and numerical methods, each of which can be either descriptive or theoretical (Park 2009). Since our selected projects have varying numbers of revisions, we have chosen a combination of tests appropriate to our data, as shown in Table 4.

Table 4 Considered methods for normality test

A histogram, a frequency distribution, is considered a useful graphical test when the number of observations is large. It is particularly helpful because it can capture the shape of the distribution, provided the bin size is appropriately chosen. If the data is far from a normal distribution, a single look at the histogram tells us that the data is non-normal. It also gives a rough understanding of the overall nature of the distribution, such as skewness, kurtosis, or the type of distribution (e.g., bi- or multi-modal). Box plots are useful for identifying outliers and comparing the distribution of the quartiles. A normal Q-Q plot (quantile-quantile plot) is a graphical comparison of two distributions; if the data is normally distributed, the data points in the normal Q-Q plot approximately follow a straight line. It also helps us understand the skewness and tails of the distribution. The graphical methods help us understand the overall nature of the data quickly, but they do not provide objective criteria (Park 2009).
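For readers who want to reproduce these three graphical checks, a rough sketch using matplotlib and scipy could look as follows; the study itself used SPSS, and the function name and bin count here are illustrative.

```python
# Sketch: histogram, box plot, and normal Q-Q plot for one metric column.
import matplotlib.pyplot as plt
from scipy import stats

def distribution_plots(values, metric_name):
    fig, (ax_hist, ax_box, ax_qq) = plt.subplots(1, 3, figsize=(12, 4))
    ax_hist.hist(values, bins=30)                     # frequency distribution
    ax_hist.set_title(f"Histogram of {metric_name}")
    ax_box.boxplot(values)                            # quartiles and outliers
    ax_box.set_title(f"Box plot of {metric_name}")
    stats.probplot(values, dist="norm", plot=ax_qq)   # normal Q-Q plot
    ax_qq.set_title(f"Normal Q-Q plot of {metric_name}")
    fig.tight_layout()
    plt.show()
```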

There are different numerical methods to evaluate whether data is normally distributed. Skewness and kurtosis are commonly used descriptive tools for this purpose. For a perfectly normal distribution, the statistics for these two analyses should be zero. Since this does not happen in practice, we calculate a z-score by dividing the statistic by its standard error. However, determining normality from the z-score is not straightforward either. Kim (2013) discussed how the sample size can affect the z-score. Field (2009) and Kim (2013) suggested considering different criteria for skewness and kurtosis based on the sample size: for a sample size below 50, the absolute z-score for either statistic should not exceed 1.96 (corresponding to an α of 0.05); for a sample size below 200, it should not exceed 3.29 (corresponding to an α of 0.001). For a sample size of 200 or more, it is more meaningful to inspect the graphical methods and the skewness and kurtosis statistics themselves instead of evaluating their significance, i.e., the z-score.

From the analytical numerical methods, we consider Shapiro-Wilk and Kolmogorov-Smirnov for the normality test. Shapiro-Wilk works better with sample sizes up to 2000, and Kolmogorov-Smirnov works with large sample sizes (Park 2009). The literature reports different maximum sample sizes for these tests: for example, Kim (2013) observed that sample sizes over 300 might produce unreliable results for these two tests, and Hair (2006) suggested a range of 30 to 1000. Our metrics have sample sizes from 52 to 2302. For large sample sizes, numerical methods can be unreliable; thus, Field (2009) suggested using the graphical methods alongside the numerical methods to make an informed decision about the nature of the data. We computed all these tests in the statistical software package SPSS. Note that SPSS reports excess kurtosis, i.e., it subtracts three from the kurtosis value.
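The following minimal sketch (scipy assumed; the study used SPSS) shows the corresponding numerical checks for one metric column. The standard-error approximations sqrt(6/n) and sqrt(24/n) are the usual large-sample formulas, and the Kolmogorov-Smirnov call tests against a standard normal, which only loosely approximates SPSS’s Lilliefors-corrected procedure.

```python
# Sketch: numerical normality checks for one metric column.
import numpy as np
from scipy import stats

def normality_summary(values):
    x = np.asarray(values, dtype=float)
    n = len(x)
    skew_z = stats.skew(x) / np.sqrt(6.0 / n)        # skewness z-score (approx.)
    kurt_z = stats.kurtosis(x) / np.sqrt(24.0 / n)   # excess kurtosis ("kurtosis - 3")
    sw_stat, sw_p = stats.shapiro(x)                 # Shapiro-Wilk
    ks_stat, ks_p = stats.kstest(stats.zscore(x), "norm")  # Kolmogorov-Smirnov
    return {"skew_z": skew_z, "kurtosis_z": kurt_z,
            "shapiro_p": sw_p, "ks_p": ks_p}
```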

We now present data for four metrics (ncloc, comment_lines_density, new_lines, and file_complexity), each from a different measurement category, from the Apache zookeeper project. Even though the measures for these metrics vary per project, we present them here so the reader has a rough idea of what sample data from a project looks like. Table 5 shows the descriptive statistics for these four metrics. The sample size is the same (1473) for all of them, and the minimum, maximum, mean, and standard deviation values come from the whole sample. For example, among the 1473 samples of new_lines, the minimum value is 0 (a revision that adds no new lines of code) and the maximum value is 19,055 (a revision that adds 19,055 new lines of code). Based on these statistics, it is evident that ncloc is entirely different from new_lines. On the other hand, the two metrics from the density and average categories are quite similar to each other but different from the cumulative and organic metrics. The most notable number in this table is the very large standard deviation of 18,629.9 for ncloc. The cumulative way of measurement and the large sample size are the key reasons for such a high standard deviation; the other cumulative metrics in our dataset show a similar effect of cumulation. These descriptive statistics give us a quick overall idea about the nature of the data, e.g., its distribution and dispersion.

Table 5 Descriptive statistics for four example metrics from the Apache zookeeper project

The histogram, box plot, and normal Q-Q plot for each of these four metrics are presented in Figs. 2, 3, 4, and 5. The organic metric new_lines (see Fig. 5) has a very high peak (in the histogram), a considerable number of outliers (in the box plot), and a positively skewed plot (normal Q-Q). The other metrics from the organic category have similar properties. Because of the very high peak in the distribution, the quartiles in the box plot are not even visible. The file_complexity metric of type average (see Fig. 4) has a bimodal distribution in the histogram. The outliers in the box plot and the disconnected observed values in the normal Q-Q plot are due to the bimodal distribution of the file_complexity metric. It is interesting to see in the Q-Q plot that both distributions of file_complexity are skewed in opposite directions. Many metrics in the average and density categories, and even in the cumulative category, have bimodal distributions, and some have multimodal distributions. The distributions of the average and density metrics appear more normal than the distributions of the cumulative metrics. However, all of them are still far from a normal distribution. Comparing them all, the organic metrics are distinctly different from the metrics of the other categories. When specific types of plots are compared, e.g., all histograms of the metrics from one category, the plots of the organic metrics are visually more consistent with each other than the plots of metrics from other categories. On the other hand, when plots are compared across categories, the distributions of metrics from the density and average categories are observed to be more similar to each other than the others.

Fig. 2: Histogram, box plot and normal Q-Q plot for the ncloc metric (cumulative type) from the Apache zookeeper project

Fig. 3: Histogram, box plot and normal Q-Q plot for the comment_lines_density metric (density type) from the Apache zookeeper project

Fig. 4: Histogram, box plot and normal Q-Q plot for the file_complexity metric (average type) from the Apache zookeeper project

Fig. 5: Histogram, box plot and normal Q-Q plot for the new_lines metric (organic type) from the Apache zookeeper project

The distributions and their properties depicted by the graphical methods deviate so much from normality that we could have omitted additional tests. However, we performed them as part of the study design. Since our sample sizes vary, we gained some insights about the tests themselves; such observations are, however, beyond the scope and focus of this study and are not reported here.

The skewness and kurtosis results are presented in Table 6. Interestingly, even though we have a considerable sample size, none of the metrics indicate a normal distribution based on the z-scores in this table: the data is non-normal because the absolute z-scores are greater than 3.29, thus rejecting the null hypothesis. By contrast, ncloc and file_complexity from the Apache kafka project, which has the largest sample size (2302), have z-scores within the limits of normality; we nevertheless reject normality for them because the graphical methods show no signs of a normal distribution.

Table 6 Skewness and Kurtosis check for four example metrics from the Apache zookeeper project

We consider the Shapiro-Wilk and Kolmogorov-Smirnov tests reliable up to a sample size of 2000. Results from these tests are shown in Table 7, giving a very strong indication of non-normal data (the significance values are less than the considered α = 0.05).

Table 7 Shapiro-Wilk and Kolmogorov-Smirnov tests for four example metrics from the Apache zookeeper project

Like these four metrics from the Apache zookeeper project, the other metrics from these measurement categories and from the other projects are also observed to be non-normal. The high degree of non-normality and the differing distributions among the metrics make it impractical to transform these metrics. Thus, we are left with the option of applying non-parametric statistical analysis methods.

3.4 Data Processing

Besides the statistical analysis to understand the nature of the extracted data, we performed manual inspections to check the data for possible anomalies, especially at the boundaries, i.e., at the beginning and the end of the revision data.

We observed that, for some projects, some organic metrics have either NaN (not a number) or unusually high values for the first analyzed revision. Examples of such observations are presented in Tables 8 and 9.

Table 8 Data snippet from the first revision of project astyanax showing null value for new_lines
Table 9 Data snippet from the first three revisions of project ribbon showing a high value for new_lines

There may be several reasons why some projects start with a high ncloc from the beginning, as in the case in Table 9. It can be due to a project not tracking its code base in version control from the beginning, or the project starting with an existing code base, possibly because it is an extension of another project with separate version control. We also observed that new_lines has a higher value than ncloc in the first revision of a few projects. We removed the data of the first revision in such cases. Since the number of removed revisions is very small compared to the total number of revisions, it is unlikely that this has a major impact on this study.
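A minimal sketch of this boundary cleaning is shown below, assuming each project’s revisions are loaded into a pandas DataFrame ordered by commit time; the column names follow the metric names used above, and the function name is hypothetical.

```python
# Sketch: drop a project's first revision when its organic metrics are
# missing or implausible (e.g., new_lines larger than ncloc).
import pandas as pd

def clean_first_revision(df):
    """df: one project's revisions, ordered by commit time."""
    first = df.iloc[0]
    if pd.isna(first["new_lines"]) or first["new_lines"] > first["ncloc"]:
        return df.iloc[1:]
    return df
```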

4 Data Analysis Method

Since the collected data is not normally distributed, non-parametric statistical methods are appropriate for this research. Spearman’s ρ correlation coefficient and Kendall’s τ correlation coefficient are two well-known non-parametric measures for assessing relationships between ranked data. We carefully investigated the nature of the collected data and the properties of these two measures. Kendall’s τ is based on the number of concordant and discordant pairs of the ranked data.

Calculating Spearman’s ρ by hand is much simpler than calculating Kendall’s τ because it can be done pair-wise without depending on the rest of the data, while computing Kendall’s τ by hand is only feasible for small samples because each data pair requires exploring the remaining data. Xu et al. (2013) reported that the time complexity of Spearman’s ρ is O(n log n) and that of Kendall’s τ is O(n²). In general, Spearman’s ρ results in a higher correlation coefficient than Kendall’s τ, while the latter is generally known to be more robust and to have a more intuitive interpretation.

Studies from different disciplines have investigated the appropriateness of Spearman’s ρ and Kendall’s τ with respect to various factors. Both measures are invariant under increasing monotone transformations (Kendall and Gibbons 1990). Moreover, being non-parametric, both measures are robust against impulsive noise (Shevlyakov and Vilchevski 2002; Croux and Dehon 2010). Croux and Dehon (2010) studied the robustness of Spearman’s ρ and Kendall’s τ through their influence functions and gross-error sensitivities. Even though it is commonly known that both measures are robust enough to handle outliers, they found that Kendall’s τ is more robust to outliers and statistically slightly more efficient than Spearman’s ρ. In a more recent study, Xu et al. (2013) investigated the applicability of Spearman’s ρ and Kendall’s τ under different requirements. Among their key findings, Kendall’s τ is the preferred measure when the sample size is large and there is impulsive noise in the data. Their results are based on unbiased estimations of the Spearman’s ρ and Kendall’s τ correlation coefficients.

In light of the above discussion, we consider Kendall’s τ more appropriate for software projects. We have observed outliers in many projects, and all projects have some revisions with high values indicating outliers. This can naturally happen to any software project, e.g., when an existing code base is added to a new project. Outliers in software revision data, and their underlying reasons, have been discussed in earlier research (Aggarwal 2013; Schroeder et al. 2016). The robust nature of Kendall’s τ handles such data points better than Spearman’s ρ.

Kendall’s τ has three versions: τ-A, τ-B, and τ-C. Both τ-A and τ-B are suitable for square-shaped data, i.e., data with the same variables on both rows and columns, while τ-C is used for rectangular data tables with different numbers of rows and columns. A more important difference between τ-A and τ-B is that τ-B can handle tied ranks while otherwise having the same characteristics as τ-A. Since our collected data has tied ranks, we have used τ-B, computed according to the following equation:

$$ \tau_{b}=\frac{P-Q}{\sqrt{(P+Q+ m1_{0})(P+Q+m2_{0})}} $$

In this equation, m1 and m2 are the two metrics whose correlation we are checking; we denote Kendall’s τ correlation coefficient as τb. P and Q are the numbers of concordant and discordant pairs, respectively. \(m1_{0}\) is the number of ties only in m1, and \(m2_{0}\) is the number of ties only in m2. The possible values of τb satisfy − 1 ≤ τb ≤ + 1, where -1 indicates a perfect negative correlation, zero indicates no correlation, and + 1 indicates a perfect positive correlation. Since we have 21 projects and 24 metrics per project, we calculated 21 correlation matrices of size 24×24. All these correlation matrices are symmetric, i.e., mirrored along the principal diagonal, and the τb for the diagonal elements are always + 1, indicating a perfect positive correlation of a metric with itself. Kendall’s τ also yields a statistical significance (p-value) for each correlation coefficient (τb).
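As a concrete illustration, a per-project τ-B correlation matrix together with the p-values can be computed with scipy’s kendalltau, which implements τ-B and handles ties; this sketch is illustrative rather than the study’s actual code.

```python
# Sketch: 24x24 Kendall tau-B and p-value matrices for one project's
# revision data (one row per revision, one column per metric).
import pandas as pd
from scipy.stats import kendalltau

def correlation_matrices(df, metrics):
    tau = pd.DataFrame(index=metrics, columns=metrics, dtype=float)
    pval = pd.DataFrame(index=metrics, columns=metrics, dtype=float)
    for m1 in metrics:
        for m2 in metrics:
            # tau-B with tie handling; constant columns yield NaN (see S_NaN).
            t, p = kendalltau(df[m1], df[m2])
            tau.loc[m1, m2], pval.loc[m1, m2] = t, p
    return tau, pval
```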

4.1 Landscape of Correlation Coefficients (τb)

Before aggregating results, we need to understand τb and define concrete boundaries for interpreting τb at different levels. τb indicates the strength of a correlation and has the range − 1 ≤ τb ≤ + 1. As the value of τb approaches 0, it indicates less correlation between two metrics; as it approaches the boundaries, i.e., -1 or + 1, it indicates a higher correlation. Researchers have used different ranges to label the strengths of correlation coefficients (Taylor 1990). This paper labels τb according to Table 10.

Table 10 Grouping and labeling τb

However, it is not enough to look at the τb values alone; we must also consider their statistical significance, given by the p-value. If the p-value is greater than a chosen significance level, we cannot reject the null hypothesis, meaning we do not have enough evidence to claim that the corresponding τb differs from τb = 0. For example, with a significance level of 0.05, up to 5% of the τb values could indicate a correlation by chance even though there is no real correlation between the underlying metrics. This study considers α = 0.05 for a τb to be statistically significant.
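The following sketch shows how each τb can be labeled by significance and strength. The strength boundaries used here are illustrative placeholders, since the exact cut-offs are those given in Table 10.

```python
# Sketch: classify a single tau-B by significance and strength.
import math

def label_tau(tau, p, alpha=0.05):
    if math.isnan(tau) or math.isnan(p):
        return "missing"            # belongs to the S_NaN set
    if p > alpha:
        return "non-significant"    # belongs to S_tau_b_nsig
    t = abs(tau)
    if t >= 0.8:                    # placeholder boundaries, not Table 10
        return "very strong"
    if t >= 0.6:
        return "strong"
    if t >= 0.4:
        return "moderate"
    return "weak"
```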

We now focus on the landscape of τb depicted in Fig. 6. The outer circle is the set of all τb, represented by \(S\tau _{b_{all}}\), containing the τb from the correlation matrices of all projects. Similarly, the set of all significant τb is represented by \(S\tau _{b_{sig}}\), containing all τb for which the corresponding p-value satisfies the specified significance level of 0.05. The set of very strong τb is represented by \(S\tau _{b_{vs}}\), the set of strong τb by \(S\tau _{b_{s}}\), the set of moderate τb by \(S\tau _{b_{m}}\), and the set of weak τb by \(S\tau _{b_{w}}\). These sets are determined from the τb values according to the levels defined in Table 10. We can also define the set of all non-significant τb as \(S\tau _{b_{nsig}}\), where \(S\tau _{b_{nsig}}\) = \(S\tau _{b_{all}} \setminus S\tau _{b_{sig}}\).

Fig. 6: Sets of correlation coefficients (τb). Text inside the circles indicates measures computed from τb

4.2 Aggregating Correlation Coefficients (τb)

Now that we have 21 correlation matrices of size 24×24, we want to aggregate meaningful data from them. Let M = {m1, m2, m3, ..., mn} be the set of all metrics, where n is the total number of metrics, and let P = {p1, p2, p3, ..., pq} be the set of all projects, where q is the total number of projects.

Now we discuss several ways to aggregate τb from all the projects.

  • Aggregating correlation coefficients (τb) based on the set of all correlation coefficients (\(S\tau _{b_{all}}\))

    The simplest way of aggregating τb is by summing up the τb values for each possible pair of metrics within M over each project within P. This results in a correlation matrix of size 24×24, i.e., the same size as a correlation matrix from a single project. The advantage of this method is that we can compute a single aggregated matrix where each cell is the sum of the τb values from the corresponding cells of the correlation matrices generated from the projects. Such an aggregated matrix can also be transformed into a weighted-average matrix by dividing each cell (containing the sum of all τb for a particular pair of metrics) by q.

    However, there is a fundamental problem with this method: it combines all τb values irrespective of their corresponding p-values. If the p-value is greater than α, we cannot reject the null hypothesis; this means we only consider a τb valid when its corresponding p-value is less than or equal to α. If this is not checked, there are two implications. First, the result will be distorted by the inclusion of non-significant τb, i.e., τb from the set \(S\tau _{b_{nsig}}\); second, we will not know how big the set \(S\tau _{b_{nsig}}\) is compared to \(S\tau _{b_{sig}}\).

    Aggregating results based on \(S\tau _{b_{all}}\) is unproblematic only when \(S\tau _{b_{all}} = S\tau _{b_{sig}}\), i.e., when there is no τb with a non-significant p-value; if this is assumed, the assumption should be checked for validity. For example, the recent research on the correlation of code metrics by Gil and Lalouche (2017) does not report anything about considering α, i.e., a significance level for the p-value, and thus nothing about the existence of τb with non-significant p-values; readers therefore cannot know the size of \(S\tau _{b_{nsig}}\). If the non-existence of non-significant p-values is assumed, that assumption should be documented and validated. In this research, to aggregate results, we have avoided any value from the set \(S\tau _{b_{nsig}}\); to calculate the sample mean, τb values from \(S\tau _{b_{nsig}}\) are treated as zero.

  • Aggregating correlation coefficients (τb) based on the set of all significant correlation coefficients (\(S\tau _{b_{sig}}\))

    Let \(\tau _{b_{(m_{1}, m_{2}, p_{i})}}\) be the correlation coefficient for two metrics m1 and m2 in project pi. We compute the mean over \(S\tau _{b_{sig}}\) with the following equation:

    $$ \overline{\tau}_{b_{(m_{1}, m_{2})}} = \frac{1}{\nu} \sum\limits_{i=1}^{q} \tau_{b_{(m_{1}, m_{2}, p_{i})}} \;\Big|\; \tau_{b_{(m_{1}, m_{2}, p_{i})}} \in S\tau_{b_{sig}} $$
    (1)

    Since (1) is calculated based on \(S\tau _{b_{sig}}\), ν is the number of \(\tau _{b_{(m_{1}, m_{2}, p_{i})}}\) values from all projects that belong to \(S\tau _{b_{sig}}\). To compute the sample mean, we only have to replace ν by q in (1). Since we are unsure of the τb values within the set \(S\tau _{b_{nsig}}\) due to their non-significant p-values, as a conservative measure we take τb from \(S\tau _{b_{nsig}}\) as zero; thus, the \(\tau _{b_{(m_{1}, m_{2}, p_{i})}} \in S\tau _{b_{sig}}\) condition in (1) still holds for the sample mean.

    As with \(S\tau _{b_{all}}\), the advantage of aggregating τb from the set \(S\tau _{b_{sig}}\) is that we can report the results in a single matrix, without the problems of the \(S\tau _{b_{all}}\)-based aggregation mentioned above. However, from such results we cannot determine whether \(\overline {\tau }_{b_{(m_{1}, m_{2})}}\) comes from a large or a small value of ν; in other words, we cannot tell how representative \(\overline {\tau }_{b_{(m_{1}, m_{2})}}\) is among the selected projects. Since we can calculate the sample mean from \(S\tau _{b_{sig}}\), we can also compute the variance and standard deviation to see the overall spread of τb within the samples. However, the sample mean and standard deviation of τb from \(S\tau _{b_{sig}}\) do not necessarily tell us about the distribution of τb within the four strength levels in Table 10. The standard deviation gives an indication of the variability, but we have no way of knowing what the variability looks like across the different strength levels of τb.

  • Aggregating correlation coefficients (τb) based on the sets \(S\tau _{b_{vs}}\), \(S\tau _{b_{s}}\), \(S\tau _{b_{m}}\), \(S\tau _{b_{w}}\)

    These four sets represent τb based on their strengths according to Table 10, and together they make up \(S\tau _{b_{sig}}\), i.e., \(S\tau _{b_{vs}} \cup S\tau _{b_{s}} \cup S\tau _{b_{m}} \cup S\tau _{b_{w}} = S\tau _{b_{sig}}\). We consider two measures based on these sets, as listed below.

    • Count of τb based on level of strength: We get this measure by simply counting the number of τb within a set. This simple measure gives a direct answer to the question, “How many projects report a correlation between two metrics at a certain level of strength?” Since we have 21 projects, the maximum value of count of τb is 21 and the minimum is zero. Looking at the values of count of τb for two metrics across the sets \(S\tau _{b_{nsig}}\), \(S\tau _{b_{vs}}\), \(S\tau _{b_{s}}\), \(S\tau _{b_{m}}\), and \(S\tau _{b_{w}}\), we can understand the distribution of τb among the different sets and thereby better understand the nature of the relationship between the metrics.

    • Sum of τb based on level of strength: Instead of counting, this measure adds all τb within a set. Since the highest value of a single τb is 1.0, the maximum value of sum of τb is also 21, as for count of τb. However, sum of τb does not tell us how many τb values contribute to the sum, which we can get from count of τb. Thus, the two measures complement each other.

    These two proposed measures, count of τb and sum of τb, provide additional insights about the correlation coefficients compared to the common statistical measures of sample mean and standard deviation. In this report, when we use the term ‘mean,’ we mean the mean for a certain set, and when data from all sets is considered, we write ‘sample mean.’ For example, when we say ‘correlation between metrics m1 and m2 results in τb’ or ‘correlation between metrics m1 and m2 results in a very strong/strong/moderate/weak τb,’ we refer to τb from the sample mean table. (A small code sketch after this list illustrates these aggregation measures.)
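The sketch below illustrates the aggregation measures defined above for a single metric pair across all projects (numpy assumed). Non-significant τb values are treated as zero in the sample mean, as in the text; the 0.6 strength boundary is an illustrative placeholder, not the cut-off of Table 10.

```python
# Sketch: aggregate the tau-B values of one metric pair over all projects.
import numpy as np

def aggregate_pair(taus, pvals, alpha=0.05, strong=0.6):
    taus = np.asarray(taus, dtype=float)
    pvals = np.asarray(pvals, dtype=float)
    sig = ~np.isnan(pvals) & (pvals <= alpha)          # membership in S_sig
    mean_sig = taus[sig].mean() if sig.any() else float("nan")
    sample_mean = np.where(sig, taus, 0.0).mean()      # non-significant -> 0
    strong_vals = taus[sig][np.abs(taus[sig]) >= strong]
    return {"mean_over_sig": mean_sig,
            "sample_mean": sample_mean,
            "count_strong": int(strong_vals.size),     # "count of tau_b"
            "sum_strong": float(strong_vals.sum())}    # "sum of tau_b"
```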

4.3 Missing Correlation Coefficients (τb)

Correlation cannot be computed if either or both of the variables have constant or missing values. In such a case, the computation returns NaN (not a number) for both τb and the p-value. The set SNaN contains such data. SNaN is kept disjoint from \(S\tau _{b_{all}}\) in Fig. 6 because both the τb value needed to determine the level of strength and the p-value needed to determine significance are missing. Even though SNaN does not help us with correlation, it tells us something about the nature of specific metrics.

When discussing and reporting results in the following section, we have to point to different sections of the correlation matrices or of matrices derived from them. To simplify referencing, we label the different sections of such matrices as shown in Table 11.

Table 11 Labeling different sections of our matrices for easy referencing

When reporting this case study, we have tried to maintain the actual flow of how the research was carried out. Based on some interesting observations about the inter-category correlations between the 15 cumulative and three organic metrics, this case study further tested the following hypothesis:

“The median difference between correlations of metrics from the cumulative and organic_t categories equals zero.”

This required deriving a set of 15 organic metrics, denoted organic_t, corresponding to the 15 cumulative metrics. The specifics of designing and performing the test, and its results, are elaborated in Section 5.4. A small sketch of one plausible derivation is shown below.
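For illustration only, one plausible way to derive such organic_t metrics is to take per-revision differences of the cumulative metrics; whether this matches the exact transformation described in Section 5.4 is an assumption of this sketch, and the function and suffix names are hypothetical.

```python
# Sketch: derive organic_t counterparts of cumulative metrics by
# first-differencing consecutive revisions of one project.
import pandas as pd

def to_organic(df, cumulative_metrics):
    organic = df[cumulative_metrics].diff()           # delta to previous revision
    organic.iloc[0] = df[cumulative_metrics].iloc[0]  # first revision: its own value
    return organic.add_suffix("_organic_t")
```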

5 Results

Before going into the results and discussion regarding the significant correlation coefficients (i.e., τb with significant p-values) from the sets within \(S\tau _{b_{sig}}\), we report what the correlation coefficients outside \(S\tau _{b_{sig}}\) look like, to illustrate the data that does not contribute to the main result.

5.1 Missing and Non-significant Correlation Coefficients (τb)

First, we look at the small set SNaN in Fig. 6, reporting the cases where computing a correlation was not successful and resulted in null values (i.e., NaN) for τb and the p-value for specific pairs of metrics from M, as reported in Table 19. The missing τb related to the metric directories seen in this table come from two projects, malmo and geometry-api-java. In both projects, the metric directories has a value (8 for malmo and 3 for geometry-api-java) that remained unchanged throughout the project. The rest of the missing τb come from the project cloud, and all six metrics involved belong to the duplications category: duplicated_lines, duplicated_blocks, duplicated_files, duplicated_lines_density, new_duplicated_lines, and new_duplicated_blocks. Since the cloud project has no duplication-related issues during the analyzed revisions, all six metrics have the value 0.

If at least one metric of a pair has a constant value (i.e., the same value for all revisions), it is not possible to compute a correlation for that pair, and this results in NaN values for τb and the p-value. We observed that the directories metric and the duplications-related metrics have fewer levels (i.e., less variation) than the other metrics.

Now, we move to the set \(S\tau _{b_{nsig}}\), reported in Table 17. The horizontal bars in this table, and in all other similar tables reporting the count and sum measures, are graphical representations of the corresponding cell values. The maximum value for a cell corresponding to two metrics (e.g., the cell between ncloc and classes) is q, which is 21, the number of projects. Cells in the principal diagonal are omitted and thus kept blank. The rightmost column, reporting the ‘total count’ or ‘total sum’ of a row, has a maximum possible value of 483 (calculated as (n − 1) × q = 23 × 21). The horizontal bars are drawn relative to the maximum possible value a cell can contain (i.e., 21 for regular cells between metrics and 483 for cells indicating ‘total’), not relative to the maximum value present in a table.

A first glance at Table 17 gives the impression that the organic category is different from all other categories. A more careful look shows three groups: cumulative and organic have the lowest and highest counts of non-significant τb values, respectively, and density and average have counts of non-significant τb values in between. This can also be seen from the rightmost column, ‘total count’. Since ‘total count’ combines all four metric categories, we show the category-wise means of Table 17 in Table 12.

We are particularly interested in the numbers on the principal diagonal of Table 12, which represent measures coming from correlations between metrics within a category, i.e., intra-category correlations. Section OO has the least non-significant τb, followed by sections CC, AA, and DD. The metrics within the organic category (section OO) are interesting: they have the least non-significant τb among themselves, yet organic scores highest when correlated with other categories. Another observation is that the mean ‘count of non-significant τb’ within a category is always smaller than the mean ‘count of non-significant τb’ between categories. Since we cannot conclude whether a τb is strong or weak when its p-value is non-significant, it is better to have fewer non-significant values in the context of statistical analysis.

Table 12 Category-wise mean of count of non-significant correlation coefficients (τb) from Table 17

Key Observations:

  • Correlating metrics from different categories results in more non-significant correlation coefficients compared to correlating metrics within a category.

  • Category organic is quite different from the other three categories, producing many non-significant τb for inter-category correlations.

  • Category organic has the least non-significant τb, followed by cumulative, average, and density, for intra-category correlations. Even though the mean value for the cumulative category is quite low (0.27) in this context, the contrast between its inter- and intra-category values is clearly less noticeable than for the organic category.

5.2 Overall Distribution of Correlation Coefficients (τb)

Based on the labeling of τb in Table 10, we have reported two sets of measurements: ‘count of τb’ (Tables 23, 24, 25, and 26) and ‘sum of τb’ (Tables 27, 28, 29, and 30) at the different levels.

Figure 7 is constructed from the rightmost column ‘total count’ of Tables 17, 19, 23, 24, 25, and 26. Similarly, Fig. 8 is constructed from the absolute values of the rightmost column ‘total sum’ of Tables 27, 28, 29, and 30. Since the ‘total count’ and ‘total sum’ columns in these tables are counts and sums of a metric’s correlation measures with respect to all other metrics, these columns give an overall distribution of τb for the metrics.

Fig. 7: The overall distribution of count of τb (correlation coefficients) at different levels with respect to all metrics

Fig. 8: The overall distribution of sum of τb (correlation coefficients) at different levels with respect to all metrics

Both Figs. 7 and 8 use the same vertical scale for easier comparison. Between the two figures, the level 'very strong' shows the smallest difference and the level 'weak' the largest difference among the four strength levels of τb. While Fig. 7 gives an overview of how the different sets of τb are represented, Fig. 8 shows the actual sums of τb. Since the τb value is meaningless for \(S\tau _{b_{nsig}}\) and missing for \(S\tau _{b_{NaN}}\), these sets are not included in Fig. 8.

In Fig. 8, we can see how the red bars, representing 'very strong' τb within the cumulative metrics (metrics ncloc to duplicated_files), dominate over the other levels. Metrics directories, duplicated_lines, duplicated_blocks, and duplicated_files are comparatively weaker in terms of 'very strong' τb than the other cumulative metrics. However, all cumulative metrics score much higher than metrics from the other measurement categories. Metrics from the organic category score lowest among all metrics and categories. These two figures represent data from all correlations, e.g., the horizontal bar for ncloc comes from the correlation coefficients of ncloc with all other metrics. Thus, from these figures, we cannot determine whether there is any difference between τb values resulting from intra-category and inter-category metric correlations. We study this in more detail next.

5.3 Significant Correlation Coefficients

All τb reported in this subsection passed the significance test at α = 0.05; we do not mention this again in the rest of the subsection. When reporting τb, by default, we refer to the sample mean as reported in Table 13.

Table 13 Sample mean of significant correlation coefficients (τb) from the set \(S\tau _{b_{all}}\) (set of all τb). The three gray-scale cell colors indicate three levels of τb. Cells with red text indicate weak τb

First, we want to look at intra-category relations among the metrics. Table 13 shows the sample mean of τb corresponding to the set \(S\tau _{b_{all}}\), and Table 21 in the Appendix shows the mean of τb within the \(S\tau _{b_{sig}}\). Here, we want to reiterate that we have considered any non-significant τb from the set \(S\tau _{b_{all}}\) as 0.
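As an illustration of how the sample mean in Table 13 is obtained from the set \(S\tau _{b_{all}}\), the following sketch computes τb per project with SciPy (whose kendalltau implements the τb variant), replaces non-significant coefficients with 0, and averages over the projects. It assumes each project's revision history is available as a pandas DataFrame with one column per metric; the function names are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import kendalltau

ALPHA = 0.05

def project_tau(df: pd.DataFrame, m1: str, m2: str) -> float:
    """tau_b between two metric columns of one project's revision history;
    non-significant coefficients are treated as 0, as in S_tau_b_all."""
    tau, p = kendalltau(df[m1], df[m2])   # tau-b is scipy's default variant
    if np.isnan(tau):                     # missing tau_b (e.g., a constant column)
        return np.nan
    return tau if p < ALPHA else 0.0

def sample_mean_tau(projects: list, m1: str, m2: str) -> float:
    """Sample mean of tau_b over all 21 projects (missing values excluded)."""
    taus = [project_tau(df, m1, m2) for df in projects]
    return float(np.nanmean(taus))
```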

Perfect Correlations

We are interested in the perfect correlations (i.e., τb = 1.0) reported in Tables 13 and 21 between the metric pairs (complexity, complexity_in_classes), (complexity, complexity_in_functions), and (complexity_in_classes, complexity_in_functions), since this indicates that the two metrics within each pair measure exactly the same aspect. It should be noted that the τb values for these relations are not necessarily exactly 1 even though Tables 13 and 21 report them as 1: we report τb to two decimal places, so anything greater than or equal to 0.995 is shown as 1. To understand these relations, we counted the perfect correlation coefficients between all metrics over all projects, reported in Table 20, where we see that five relations have perfect τb. In 20 projects the correlation between complexity and complexity_in_classes is perfect. It is thus evident that the metrics complexity and complexity_in_classes measure the same aspect and using one of them is sufficient. Since our data comes from Java source code, this is not a surprise because, in Java, code does not reside outside classes.

For the relation between complexity and complexity_in_functions, perfect correlations exist in nine projects. Since we found a perfect correlation in the sample mean of τb in Table 13, the remaining 12 τb values for this relation must be very strong. We have also calculated the 'sum of significant τb' reported in Table 22. The 'sum of significant τb' for (complexity, complexity_in_functions) is 20.98 out of 21. This indicates that complexity and complexity_in_functions also measure the same aspect, with a negligible difference.

Even though the results in Table 20 do not show any perfect correlations between the metrics ncloc, functions, statements, complexity, classes, files, and public_api, we see τb greater than 0.9, i.e., at the very strong level, for all these relations in the sample mean in Table 13. Considering any relation with τb greater than or equal to 0.9 as redundant, we consider all these seven measures from the cumulative category redundant. Under the same consideration, public_api is redundant to public_undocumented_api.
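The redundancy criterion above can be expressed mechanically. The sketch below, assuming a symmetric pandas DataFrame of sample-mean τb values, flags every metric pair at or above the 0.9 threshold; it is an illustration of the criterion, not the exact script used in this study.

```python
import pandas as pd

def redundant_pairs(mean_tau: pd.DataFrame, threshold: float = 0.9):
    """Flag metric pairs whose sample-mean tau_b meets the redundancy
    threshold of 0.9 used in this study.  mean_tau is a symmetric
    DataFrame of sample-mean tau_b values (metrics x metrics)."""
    pairs = []
    metrics = list(mean_tau.columns)
    for i, a in enumerate(metrics):
        for b in metrics[i + 1:]:
            if mean_tau.loc[a, b] >= threshold:
                pairs.append((a, b, round(float(mean_tau.loc[a, b]), 2)))
    return pairs
```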

For the relations between new_duplicated_lines and new_duplicated_blocks and between duplicated_blocks and duplicated_files, perfect correlations are found in only one project each. The 'sum of significant τb' for these two pairs is reported in Table 22 as 19.65 and 17.87, respectively. Thus, we cannot say right away that the metrics within these two pairs are redundant.

5.3.1 Intra-Category Correlations

Intra-category correlations are the correlations within the sections CC, DD, AA, and OO, where all correlations within a section come from metrics measured in the same way.

Section CC

The sample mean of \(S\tau _{b_{all}}\) in Table 13 shows that metrics within the cumulative category (i.e., section CC) are very highly correlated with each other, which we can also see from the category-wise mean of the sample mean values in Table 14. Section CC has a mean τb of 0.79, which indicates a strong correlation between any two cumulative metrics for any project. For a more detailed picture, we look at the sample mean values in Table 13, where the metrics mainly divide into three parts based on the strength of their correlations. The first part consists of the nine redundant metrics (from ncloc to public_api) discussed earlier: due to 'very strong' τb, any of these metrics can explain the variability of the others to a high degree. The second part has three metrics, public_undocumented_api, comment_lines, and directories, that are mostly at the 'strong' τb level, except that public_undocumented_api is redundant to public_api (as discussed in the preceding section) and the correlation between public_undocumented_api and directories is 'medium'. The third part has duplicated_lines, duplicated_blocks, and duplicated_files, which are all 'strongly' correlated to each other but correlated only at a 'medium' level with the metrics from the other two parts.

Table 14 Category-wise mean of sample mean of significant correlation coefficients (τb) from Table 13

Since the sample mean (i.e., from \(S\tau _{b_{all}}\)) reported in Table 13 is a more conservative measure than the mean of \(S\tau _{b_{sig}}\) in Table 21, we expect equal or higher τb values in Table 21. From the data, we see very small or no differences for most of the mentioned metrics because section CC has few non-significant and missing τb values.

Key Observations:

  • Metrics complexity, complexity_in_classes, and complexity_in_functions measure exactly the same aspect.

  • Metrics ncloc, functions, statements, complexity, classes, files, and public_api, together with complexity_in_classes and complexity_in_functions mentioned above, are all redundant.

  • Metrics public_api and public_undocumented_api are redundant.

  • Not a single correlation out of 105 total correlations within the cumulative metrics has a weak correlation coefficient.

Sections DD, AA, OO

As Table 14 shows, sections DD, AA, and OO have, on average, weak to moderate correlations compared to the strong correlations in section CC. The correlations among all three metrics (comment_lines_density, public_documented_api_density, and duplicated_lines_density) from the density category in section DD have weak τb. For section AA, the correlation between file_complexity and class_complexity from the average category has a strong τb of 0.71; the other two correlations (involving function_complexity) in this section have moderate τb. In section OO, the correlation between new_duplicated_lines and new_duplicated_blocks is 0.94; based on our 0.9 threshold, these two metrics are redundant. The correlations of these two metrics with new_lines are weak.

All four sections related to intra-category correlations have fewer non-significant τb values (in Table 12) compared to the other sections, and section OO has no non-significant τb at all. Section OO also has the lowest category-wise mean of standard deviations of τb, as reported in Table 15. The individual standard deviations in Table 18 show that the redundant metrics within section CC have lower standard deviations than the other metrics from the same category. Density metrics in section DD have the highest standard deviation, and we have to look into the 'count of τb' (Tables 23, 24, 25, and 26) and 'sum of τb' (Tables 27, 28, 29, and 30) tables to fully understand how the τb values are distributed in this category.

Table 15 Category-wise mean of standard deviations of significant correlation coefficients (τb) from Table 18

Intra-category correlations among cumulative metrics are much higher than those among density, average, and organic metrics. Even though the cumulative, average, and organic categories are all at the strong τb level, for cumulative the category-wise mean of the sample mean of τb is 0.79, i.e., almost at the very strong level, whereas the average and organic categories score 0.54 and 0.55, respectively. In addition, while we do not see a single weak correlation in section CC, more than half (5 out of 9) of the correlations in sections DD, AA, and OO have weak τb.

Key Observations:

  • Metrics new_duplicated_lines and new_duplicated_blocks (in section OO) are redundant.

  • All correlations among density metrics are weak.

  • Intra-category correlations for density, average, and organic metrics result in lower τb compared to cumulative metrics.

5.3.2 Inter-Category Correlations

Inter-category correlation involves metrics that are measured differently. We want to see whether there is any noticeable difference between inter-category and intra-category correlations.

Inter-category correlations are found in the six sections CD, CA, DA, CO, DO, and AO. These sections cover all possible correlations, a total of 162, between metrics from different categories. All these correlations result in weak τb except three that result in moderate τb (values 0.56, 0.54, and 0.52), as shown in Table 13. Looking at the category-wise mean in Table 14, we see that sections CO, DO, and AO, i.e., the inter-category correlations of organic metrics with all other categories, are the lowest. Interestingly, the standard deviations are also lower for these three sections related to the organic category (Tables 18 and 15), meaning that when organic metrics are correlated with metrics from other categories, the variability of τb is low. When we look at the count of τb (Tables 23, 24, 25, and 26) and the sum of τb (Tables 27, 28, 29, and 30), we see that sections CO, DO, and AO have values only in Tables 26 and 30, which report weak τb; in the remaining tables these three sections have zero. For the other three sections (CD, CA, and DA), the 'count of τb' and 'sum of τb' measures are present at all four levels and are most concentrated at the moderate level.

When we look at the difference between intra- and inter-category correlations of metrics, it is clear that they differ. The grand mean of the category-wise means (see Table 14) of the intra-category correlations is 0.49 (from (0.79 + 0.09 + 0.54 + 0.55)/4), whereas for the inter-category correlations it is 0.06 (from (0.08 + 0.24 + 0.01 + 0.03 − 0.01 + 0)/6).

The observation that correlating metrics from different categories results in overall weak correlations is important because it tells us that metrics from different categories have low collinearity. Thus, software code metrics from different categories can be used together as features in models for prediction, forecasting, etc.

Key Observations:

  • Inter-category correlations of metrics differ from intra-category correlations by resulting in much lower correlation coefficients.

  • Correlating metrics from different categories results in overall weak correlation coefficients; thus, code metrics from different categories are observed to be non-collinear.

Overall, we have observed that cumulative metrics are different because the intra-category correlations of cumulative metrics result in higher τb values compared to correlations of metrics within other categories.

5.4 Cumulative vs. Organic Metrics

The progression of software development can be tracked in different ways. Traditionally, we track software through cumulative metrics. Cumulative metrics are intuitive in the sense that they give us an overall idea of the state of the system. The same thing, i.e., tracking the progression of software, can also be done the organic way. Version control systems like Git keep track of revisions by saving the delta of each revision. Taking the deltas of consecutive software revisions, we can still calculate the total by linear addition. Hence, given either a cumulative or an organic measure, we can calculate one from the other.
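A minimal numerical sketch of this equivalence, with made-up revision totals rather than data from the studied projects, shows that the two representations are interchangeable:

```python
import numpy as np

cumulative = np.array([120, 155, 150, 190])        # hypothetical totals per revision
organic = np.diff(cumulative, prepend=0)           # per-revision deltas: [120, 35, -5, 40]
assert np.array_equal(np.cumsum(organic), cumulative)  # deltas sum back to the cumulative series
```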

The question now is whether the higher τb values among cumulative metrics occur because of the cumulative way of measurement or simply because the cumulative metrics are inherently more correlated with each other. To test this, we transform all the cumulative metrics in this study into organic metrics (which we denote as organict), then compute correlations and check whether there is a significant difference. Note that, in Table 2, the three organic metrics are different from the 15 cumulative metrics; thus, we need to create a new set of organic metrics equivalent to the cumulative metrics. Since we have identified a few perfectly correlated cumulative metrics, they can act as our point of validation: perfectly correlated metrics should remain perfectly correlated no matter how they are measured. To check the difference between the cumulative and organict categories, we consider the following hypothesis:

  • Null hypothesis H0: The median difference between correlations of metrics from the cumulative and organict categories equals zero.

5.4.1 Transformed Organic Metrics

We have transformed all 15 cumulative metrics into organict metrics by taking the difference between consecutive revisions for each metric. We name these new metrics by adding an underscore before the cumulative counterpart from which they are transformed, e.g., ncloc is transformed into _ncloc. It should be noted that the existing three metrics from the organic category never take negative values: if ncloc decreases, new_lines holds a zero value because there are no new lines. Organict metrics, however, can hold negative values, reflecting a reduction of a metric's measure between consecutive revisions.
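A possible implementation of this transformation, assuming a project's revision history is stored as a pandas DataFrame with one row per revision and one column per cumulative metric (the metric list below is abbreviated), is sketched here:

```python
import pandas as pd

CUMULATIVE_METRICS = ["ncloc", "functions", "statements", "complexity"]  # ... and the rest of the 15

def to_organic_t(history: pd.DataFrame) -> pd.DataFrame:
    """Transform cumulative metrics of one project's revision history into
    organic_t metrics: the difference between consecutive revisions.
    The first revision keeps its full value; negative deltas are allowed."""
    deltas = history[CUMULATIVE_METRICS].diff()
    deltas.iloc[0] = history[CUMULATIVE_METRICS].iloc[0]   # no previous revision to diff against
    return deltas.add_prefix("_")                          # e.g., ncloc -> _ncloc
```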

5.4.2 Designing and Executing the Test

After computing the organict metrics, we went through the same procedure as for the other metrics in this study, i.e., checking the nature of the data. We found organict metrics to be similar to organic metrics in terms of kurtosis and short tails. For skewness, however, organict metrics are not as extremely left skewed as organic metrics because organict metrics can take negative values. Overall, organict metrics are still non-normal.

We computed Kendall's τ for all 39 metrics and derived tables similar to those presented earlier in Section 5. The sample mean of the significant correlation coefficients is presented in full in Table 16. In the interest of space, however, we only report the newly derived table sections for the standard deviation, the count of non-significant τb, and the count of perfect correlations in Tables 31 and 32.

Table 16 Sample mean values for significant correlation coefficients (τb) of all metrics including transformed organic metrics. Three gray-scale cell colors indicate three levels of τb. Cells with red text indicate weak τb. A dot (.) in a cell indicates zero value

5.4.3 Results from the Test

In Tables 16, 31, and 32, we see that organict metrics are similar to organic metrics in terms of both intra- and inter-category correlations.

Now we would like to see whether organict metrics are able to reproduce the perfect correlations produced by the cumulative metrics. Following the same referencing style as in Table 11, we refer to the section containing τb from intra-category correlations of organict metrics as OtOt. The intra-category correlations of cumulative metrics (section CC of Table 20) have a total of 46 perfect correlations across six relations. For the organict metrics, we see (in section OtOt of Table 32) that 24 of these perfect correlations, across four relations, are reproduced exactly. However, for two relations ((complexity, complexity_in_functions) and (complexity_in_classes, complexity_in_functions)), only 12 out of 22 perfect correlations are reproduced. We therefore look at the sample means of these two relations (section OtOt of Table 16) and find a value of 0.994 for both; this is so close to the maximum possible τb of 1 that we can consider it a perfect correlation. Thus, both cumulative and organict metrics are equally able to detect perfect correlations among metrics if any exist.

At this point, we focus on testing the null hypothesis. The mean of section OtOt of Table 16 is 0.49, which is much lower than the value of 0.79 for the cumulative metrics (section CC of Table 14). However, without performing a statistical test, we cannot accept or reject our null hypothesis.

Earlier our data was the code metrics themselves, but now it is the sample means of Kendall's τ. So we need to check the distributions of the data to be tested, i.e., sections CC and OtOt of Table 16. We checked descriptive statistics, skewness, kurtosis, and histograms, and also performed the Shapiro-Wilk test on these two data sets, and found the data non-normal. Thus, we have to choose non-parametric tests.
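The normality checks can be reproduced along the following lines; this is a sketch using SciPy's shapiro, skew, and kurtosis functions and not the exact procedure or tool used to produce our figures.

```python
import numpy as np
from scipy.stats import shapiro, skew, kurtosis

def normality_report(values, alpha=0.05):
    """Descriptive checks used before choosing the tests: skewness,
    excess kurtosis, and the Shapiro-Wilk p-value for one data set."""
    values = np.asarray(values, dtype=float)
    stat, p = shapiro(values)
    return {
        "skewness": float(skew(values)),
        "excess_kurtosis": float(kurtosis(values)),   # Fisher definition: normal -> 0
        "shapiro_p": float(p),
        "looks_normal": p >= alpha,
    }
```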

We can consider the data as paired because a single τb value, say for ncloc in section CC, and the corresponding τb value from section OtOt (i.e., for _ncloc) measure a similar aspect, only measured differently. It can also be argued that ncloc and _ncloc are two different measures and therefore cannot be considered paired. Taking both arguments into account, we execute tests of both kinds to see whether there is any significant difference. Taking our two data sets as independent samples, we perform the 'Mann-Whitney U Test', and taking them as dependent samples, we perform the 'Wilcoxon Signed Ranks Test'. After checking the assumptions of these tests, we remove two τb values from section CC and the two corresponding τb values from section OtOt: since the correlations between the metrics complexity, complexity_in_classes, and complexity_in_functions in section CC and the corresponding organict measures in section OtOt are identified as perfectly correlated in both sections, we keep only one of these relations and remove the other two so that the assumption related to data dependency is no longer violated. The 'Wilcoxon Signed Ranks Test' assumes that the differences between paired data are approximately symmetrically distributed along the quartiles when plotted as a one-dimensional box plot; this assumption was not met for our data. Since this assumption is checked visually, we decided to also include a 'Paired-Samples Sign Test', which does not have such an assumption. We report the results from all three tests here.
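For completeness, the three tests can be run, for example, with SciPy as sketched below; the study's own statistics were produced with a separate statistics package, so the exact options and reported statistics may differ. The sign test is expressed here as a binomial test on the signs of the paired differences.

```python
import numpy as np
from scipy.stats import mannwhitneyu, wilcoxon, binomtest

def compare_cc_vs_otot(tau_cc, tau_otot):
    """tau_cc and tau_otot: paired arrays of sample-mean tau_b values
    (one pair per metric relation) from sections CC and OtOt."""
    tau_cc, tau_otot = np.asarray(tau_cc), np.asarray(tau_otot)

    # Treating the two sets as independent samples
    u_stat, u_p = mannwhitneyu(tau_cc, tau_otot, alternative="two-sided")

    # Treating them as paired (dependent) samples
    w_stat, w_p = wilcoxon(tau_cc, tau_otot)

    # Paired-samples sign test: binomial test on the signs of the differences
    diff = tau_cc - tau_otot
    n_pos, n_neg = int((diff > 0).sum()), int((diff < 0).sum())
    sign_p = binomtest(n_pos, n_pos + n_neg, p=0.5).pvalue

    return {"mann_whitney_p": u_p, "wilcoxon_p": w_p, "sign_test_p": sign_p}
```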

We have a total of 103 pairs. From the test ranks, we see that for both the 'Wilcoxon Signed Ranks Test' and the 'Paired-Samples Sign Test', the count of negative differences is 0, the count of positive differences is 102, and the count of ties is one, taking CC as the first and OtOt as the second variable. Even without looking at the significance levels, these numbers already tell us that the organict category is lower than the cumulative category. From Figs. 9 and 10, we see that all three tests show a p-value less than 0.01 (i.e., considering α = 0.01). Thus, we can reject the null hypothesis and accept the alternative hypothesis, i.e., the median difference between correlations of metrics from the cumulative and organict categories is not equal to zero.

Fig. 9
figure 9

Statistical test considering cumulative and organict as two independent samples. Sub-figure 9a shows test statistics for ‘Mann-Whitney U Test’ and 9b graphically shows distributions of τb values for metrics within cumulative and organict categories

Fig. 10
figure 10

Statistics for different statistical tests to determine the existence of significant differences between correlations from cumulative and organict

5.4.4 Implications of the Test Results

The finding that there exists a significant median difference between the correlations of metrics from the cumulative and organic categories implies that organic metrics are a set of measures that are collectively different from the cumulative metrics. Therefore, organic metrics can be considered a new set of features holding different characteristics than cumulative metrics as a whole.

The intra-category correlations of cumulative metrics are much higher than those of their equivalent set of transformed organic metrics. Since the correlations between cumulative metrics are high, there is high collinearity among these metrics. This makes cumulative metrics collinear, and only a single metric from a group of highly correlated metrics can be considered a valid input feature for a predictive model. The high collinearity among software code metrics is not new information. However, the knowledge that transforming cumulative metrics into organic metrics can significantly reduce the collinearity is new. Since this transformation does not alter the original footprint of how the software evolved, it is expected to be free of the side effects of normalization. Since organic metrics have lower collinearity, it is possible to obtain multiple valid input features from them.

The inter-category correlations of metrics from different categories are observed to be consistently low. This information can improve feature engineering by making the process more systematic. First, it gives a simple and intuitive understanding of non-collinearity based on measurement types. Second, it implies that transforming a feature into a different measurement type can produce a feature that is non-collinear with the original feature.

5.5 Discussion

Software engineering researchers have observed high collinearity among software code metrics. However, we did not have explicit knowledge of whether the high collinearity is due to the inherent nature of the code metrics or due to how we measure them. This study compares the correlation coefficients of a set of 15 cumulative metrics with the correlation coefficients of their corresponding organic metrics and reveals that a large portion of the collinearity among cumulative metrics results from the cumulative way of measurement. Since organic metrics are free from the effects of cumulation and represent the natural evolution of software, we argue that the correlation coefficients among the organic metrics represent the inherent collinearity among code metrics. Taking the difference between the intra-category correlation coefficients of metrics from these two categories, we can determine the added collinearity due to cumulative measurement.

High collinearity among a group of input features makes them weaker as a whole: we can get only a few valid input features from such a set because, during the validation process, features with high collinearity are removed. Since organic metrics have lower collinearity among themselves, we can possibly get more valid input features from them. Moreover, since the collinearity between cumulative and organic metrics is very low, we can combine them to obtain even more valid input features. The lower collinearity between cumulative and organic metrics also means that, even though we can calculate a set of organic metrics corresponding to a set of cumulative metrics, the two sets are mutually exclusive as features. Therefore, theoretically, we do not have to restrict our choice of metrics to either of these two categories.

It is interesting that, when we add the density and average categories to the scenario described above for cumulative and organic, non-collinearity still holds between metrics from different measurement categories. It can be noted that the unit of measurement of the density metrics is a percentage, whereas cumulative and organic metrics share the same value type (i.e., integer) yet are still measured differently.

The findings that organic metrics are collectively different from their corresponding cumulative metrics (i.e., inter-category), the understanding of the effect of cumulation on the collinearity among metrics (i.e., intra-category), and the general observation that metrics from different measurement categories yield overall weak correlation coefficients are significant for researchers and practitioners in software engineering because they help us better understand the nature of code metrics. The findings open the possibility of new research and of revisiting existing research in this field. Based on our results, we theoretically have more valid input features from different measurement categories; in practice, however, research needs to be conducted to determine which predictors (i.e., input metrics) are good for which targets (i.e., qualities that we want to predict or estimate). It could be the case that specific metrics from a category work better in combination with other metrics to predict a particular quality attribute. Researchers can also try to find new measurement categories and their properties. This study has investigated some code metrics; other code metrics not covered here can also be studied.

These results can help practitioners by giving them more insight into software metrics. Quality managers can re-prioritize the metrics that a project keeps track of. Tool developers can rethink the metrics their tools support; for example, SonarQube automatically removes new_duplicated_lines, new_lines, new_duplicated_blocks, and some other similar organic metrics, and based on the findings of this study, tool developers could give users the option to keep track of such metrics. Software engineers working with predictive analysis have more options when choosing input features for their models. For example, suppose that for a predictive model, complexity (of type cumulative) is identified as an input feature. The possibility of incorporating complexity-related metrics from different measurement categories can then be explored, and complexity per file (of type average) and complexity per commit or per specific time duration (of type organic) can be considered as additional input features.

5.6 Threats to Validity and Limitations

5.6.1 Internal Validity

Software today is built with various languages, and it is very common for a project to use code from multiple languages. Still, a project is usually designated by a specific language indicating the major portion of its code. Programming languages use different constructs, and measures of code metrics may vary dramatically due to language differences. To eliminate this effect, we chose to focus on a single programming language, Java. It can be noted that Java projects are among the top three ranked languages on GitHub.

We considered some other factors when selecting projects, such as project type, project size in LOC, number of revisions, and number of developers. While selecting the projects, we tried to combine these factors so that we have a good representation with regard to each of them. In addition, we looked at the number of issues and pull requests and tried to select projects with reported issues and pull requests, because these are signs of active involvement of users and developers.

Tools measuring software code metrics are not perfect. Different tools implement measures differently even when they claim to measure the same aspect of code (Lincke et al. 2008). This study selected one of the most widely used measurement tools for software quality, and we observed very strong correlations between the cumulative metrics, in line with the major studies. However, selecting a popular tool and observing results similar to the major studies do not entirely remove this threat. This is a general issue for any study of this kind, and we are aware of it.

We mentioned earlier that we only checked the revisions from the master branch of a Git repository. This affects the granularity of the collected data. Based on the findings of this study, we know that reducing the cumulative effect reduces the correlation between metrics; thus, avoiding the partial cumulation of data due to Git branches could possibly make the results of this study even stronger. Therefore, we do not consider this a major threat to our results.

5.6.2 External Validity

There are millions of software projects hosted on GitHub. Generalizing results to such a large population is a key validity threat for any study, and we also identified generalizability as a considerable threat to validity. Project selection is one of the most important means of minimizing this threat. We tried to carefully select projects from well-known organizations to mitigate this risk; for example, we included projects from Apache, a pioneering organization in the open source community, from Microsoft, and from other organizations with diverse portfolios. On the positive side, we found highly significant results from the different tests. For example, the Mann-Whitney U Test in Fig. 9a, the Wilcoxon Signed Ranks Test in Fig. 10a, and the Paired-Samples Sign Test in Fig. 10b have significance values of 4.4e−18, 1.8e−18, and 1.5e−23, respectively.

To safeguard internal validity, we chose to restrict our focus to Java source code. This choice, however, affects external validity: do the results hold for source code written in other programming languages? Since the measurement types discussed in this study are independent of programming language, we do not consider this threat highly significant. However, more research is needed to be certain.

6 Conclusions

This empirical study investigates whether the measurement types of software code metrics have an effect on their correlations. By collecting and analyzing 24 code metrics from 11,874 revisions of 21 open source Java projects, we have found that measurement types do have an effect on correlations. Our analysis shows that 10 out of the 15 metrics that are measured cumulatively are redundant based on our criterion that two metrics with a correlation coefficient of 0.9 or above are redundant. When the cumulative effect is removed by transforming these 15 metrics into the organic type, only three of them are identified as redundant. These three metrics are identified as perfectly correlated (i.e., they measure exactly the same aspect) in both categories, implying that if metrics are truly correlated, correlations of their organic measures are able to identify it. In addition, our analysis shows that organic metrics result in significantly lower correlation coefficients than cumulative metrics for intra-category correlations. Thus, while some software metrics are so closely related that their high correlation coefficients make them redundant, many of the high correlation coefficient values are due to measuring these metrics cumulatively. In other words, we should not be surprised to see high correlation coefficients for cumulative metrics, and we should be aware that measuring metrics by their natural development, i.e., organically, results in much lower correlation coefficients.

Another interesting finding is that correlations between metrics from different categories yield overall weak correlation coefficients. This finding is important because metrics from the cumulative, density, average, and organic categories can be combined as features for predictive models. From another viewpoint, this could improve the process of feature engineering by showing that transforming a feature into a different measurement type produces a new feature that is non-collinear with the original feature.

We have discussed why Kendall's τ version B fits software projects better. We have also discussed the landscape of correlation coefficients, outlining the possible sets when considering both the correlation coefficient and the p-value, which can be helpful for this type of study.

This study has attempted to reveal the fundamental relationships between measurement types and correlations of software code metrics. More evidence is required to generalize and extend this knowledge. Thus, replicated studies can be conducted considering various metrics, measurement tools, programming languages, and project types.